replicate / cog

Containers for machine learning
https://cog.run
Apache License 2.0

Large models time out on download #1459

Open (Vochsel opened 6 months ago)

Vochsel commented 6 months ago

When running cog predict on large models (SDXL, for example), users with slow internet connections, or who are far away from the weight storage (Australia seems to be quite far from r8.im storage), experience timeouts.

Example command: cog predict r8.im/stability-ai/sdxl@sha256:39ed52f2a78e934b3ba6e2a89f5b1c712de7dfea535525255b1aa35c5565e08b -i prompt="A bunny" --debug

Output:

Checking for updates...
$ docker image inspect r8.im/stability-ai/sdxl@sha256:39ed52f2a78e934b3ba6e2a89f5b1c712de7dfea535525255b1aa35c5565e08b

Starting Docker image r8.im/stability-ai/sdxl@sha256:39ed52f2a78e934b3ba6e2a89f5b1c712de7dfea535525255b1aa35c5565e08b and running setup()...
$ docker run --rm --shm-size 8G --detach --env COG_LOG_LEVEL=debug --gpus all --publish 0:5000 r8.im/stability-ai/sdxl@sha256:39ed52f2a78e934b3ba6e2a89f5b1c712de7dfea535525255b1aa35c5565e08b
result of update check:
{"logger": "torch.distributed.nn.jit.instantiator", "timestamp": "2023-12-31T00:12:18.880798Z", "severity": "INFO", "message": "Created a temporary directory at /tmp/tmp6ian2rtl"}
{"logger": "torch.distributed.nn.jit.instantiator", "timestamp": "2023-12-31T00:12:18.881083Z", "severity": "INFO", "message": "Writing /tmp/tmp6ian2rtl/_remote_module_non_scriptable.py"}
{"logger": "uvicorn.error", "timestamp": "2023-12-31T00:12:19.798613Z", "severity": "INFO", "message": "Started server process [8]"}
{"logger": "uvicorn.error", "timestamp": "2023-12-31T00:12:19.798739Z", "severity": "INFO", "message": "Waiting for application startup."}
{"logger": "uvicorn.error", "timestamp": "2023-12-31T00:12:19.801408Z", "severity": "INFO", "message": "Application startup complete."}
{"logger": "uvicorn.error", "timestamp": "2023-12-31T00:12:19.801978Z", "severity": "INFO", "message": "Uvicorn running on http://0.0.0.0:5000 (Press CTRL+C to quit)"}
Loading safety checker...
downloading url:  https://weights.replicate.delivery/default/sdxl/safety-1.0.tar
downloading to:  ./safety-cache
ⅹ Timed out

Is there any way to increase this timeout?

Thanks!
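
One possible workaround, sketched under assumptions rather than confirmed by anything in this thread: if the "Timed out" message comes from the CLI giving up while setup() is still downloading weights, the container could be started by hand with the docker run command from the log above (publishing a fixed host port, e.g. --publish 5000:5000, instead of 0:5000) and polled with a timeout of your own choosing. The sketch assumes the container's HTTP server exposes GET /health-check returning JSON with a "status" field that becomes "READY" when setup() finishes. The helper names fetch_status and wait_until_ready are hypothetical and not part of cog.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_status(base_url: str) -> str:
    """Return the "status" field from GET <base_url>/health-check.

    Assumption: the server answers with a JSON object containing a
    "status" key (e.g. "STARTING", "READY", "SETUP_FAILED").
    """
    with urllib.request.urlopen(f"{base_url}/health-check", timeout=10) as resp:
        return json.load(resp).get("status", "UNKNOWN")


def wait_until_ready(
    get_status: Callable[[], str],
    timeout_s: float = 3600.0,
    poll_interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll get_status() until it returns "READY" or timeout_s elapses.

    Returns True when ready, False on timeout. Raises RuntimeError if the
    status is "SETUP_FAILED", since further waiting would not help.
    """
    deadline = clock() + timeout_s
    while clock() < deadline:
        try:
            status = get_status()
        except OSError:
            # Connection refused/reset while the server is still starting.
            status = "UNREACHABLE"
        if status == "READY":
            return True
        if status == "SETUP_FAILED":
            raise RuntimeError("setup() failed inside the container")
        sleep(poll_interval_s)
    return False
```

For example, wait_until_ready(lambda: fetch_status("http://localhost:5000"), timeout_s=7200) would wait up to two hours for the weights to finish downloading before any prediction request is sent to the container.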

Vochsel commented 3 months ago

This is still an issue in the latest version of cog. It makes cog impossible to use locally.

idootop commented 1 month ago

Same issue.