replicate / cog

Containers for machine learning
https://cog.run
Apache License 2.0

RuntimeError: Found no NVIDIA driver on your system #347

Closed: zeke closed this issue 2 years ago

zeke commented 2 years ago

Not sure if this is a bug in Cog, or if Cog users are expected to set up NVIDIA drivers manually.

I'm getting this error trying to run a prediction on the cjwbw/rudalle-sr model using my Google Cloud GPU instance:

z@zeke-dev-gpu:~$ cog --debug predict "r8.im/cjwbw/rudalle-sr@sha256:cf62c87dde3b7a9f0999519f291d7d4f84e5d1883cfa0c986ae79d8a92247966" -i image=@1.png -i scale=4

Starting Docker image r8.im/cjwbw/rudalle-sr@sha256:cf62c87dde3b7a9f0999519f291d7d4f84e5d1883cfa0c986ae79d8a92247966 and running setup()...
$ docker run --rm --shm-size 8G --detach --publish 5081:5000 r8.im/cjwbw/rudalle-sr@sha256:cf62c87dde3b7a9f0999519f291d7d4f84e5d1883cfa0c986ae79d8a92247966
Traceback (most recent call last):
  File "/root/.pyenv/versions/3.8.12/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/.pyenv/versions/3.8.12/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/.pyenv/versions/3.8.12/lib/python3.8/site-packages/cog/server/http.py", line 106, in <module>
    server.start_server()
  File "/root/.pyenv/versions/3.8.12/lib/python3.8/site-packages/cog/server/http.py", line 74, in start_server
    app = self.make_app()
  File "/root/.pyenv/versions/3.8.12/lib/python3.8/site-packages/cog/server/http.py", line 22, in make_app
    self.predictor.setup()
  File "/src/predict.py", line 19, in setup
    model.load_weights(f'models/RealESRGAN_x{scale}.pth')
  File "/root/.pyenv/versions/3.8.12/lib/python3.8/site-packages/rudalle/realesrgan/model.py", line 27, in load_weights
    self.model.to(self.device)
  File "/root/.pyenv/versions/3.8.12/lib/python3.8/site-packages/torch/nn/modules/module.py", line 612, in to
    return self._apply(convert)
  File "/root/.pyenv/versions/3.8.12/lib/python3.8/site-packages/torch/nn/modules/module.py", line 359, in _apply
    module._apply(fn)
  File "/root/.pyenv/versions/3.8.12/lib/python3.8/site-packages/torch/nn/modules/module.py", line 381, in _apply
    param_applied = fn(param)
  File "/root/.pyenv/versions/3.8.12/lib/python3.8/site-packages/torch/nn/modules/module.py", line 610, in convert
    return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
  File "/root/.pyenv/versions/3.8.12/lib/python3.8/site-packages/torch/cuda/__init__.py", line 172, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
ⅹ Failed to get container status: exit status 1

My environment:

z@zeke-dev-gpu:~$ uname -a
Linux zeke-dev-gpu 4.19.0-18-cloud-amd64 #1 SMP Debian 4.19.208-1 (2021-09-29) x86_64 GNU/Linux
z@zeke-dev-gpu:~$ cog --version
cog version 0.0.13 (built 2021-09-28T15:38:06Z)

zeke commented 2 years ago

Ooh, I think this may have been fixed in a recent version of Cog...

zeke commented 2 years ago

Yep. Upgraded to cog@0.0.18 and all is well!
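For context: the docker run command Cog printed in the log above starts the container without a --gpus flag. Docker only exposes the host's NVIDIA GPUs to a container when it is started with --gpus (which also requires the NVIDIA Container Toolkit on the host), so that is a plausible reason torch could not find a driver. The thread does not confirm exactly what changed between 0.0.13 and 0.0.18. The following is a minimal sketch, with a hypothetical helper name and not Cog's actual implementation, that assembles the same command with GPU access requested.

```python
import shlex
from typing import List

# Image reference copied from the log in this issue.
IMAGE = "r8.im/cjwbw/rudalle-sr@sha256:cf62c87dde3b7a9f0999519f291d7d4f84e5d1883cfa0c986ae79d8a92247966"


def docker_run_args(image: str, host_port: int, use_gpus: bool = True) -> List[str]:
    """Hypothetical helper (not Cog's code): build a docker run argv that mirrors
    the command Cog printed, optionally requesting GPU access."""
    args = [
        "docker", "run", "--rm",
        "--shm-size", "8G",
        "--detach",
        "--publish", f"{host_port}:5000",  # container port 5000, as in the logged command
    ]
    if use_gpus:
        # Ask Docker to expose all host GPUs to the container; this needs the
        # NVIDIA Container Toolkit installed on the host.
        args += ["--gpus", "all"]
    args.append(image)  # docker run options must come before the image reference
    return args


if __name__ == "__main__":
    # Print the command as a single shell-quoted string.
    print(shlex.join(docker_run_args(IMAGE, 5081)))
```

Passing use_gpus=False reproduces the command from the log, which is handy for comparing behavior on a GPU host.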


andreemic commented 1 year ago
[Screenshot: 2023-06-22 at 21:47:47]

Getting the same error...

andreemic commented 1 year ago
[Screenshot: 2023-06-22 at 21:50:24]

If I try cog run, I get this:

andreemic commented 1 year ago
[Screenshot: 2023-06-22 at 21:50:54]

The GPU works fine in PyTorch.
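Since the GPU works in PyTorch on the host, one way to narrow this down is to run the same check on the host and inside the Cog container and compare the results. This is a minimal sketch assuming a Linux host; the function name gpu_diagnostics is hypothetical. The /proc/driver/nvidia/version entry is created by the NVIDIA kernel driver and is typically only visible inside a container when GPUs were passed through to it.

```python
import importlib.util
from pathlib import Path
from typing import Dict


def gpu_diagnostics() -> Dict[str, bool]:
    """Hypothetical helper: report whether the NVIDIA driver and CUDA are
    visible from the current environment (host or container)."""
    report = {
        # Present on Linux when the NVIDIA kernel driver is loaded; inside a
        # container it typically appears only if GPUs were passed through.
        "driver_proc_entry": Path("/proc/driver/nvidia/version").exists(),
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "cuda_available": False,
    }
    if report["torch_installed"]:
        import torch

        # Returns False (rather than raising) when no driver or device is found.
        report["cuda_available"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    for key, value in gpu_diagnostics().items():
        print(f"{key}: {value}")
```

If cuda_available is True on the host but False in the container, the GPU is most likely not being passed through to the container rather than missing from the machine.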

mattt commented 1 year ago

@andreemic What version of Cog are you using? If this isn't working for you, try updating to the latest release (v0.8.0-beta6).
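To check whether an installed Cog is at least the version reported to work in this thread, something like the following could parse the cog --version output. It assumes the "cog version X.Y.Z (...)" format shown earlier in the thread and ignores pre-release suffixes such as -beta6. The 0.0.18 threshold is simply the version zeke upgraded to, not a documented minimum, and the helper names are hypothetical.

```python
import re
import shutil
import subprocess
from typing import Optional, Tuple

# Version reported as working in this thread; not an official minimum.
KNOWN_GOOD = (0, 0, 18)


def parse_cog_version(output: str) -> Optional[Tuple[int, ...]]:
    """Extract (major, minor, patch) from cog --version output, or None."""
    match = re.search(r"cog version v?(\d+)\.(\d+)\.(\d+)", output)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def is_at_least(output: str, minimum: Tuple[int, ...] = KNOWN_GOOD) -> bool:
    """True if the parsed version is >= minimum (tuples compare element-wise)."""
    version = parse_cog_version(output)
    return version is not None and version >= minimum


if __name__ == "__main__":
    # Only query the CLI if cog is actually on PATH.
    if shutil.which("cog"):
        out = subprocess.run(["cog", "--version"], capture_output=True, text=True).stdout
        print(out.strip(), "->", "ok" if is_at_least(out) else "consider upgrading")
```

For example, the 0.0.13 output from the original report parses to (0, 0, 13), which is below (0, 0, 18), so is_at_least returns False for it.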