replicate / cog

Containers for machine learning
https://cog.run
Apache License 2.0

Getting started docs on WSL2 - does not work on RTX 3070 #692

Open ktos opened 2 years ago

ktos commented 2 years ago

I found cog today, and it seemed like a perfect way to run some models to play with, without the CUDA/Python/torch/pip/conda/virtualenv version swamp.

I followed the steps in https://github.com/replicate/cog/blob/main/docs/wsl2/wsl2.md (I had WSL2 installed earlier, but removed the distro and installed a fresh one).

It says that "RTX 2000/3000 series, Kepler/Tesla/Volta/Ampere series" are supported. I have an RTX 3070, which I believe is Ampere.

Yet running `cog predict r8.im/afiaka87/glid-3-xl -i prompt="a fresh avocado floating in the water" -o prediction.json` returns a big stacktrace saying:

```
NVIDIA GeForce RTX 3070 with CUDA capability sm_86 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the NVIDIA GeForce RTX 3070 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
.......
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
ⅹ /predictions call returned status 500
```

I believed the point of cog was to find a matching version of PyTorch for a particular CUDA version, but is that not working here?

Or is the RTX 3070 actually not supported?
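The warning above can be boiled down to a small sketch (illustrative only; this is not PyTorch's actual check, and the helper name is made up): a wheel ships compiled kernel images for a fixed list of compute capabilities, and the GPU's capability has to be in that list.

```python
# Illustration of the compute-capability mismatch from the warning above.
# The arch list and capability values come straight from the error message;
# wheel_supports() is a made-up name for this sketch.

def wheel_supports(gpu_capability, arch_list):
    """Return True if the wheel's kernel images cover the GPU's capability."""
    major, minor = gpu_capability
    return f"sm_{major}{minor}" in arch_list

# What the stock wheel in the image was built for (from the warning):
arch_list = ["sm_37", "sm_50", "sm_60", "sm_70"]

print(wheel_supports((7, 0), arch_list))  # Tesla V100 (Volta): True
print(wheel_supports((8, 6), arch_list))  # RTX 3070 (Ampere): False
```

An sm_86 card simply has no matching kernel image in a wheel built only up to sm_70, which is exactly the "no kernel image is available" error.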

bfirsh commented 2 years ago

@afiaka87 any ideas?

afiaka87 commented 2 years ago

> @afiaka87 any ideas?

Hm, that's not good. This must be from the pinned PyTorch version: the T4s I test on actually support a newer version of CUDA and PyTorch than the Ampere series.

In `cog.yaml`, check the `torch` and `torchvision` versions.

For Ampere, I think it needs to be:

```yaml
build:
  cuda: "11.1"
  python_packages:
    # ...
    - torch==1.10.1
    - torchvision==0.11.2
```

I'm away from my computer but I'll look into uploading a (hopefully) working version.

Have you tried any other models in the meantime @ktos ?
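In case it helps with diagnosing this: you can ask the installed wheel what it was built for. `torch.cuda.get_arch_list()` and `torch.cuda.get_device_capability()` are real torch APIs; the wrapper function here is just a sketch so it degrades gracefully when run outside the container.

```python
def describe_torch_cuda():
    """Report which CUDA architectures the installed torch wheel was built for."""
    try:
        import torch
    except ImportError:
        return "torch is not installed"
    if not torch.cuda.is_available():
        return "torch is installed but CUDA is not available"
    archs = torch.cuda.get_arch_list()                  # e.g. ['sm_37', 'sm_50', 'sm_60', 'sm_70']
    major, minor = torch.cuda.get_device_capability(0)  # e.g. (8, 6) for an RTX 3070
    return f"wheel archs: {archs}; device capability: sm_{major}{minor}"

print(describe_torch_cuda())
```

If the device's `sm_XY` is missing from the wheel's arch list, you get exactly the "no kernel image is available" failure above.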

ktos commented 2 years ago

The same problem occurs with laion-ai/erlich (which I tried first, because I was most interested in it).

```
ktos@DESKTOP-KFNTTB2:/mnt/c/Users/admin$ cog predict r8.im/laion-ai/erlich@sha256:a51ce279c0131991c5a143a9c6a3ec6de146e765d9311cff7435b1db1190faaa -i prompt=test

Starting Docker image r8.im/laion-ai/erlich@sha256:a51ce279c0131991c5a143a9c6a3ec6de146e765d9311cff7435b1db1190faaa and running setup()...
Loading latent diffusion model from erlich_fp16.pt
/root/.pyenv/versions/3.8.13/lib/python3.8/site-packages/torch/cuda/__init__.py:146: UserWarning:
NVIDIA GeForce RTX 3070 with CUDA capability sm_86 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the NVIDIA GeForce RTX 3070 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

  warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
Loading VAE from kl-f8.pt
Loading CLIP text encoder from textual.onnx
[CLIP ONNX] Load mode
Loading BERT text encoder from bert.pt
Running prediction...
Using seed 3165870565
Running simulation for test
Encoding text embeddings with test dimensions
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/root/.pyenv/versions/3.8.13/lib/python3.8/site-packages/uvicorn/protocols/http/httptools_impl.py", line 401, in run_asgi
    result = await app(self.scope, self.receive, self.send)
  File "/root/.pyenv/versions/3.8.13/lib/python3.8/site-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
    return await self.app(scope, receive, send)
  File "/root/.pyenv/versions/3.8.13/lib/python3.8/site-packages/fastapi/applications.py", line 269, in __call__
    await super().__call__(scope, receive, send)
  File "/root/.pyenv/versions/3.8.13/lib/python3.8/site-packages/starlette/applications.py", line 124, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/root/.pyenv/versions/3.8.13/lib/python3.8/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/root/.pyenv/versions/3.8.13/lib/python3.8/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/root/.pyenv/versions/3.8.13/lib/python3.8/site-packages/starlette/exceptions.py", line 93, in __call__
    raise exc
  File "/root/.pyenv/versions/3.8.13/lib/python3.8/site-packages/starlette/exceptions.py", line 82, in __call__
    await self.app(scope, receive, sender)
  File "/root/.pyenv/versions/3.8.13/lib/python3.8/site-packages/fastapi/middleware/asyncexitstack.py", line 21, in __call__
    raise e
.............
  File "/root/.pyenv/versions/3.8.13/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/src/encoders/x_transformer.py", line 609, in forward
    x = self.token_emb(x)
  File "/root/.pyenv/versions/3.8.13/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.pyenv/versions/3.8.13/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 158, in forward
    return F.embedding(
  File "/root/.pyenv/versions/3.8.13/lib/python3.8/site-packages/torch/nn/functional.py", line 2199, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
ⅹ /predictions call returned status 500
```

Same for laion-ai/ongo and pixray/text2image, so I believe there is something wrong with my setup.

On the other hand, borisdayma/dalle-mini seems to be working, though it throws an OOM ;)

afiaka87 commented 2 years ago

@ktos I've just updated afiaka87/glid-3-xl with a version I'm hoping will be compatible with your card. Could you give it a shot again? Then I can update erlich with the same.

For future reference, I help maintain the repository for erlich/ongo. You can (and should!) build an image from scratch if you want to - https://github.com/LAION-AI/ldm-finetune

ktos commented 2 years ago

@afiaka87 It has taken me a few days, but I checked it out, and everything is working now. Thank you!

okanji commented 1 year ago

How did you solve this? I have the same issue.

arnavmehta7 commented 1 year ago

Hmmm, I have a 3050, but I used to get this error when my installation of PyTorch was incorrect or unsupported. I'd suggest installing torch with miniconda/anaconda, as it'll manage the right CUDA/cuDNN versions for you. Otherwise, check out "light-the-torch"; it'll get things done :)
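A sketch of both routes (package names as in the comment above; the pinned `cudatoolkit` version is just an example and should match your driver):

```shell
# Option 1: let conda resolve a torch build against a specific CUDA toolkit
conda install pytorch torchvision cudatoolkit=11.1 -c pytorch -c nvidia

# Option 2: light-the-torch detects your local CUDA setup and picks a matching wheel
pip install light-the-torch
ltt install torch torchvision
```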