replicate / cog

Containers for machine learning
https://cog.run
Apache License 2.0

cannot work with lambdalabs gpu #1612

Open abeatbeyondlab opened 6 months ago

abeatbeyondlab commented 6 months ago

I am following this tutorial https://replicate.com/docs/guides/get-a-gpu-machine

I run sudo cog predict r8.im/stability-ai/stable-diffusion@sha256:ac732df83cea7fff18b8472768c88ad041fa750ff7682a21affe81863cbe77e4 -i prompt="a pot of gold"

And I get the following error:


Starting Docker image r8.im/stability-ai/stable-diffusion@sha256:ac732df83cea7fff18b8472768c88ad041fa750ff7682a21affe81863cbe77e4 and running setup()...
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/root/.pyenv/versions/3.11.4/lib/python3.11/site-packages/cog/server/http.py", line 354, in <module>
    app = create_app(
          ^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.4/lib/python3.11/site-packages/cog/server/http.py", line 71, in create_app
    predictor = load_predictor_from_ref(predictor_ref)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.4/lib/python3.11/site-packages/cog/predictor.py", line 155, in load_predictor_from_ref
    spec.loader.exec_module(module)
  File "<frozen importlib._bootstrap_external>", line 940, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/src/predict.py", line 17, in <module>
    from dynamic_sd.src.pipeline_stable_diffusion_ait_alt import StableDiffusionAITPipeline
  File "/src/dynamic_sd/src/pipeline_stable_diffusion_ait_alt.py", line 40, in <module>
    from .compile_lib.compile_vae_alt import map_vae
  File "/src/dynamic_sd/src/compile_lib/compile_vae_alt.py", line 21, in <module>
    from ..modeling.vae import AutoencoderKL as ait_AutoencoderKL
  File "/src/dynamic_sd/src/modeling/vae.py", line 22, in <module>
    from .unet_blocks import get_up_block, UNetMidBlock2D
  File "/src/dynamic_sd/src/modeling/unet_blocks.py", line 36, in <module>
    from .clip import SpatialTransformer
  File "/src/dynamic_sd/src/modeling/clip.py", line 24, in <module>
    USE_CUDA = detect_target().name() == "cuda"
               ^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.4/lib/python3.11/site-packages/aitemplate/testing/detect_target.py", line 132, in detect_target
    raise RuntimeError("Unsupported platform")
RuntimeError: Unsupported platform
ⅹ Failed to get container status: exit status 1
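
As far as I can tell from the traceback, the failure happens in AITemplate's platform detection before the model is even loaded. A minimal way to reproduce just that check inside the container (a rough sketch; it assumes the image's Python and its aitemplate package are reachable, e.g. via docker run into the same image) would be:

from aitemplate.testing import detect_target  # the same helper the traceback fails in

try:
    target = detect_target()
    # On a working GPU setup this prints "cuda"; the predictor checks exactly this.
    print("AITemplate target:", target.name())
except RuntimeError as err:
    # The "Unsupported platform" path: AITemplate found neither a CUDA nor a
    # ROCm environment it recognizes inside the container.
    print("AITemplate could not detect a GPU target:", err)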


Any feedback?

ReadCommitPush commented 5 months ago

I am getting the same issue.

abeatbeyondlab commented 5 months ago

Any feedback on this?

Jordan-Lambda commented 5 months ago

Hi,

I am an SWE working for Lambda, and I decided to look into this problem. I know next to nothing about cog, but following the directions linked in the original report, I can confirm that the problem reproduces.

I did find that the following steps on a freshly launched instance successfully generated a file output.0.png though:

  1. git clone https://github.com/replicate/cog-stable-diffusion.git
  2. cd cog-stable-diffusion/
  3. sudo cog run script/download-weights && clear (output from the script left my terminal in a bad state, hence the clear)
  4. sudo cog predict -i prompt="a pot of gold"

Is the version of CUDA provided by Lambda Stack not supported? I ask because the first line of output from that last command is the following: "⚠ Cog doesn't know if CUDA 11.8 is compatible with PyTorch 1.13.0. This might cause CUDA problems."

Note that I don't know where the "CUDA 11.8" is coming from:

Mon May  6 23:53:09 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10                     On  | 00000000:08:00.0 Off |                    0 |
|  0%   36C    P8              16W / 150W |      3MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
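
For reference, the 12.2 above is just the CUDA version the host driver reports; the "CUDA 11.8" in the warning presumably refers to the toolkit the image was built against rather than the host driver. A quick way to see what the container's PyTorch was actually built with (just a sketch, meant to be run inside the container) is:

import torch

print("PyTorch:", torch.__version__)                  # e.g. 1.13.0
print("Built against CUDA:", torch.version.cuda)      # toolkit version baked into the wheel
print("GPU visible to PyTorch:", torch.cuda.is_available())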

If there is anything that I can do to help troubleshoot this, or if there's a change to our on-demand VM base image that might prevent this in the future, please let me know.

abeatbeyondlab commented 4 months ago

No news on this from the Replicate team?

alessandromorandi commented 4 months ago

Hey, I have the same issue here! Any news? @Jordan-Lambda