luosiallen / latent-consistency-model

Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference
MIT License

TORCH_USE_CUDA_DSA error in recent update #34

Closed SoftologyPro closed 10 months ago

SoftologyPro commented 10 months ago

Running app.py locally (Windows). The UI opens, but when one of the sample prompts is clicked it errors out with this message:

self.timesteps = torch.from_numpy(timesteps.copy()).to(device=device, dtype=torch.long)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Any ideas on what needs to be done to fix this and get it working again?
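
For reference, "invalid device ordinal" means something requested a CUDA device index that does not exist, so a quick way to see what torch actually exposes on this machine (stock PyTorch calls only, nothing specific to this repo) is:

import torch

# "invalid device ordinal" = a CUDA device index was requested that
# doesn't exist, so list what this machine actually exposes.
print(torch.cuda.is_available())   # True if the CUDA build of torch loaded
print(torch.cuda.device_count())   # number of devices torch can see
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))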

To set up a local environment I use these packages/versions:

python -m pip install --upgrade pip
pip install --no-cache-dir --ignore-installed --force-reinstall --no-warn-conflicts wheel==0.38.4
pip install --no-cache-dir --ignore-installed --force-reinstall --no-warn-conflicts transformers==4.34.1
pip install --no-cache-dir --ignore-installed --force-reinstall --no-warn-conflicts accelerate==0.23.0
pip install --no-cache-dir --ignore-installed --force-reinstall --no-warn-conflicts gradio==3.48.0
pip install --no-cache-dir --ignore-installed --force-reinstall --no-warn-conflicts diffusers==0.22.3
pip uninstall -y torch
pip install --no-cache-dir --ignore-installed --force-reinstall --no-warn-conflicts torch==2.0.1+cu118 torchvision==0.15.2+cu118 torchaudio==2.0.2+cu118 --index-url https://download.pytorch.org/whl/cu118
pip uninstall -y typing_extensions
pip install --no-cache-dir --ignore-installed --force-reinstall --no-warn-conflicts typing_extensions==4.8.0
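
After installing, a quick sanity check that the cu118 build is the one actually loaded (plain PyTorch version queries, nothing specific to this repo):

import torch

# Verify the CUDA 11.8 build installed rather than a CPU-only wheel.
print(torch.__version__)          # expect 2.0.1+cu118
print(torch.version.cuda)         # expect 11.8
print(torch.cuda.is_available())  # expect True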

Here is the pip list, just in case that helps:

Package                   Version
------------------------- ------------
accelerate                0.23.0
aiofiles                  23.2.1
altair                    5.1.2
annotated-types           0.6.0
anyio                     3.7.1
attrs                     23.1.0
certifi                   2023.7.22
charset-normalizer        3.3.2
click                     8.1.7
colorama                  0.4.6
contourpy                 1.2.0
cycler                    0.12.1
diffusers                 0.22.3
exceptiongroup            1.1.3
fastapi                   0.104.1
ffmpy                     0.3.1
filelock                  3.13.1
fonttools                 4.44.0
fsspec                    2023.10.0
gradio                    3.48.0
gradio_client             0.6.1
h11                       0.14.0
httpcore                  1.0.1
httpx                     0.25.1
huggingface-hub           0.17.3
idna                      3.4
importlib-metadata        6.8.0
importlib-resources       6.1.1
Jinja2                    3.1.2
jsonschema                4.19.2
jsonschema-specifications 2023.7.1
kiwisolver                1.4.5
MarkupSafe                2.1.3
matplotlib                3.8.1
mpmath                    1.3.0
networkx                  3.2.1
numpy                     1.26.1
orjson                    3.9.10
packaging                 23.2
pandas                    2.1.2
Pillow                    10.1.0
pip                       23.3.1
psutil                    5.9.6
pydantic                  2.4.2
pydantic_core             2.10.1
pydub                     0.25.1
pyparsing                 3.1.1
python-dateutil           2.8.2
python-multipart          0.0.6
pytz                      2023.3.post1
PyYAML                    6.0.1
referencing               0.30.2
regex                     2023.10.3
requests                  2.31.0
rpds-py                   0.12.0
safetensors               0.4.0
semantic-version          2.10.0
setuptools                63.2.0
six                       1.16.0
sniffio                   1.3.0
starlette                 0.27.0
sympy                     1.12
tokenizers                0.14.1
toolz                     0.12.0
torch                     2.0.1+cu118
torchaudio                2.0.2+cu118
torchvision               0.15.2+cu118
tqdm                      4.66.1
transformers              4.34.1
typing_extensions         4.8.0
tzdata                    2023.3
urllib3                   2.0.7
uvicorn                   0.24.0.post1
websockets                11.0.3
wheel                     0.38.4
zipp                      3.17.0
luosiallen commented 10 months ago

which gpus are you using?

luosiallen commented 10 months ago

maybe try this link: https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/8965#issuecomment-1530085758
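
I have not re-checked exactly what that comment suggests, but the usual workaround in threads like that one (an assumption here, not verified against that exact comment) is to pin the process to a single GPU before torch initializes, along these lines:

import os

# Assumed workaround for "invalid device ordinal" on multi-GPU machines:
# expose only one device to CUDA. Must run before the first `import torch`.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # index of the GPU to keep

import torch
print(torch.cuda.device_count())  # should report 1 after pinning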

SoftologyPro commented 10 months ago

which gpus are you using?

Single 4090

SoftologyPro commented 10 months ago

maybe try this link: AUTOMATIC1111/stable-diffusion-webui#8965 (comment)

I want to run this "stand alone" outside Web UI. Just from the command line.

This error seems to have started once LCM support was added to the diffusers repo; prior to that, when diffusers was a subfolder of this repo, it all worked fine.

Also, this is not just me. Another user reported the problem to me and I verified the same error and then raised this issue.

INeedACocaineNinja commented 10 months ago

Yeah, hi. As for me (the mentioned user), I have a 4080 plus the integrated graphics on the CPU (AMD Ryzen 9 7950X). Running torch.cuda.device_count() returns only 1 though, and CUDA does work with other projects (e.g. Illusion Diffusion).

vantang commented 10 months ago

same issue

vantang commented 10 months ago

I think I've solved this issue, but I still have a question to ask @luosiallen.

Here is my solution below, with my question at the end.

I tried to deploy app.py on Linux, Windows, and macOS; every environment reported this issue.

But I can run inference using your sample code, like this:

from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained("SimianLuo/LCM_Dreamshaper_v7")

# To save GPU memory, torch.float16 can be used, but it may compromise image quality.
pipe.to(torch_device="cuda", torch_dtype=torch.float32)

prompt = "Self-portrait oil painting, a beautiful cyborg with golden hair, 8k"

# Can be set to 1~50 steps. LCM support fast inference even <= 4 steps. Recommend: 1~8 steps.
num_inference_steps = 4 

images = pipe(prompt=prompt, num_inference_steps=num_inference_steps, guidance_scale=8.0, lcm_origin_steps=50, output_type="pil").images

To save the generated images, I modified the sample code like this:

from diffusers import DiffusionPipeline
import torch

# Create the DiffusionPipeline instance
pipe = DiffusionPipeline.from_pretrained("SimianLuo/LCM_Dreamshaper_v7")

# Set the pipeline's device and dtype
pipe.to(torch_device="cuda", torch_dtype=torch.float32)

# Input prompt
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"

# Can be set to 1~50 steps. LCM supports fast inference, even <= 4 steps. Recommended: 1~8 steps.
num_inference_steps = 4

# Generate the images
images = pipe(prompt=prompt, num_inference_steps=num_inference_steps, guidance_scale=8.0, lcm_origin_steps=50, output_type="pil").images

# Save each generated image
for i, img_pil in enumerate(images):
    img_pil.save(f"generated_image_{i+1}.png")

print("Images saved successfully.")

It also works and the images save successfully, so I think my Python environment is OK.

I went to look at the Hugging Face demo and tried to find any difference.

I noticed lines 38 and 39, where @luosiallen had commented out line 39:

pipe = DiffusionPipeline.from_pretrained("SimianLuo/LCM_Dreamshaper_v7")
# pipe = DiffusionPipeline.from_pretrained("SimianLuo/LCM_Dreamshaper_v7", custom_pipeline="latent_consistency_txt2img", custom_revision="main")

I think this is the key point, so I tried using:

pipe = DiffusionPipeline.from_pretrained("SimianLuo/LCM_Dreamshaper_v7")

It works perfectly. macOS, Windows, and Linux all work.

I just wonder: is there something wrong in custom_pipeline?

luosiallen commented 10 months ago

Thanks, I forgot to update the pipeline. The previous custom_pipeline is deprecated.
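
For anyone landing here later, a minimal sketch of the updated usage (based on vantang's working snippet above; diffusers 0.22+ resolves this model to its native LCM pipeline, so no custom_pipeline argument is needed):

from diffusers import DiffusionPipeline
import torch

# diffusers >= 0.22 loads the LCM pipeline natively, so the deprecated
# custom_pipeline/custom_revision arguments are simply dropped.
pipe = DiffusionPipeline.from_pretrained("SimianLuo/LCM_Dreamshaper_v7")
pipe.to(torch_device="cuda", torch_dtype=torch.float32)

images = pipe(
    prompt="Self-portrait oil painting, a beautiful cyborg with golden hair, 8k",
    num_inference_steps=4,  # LCM gives good results in 1-8 steps
    guidance_scale=8.0,
    output_type="pil",
).images
images[0].save("lcm_output.png")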

Zero0Alex commented 8 months ago

I am very new to Stable Diffusion, so can someone help me? I have this issue where 1024x512 is OK, but if I use hires fix or a width or height above 1024 I get "TORCH_USE_CUDA_DSA". I don't know how to post the log without breaking the formatting, so I put it in a txt at the end of my comment; sorry for the trouble. I already read some discussions here and tried some fixes, like reinstalling torch, but nothing worked for me (see the note after the log below).

Traceback (most recent call last):
  File "C:\Users\Denis\AppData\Local\Programs\Python\Python310\lib\threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "C:\Stable\stable-diffusion-webui\modules\memmon.py", line 53, in run
    free, total = self.cuda_mem_get_info()
  File "C:\Stable\stable-diffusion-webui\modules\memmon.py", line 34, in cuda_mem_get_info
    return torch.cuda.mem_get_info(index)
  File "C:\Stable\stable-diffusion-webui\venv\lib\site-packages\torch\cuda\memory.py", line 618, in mem_get_info
    return torch.cuda.cudart().cudaMemGetInfo(device)
RuntimeError: CUDA error: the launch timed out and was terminated
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Error completing request
Arguments: ('task(bs8ydtxhkdvyoag)', 'curvy Tear Grants\nnude, smiling, blushing, detailed eyes, chubby\nwith hairy armpit, large breasts, smelly pussy, smell, sweat, dripping wet\n pubic hair, armpit hair,\nat an onsen,\nsuper wide lens,\nbacklight\nmasterpiece, hyperdetailed, 8k, 4k, highres, detailed background\n , pubic hair', 'easynegative', [], 25, 'DPM++ 2M Karras', 1, 1, 7, 1920, 1280, False, 0.7, 2, 'Latent', 0, 0, 0, 'Use same checkpoint', 'Use same sampler', '', '', [], <gradio.routes.Request object at 0x000001E2E8BF39A0>, 0, False, '', 0.8, -1, False, -1, 0, 0, 0, False, False, 'positive', 'comma', 0, False, False, 'start', '', 1, '', [], 0, '', [], 0, '', [], True, False, False, False, 0, False) {}
Traceback (most recent call last):
  File "C:\Stable\stable-diffusion-webui\modules\call_queue.py", line 57, in f
    res = list(func(*args, **kwargs))
  File "C:\Stable\stable-diffusion-webui\modules\call_queue.py", line 36, in f
    res = func(*args, **kwargs)
  File "C:\Stable\stable-diffusion-webui\modules\txt2img.py", line 55, in txt2img
    processed = processing.process_images(p)
  File "C:\Stable\stable-diffusion-webui\modules\processing.py", line 734, in process_images
    res = process_images_inner(p)
  File "C:\Stable\stable-diffusion-webui\modules\processing.py", line 875, in process_images_inner
    x_samples_ddim = decode_latent_batch(p.sd_model, samples_ddim, target_device=devices.cpu, check_for_nans=True)
  File "C:\Stable\stable-diffusion-webui\modules\processing.py", line 600, in decode_latent_batch
    devices.test_for_nans(sample, "vae")
  File "C:\Stable\stable-diffusion-webui\modules\devices.py", line 131, in test_for_nans
    if not torch.all(torch.isnan(x)).item():
RuntimeError: CUDA error: the launch timed out and was terminated
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Traceback (most recent call last):
  File "C:\Stable\stable-diffusion-webui\venv\lib\site-packages\gradio\routes.py", line 488, in run_predict
    output = await app.get_blocks().process_api(
  File "C:\Stable\stable-diffusion-webui\venv\lib\site-packages\gradio\blocks.py", line 1431, in process_api
    result = await self.call_function(
  File "C:\Stable\stable-diffusion-webui\venv\lib\site-packages\gradio\blocks.py", line 1103, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "C:\Stable\stable-diffusion-webui\venv\lib\site-packages\anyio\to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "C:\Stable\stable-diffusion-webui\venv\lib\site-packages\anyio\_backends\_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "C:\Stable\stable-diffusion-webui\venv\lib\site-packages\anyio\_backends\_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "C:\Stable\stable-diffusion-webui\venv\lib\site-packages\gradio\utils.py", line 707, in wrapper
    response = f(*args, **kwargs)
  File "C:\Stable\stable-diffusion-webui\modules\call_queue.py", line 77, in f
    devices.torch_gc()
  File "C:\Stable\stable-diffusion-webui\modules\devices.py", line 61, in torch_gc
    torch.cuda.empty_cache()
  File "C:\Stable\stable-diffusion-webui\venv\lib\site-packages\torch\cuda\memory.py", line 133, in empty_cache
    torch._C._cuda_emptyCache()
RuntimeError: CUDA error: the launch timed out and was terminated
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Logcudaerror.txt