Does this happen for all prompts you tried so far?
@patrickvonplaten @pcuenca Cc.
This bug happens in the UNet step and it differs from sampler to sampler. It takes the input latent in the correct format but returns a tensor of the same shape full of NaNs. From my testing, this happens quite often with DPM samplers, while EulerA is quite stable.
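For anyone trying to reproduce: a minimal sketch (assuming a standard StableDiffusionPipeline; the model name here is just an example) that uses the diffusers step callback to report at which denoising step the latents first turn NaN:

import torch
from diffusers import StableDiffusionPipeline

# Sketch: report the first denoising step whose latents contain NaNs.
pipe = StableDiffusionPipeline.from_pretrained(
    "hakurei/waifu-diffusion", torch_dtype=torch.float16
).to("cuda")

def check_nan(step, timestep, latents):
    # diffusers 0.16-style callback, invoked after every scheduler step
    if torch.isnan(latents).any():
        print(f"NaN latents at step {step} (timestep {timestep})")

pipe(prompt="sunflower", callback=check_nan, callback_steps=1)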
Hmm. Could you ensure you're on the latest Torch 2.0 install?
I'm a VoltaML developer. We are using the stable release of PyTorch 2.0 with CUDA 11.8 in a containerized environment, so nothing should interfere.
I will let @pcuenca comment here.
@ttio2tech, could you add a reproducible code snippet here?
The code that caused the error:
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "hakurei/waifu-diffusion", torch_dtype=torch.float16
).to("cuda")
pipe.unet = torch.compile(pipe.unet)  # this step gave the error
Does it give warning or error?
If it's an error, it could be because of dependency problems. Could you post the error snippet?
Also seeing this problem a bit now - @pcuenca , @patil-suraj did you ever encounter this?
I'm confused, the issue title and @ttio2tech mention a ROCm environment, but @Stax124 says they are using CUDA. Does this issue only apply to ROCm/AMD cards, or is this a general problem with torch 2?
I haven't seen it myself, but will test the model mentioned by @ttio2tech (on CUDA, unfortunately I have no AMD cards compatible with ROCm).
It's a bit hard to produce a minimal example due to the various dependencies involved, but I think I've been able to narrow the issue down a bit. I'm running diffusers 0.16.1 with a custom build of PyTorch 2.1 against ROCm 5.5.0 (currently needed for AMD RDNA3 support). The GPU is an RX 7900 XTX. FP32 works fine in all cases, but with FP16 I get the NaN issue when using the model https://huggingface.co/waifu-diffusion/wd-1-5-beta2. With https://huggingface.co/hakurei/waifu-diffusion, FP16 is ok as well.
I've attached a dump of the pipelines and a diff here: https://gist.github.com/onitake/1534ea2eedecb5d346fc70ba2a278d81
What stands out to me is that the working pipeline (WD-1.4) uses CLIPImageProcessor while the non-working one (WD-1.5) uses CLIPFeatureExtractor, and that WD-1.4 sets PNDMScheduler.prediction_type='epsilon', while in WD-1.5 it's PNDMScheduler.prediction_type='v_prediction'.
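(For reference, those fields can be read directly off the loaded pipelines; a small sketch, with the local checkpoint paths as placeholders:)

from diffusers import DiffusionPipeline

# Sketch: print the feature extractor class and scheduler prediction_type
# for both checkpoints to confirm the config differences.
for path in ("../waifu-diffusion/", "../wd-1-5-beta2/"):
    pipe = DiffusionPipeline.from_pretrained(path)
    print(path, type(pipe.feature_extractor).__name__,
          pipe.scheduler.config.prediction_type)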
The code to trigger the error is very basic:
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    pretrained_model_name_or_path="../wd-1-5-beta2/",
    torch_dtype=torch.float16,
).to("cuda")
generator = torch.Generator("cuda").manual_seed(0)
result = pipe(prompt="sunflower", generator=generator)
for image in result.images:
    image.save("test.png")
I've tried a few other configurations as well, but the result is the same.
Is that enough to investigate the issue, or is there anything else I can try/provide?
Edit: I tried modifying the scheduler configuration to use prediction_type='epsilon', but that still resulted in NaNs.
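(In case it helps anyone reproduce that attempt: the override can also be applied from code instead of editing the files on disk; a sketch, assuming the pipe from the snippet above and a PNDMScheduler:)

from diffusers import PNDMScheduler

# Sketch: rebuild the scheduler from its config with prediction_type overridden.
config = dict(pipe.scheduler.config)
config["prediction_type"] = "epsilon"
pipe.scheduler = PNDMScheduler.from_config(config)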
Edit 2: CLIPFeatureExtractor turned out to be a red herring too. Modifying the pipeline description to use CLIPImageProcessor instead didn't help.
I traced a failed run through to https://github.com/huggingface/diffusers/blob/v0.16.1/src/diffusers/models/attention_processor.py#L743
This will produce some NaN values in the result, and as soon as that happens, they'll quickly propagate through the whole hidden_states tensor at the end of the iteration.
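To pin down the first module that emits NaNs, a generic forward-hook sketch like this can help (nothing diffusers-specific; it just walks pipe.unet from the snippet above):

import torch

# Sketch: hook every UNet submodule and print the one whose output first
# contains NaNs; later hits are suppressed once NaNs have been seen.
def add_nan_hooks(model):
    seen = {"nan": False}
    def make_hook(name):
        def hook(module, inputs, output):
            if seen["nan"] or not isinstance(output, torch.Tensor):
                return
            if torch.isnan(output).any():
                seen["nan"] = True
                print(f"NaNs first appear in {name} ({type(module).__name__})")
        return hook
    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

add_nan_hooks(pipe.unet)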
Maybe related: https://pytorch.org/docs/main/notes/numerical_accuracy.html (I tried some of those environment variables and flags, but they didn't help).
After experimenting with https://pytorch.org/docs/main/generated/torch.nn.functional.scaled_dot_product_attention.html a bit, I found that my Torch build only includes the C++ implementation. Maybe this is the problem, or maybe the optimized algorithms are hiding the NaNs.
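For reference, which SDPA backends are currently enabled can be queried, and the plain math implementation forced, with the PyTorch 2.x flags below (a sketch using random fp16 tensors, not the real model inputs):

import torch
import torch.nn.functional as F

# Sketch: show which scaled_dot_product_attention backends are enabled,
# then force the math (C++) path on a random fp16 input and check for NaNs.
print("flash:", torch.backends.cuda.flash_sdp_enabled())
print("mem_efficient:", torch.backends.cuda.mem_efficient_sdp_enabled())
print("math:", torch.backends.cuda.math_sdp_enabled())

q = k = v = torch.randn(1, 8, 64, 64, device="cuda", dtype=torch.float16)
with torch.backends.cuda.sdp_kernel(enable_flash=False,
                                    enable_mem_efficient=False,
                                    enable_math=True):
    out = F.scaled_dot_product_attention(q, k, v)
print("NaNs:", torch.isnan(out).any().item())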
Unfortunately, there doesn't seem to be a ROCm implementation of the FlashAttention algorithm yet. AMD is still working on it: https://github.com/ROCmSoftwarePlatform/flash-attention/pull/1
In any case, I found that I can hide the issue by adding the following line after the F.scaled_dot_product_attention() call:
hidden_states = torch.nan_to_num(hidden_states)
It didn't look like it had much of a performance impact on my machine. The algorithm ran at about the same speed as WD-1.4 without this line.
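If editing the installed diffusers sources is inconvenient, roughly the same workaround can be applied from user code by wrapping the attention call (a sketch; note it patches F.scaled_dot_product_attention globally for the process):

import torch
import torch.nn.functional as F

# Sketch: wrap scaled_dot_product_attention so NaNs in its output are replaced
# with zeros, without touching attention_processor.py.
_orig_sdpa = F.scaled_dot_product_attention

def _sdpa_nan_to_num(*args, **kwargs):
    return torch.nan_to_num(_orig_sdpa(*args, **kwargs))

F.scaled_dot_product_attention = _sdpa_nan_to_num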
Thanks for sharing your detailed insights here. If this issue comes up multiple times, I think we might fix it with the following (referring users to this issue thread):
torch.nan_to_num(hidden_states)
But if this behavior could be reproduced minimally with just vanilla PyTorch code on your setup, that'd probably be more helpful for the PyTorch team to understand the lower-level reasons behind this.
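For example, one possible way to get such a standalone repro would be to dump the query/key/value tensors right before the failing F.scaled_dot_product_attention call and replay them with plain PyTorch (a sketch; the save path and the exact place to save from are up to you):

# Step 1 (inside the failing run, just before the attention call):
#   torch.save({"q": query, "k": key, "v": value}, "sdpa_inputs.pt")

# Step 2 (standalone vanilla PyTorch script):
import torch
import torch.nn.functional as F

t = torch.load("sdpa_inputs.pt")
q, k, v = (x.to("cuda", torch.float16) for x in (t["q"], t["k"], t["v"]))
out = F.scaled_dot_product_attention(q, k, v)
print("NaNs in output:", torch.isnan(out).any().item())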
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
I have the same issue on M1 with https://huggingface.co/stabilityai/stable-diffusion-2-1 model
I was testing the https://huggingface.co/docs/diffusers/optimization/torch2.0 examples. The normal 'Accelerated Transformers implementation' was successful, but the torch.compile example:
pipe.unet = torch.compile(pipe.unet)
image = pipe(prompt).images[0]
gives the warning:
pipelines/pipeline_utils.py:1023: RuntimeWarning: invalid value encountered in cast
images = (images * 255).round().astype("uint8")
resulting in black images.
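One quick way to confirm the NaNs appear before the uint8 cast is to ask the pipeline for numpy output and check it directly (a sketch, reusing pipe and prompt from the example above):

import numpy as np

# Sketch: request float numpy images and check for NaNs before any casting.
images = pipe(prompt, output_type="np").images
print("NaNs in float output:", np.isnan(images).any())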