huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

Feedforward Output Inconsistency with Varying Sequence Lengths #8809

Closed soryxie closed 3 months ago

soryxie commented 3 months ago

Describe the bug

When using the FeedForward module from diffusers.models.attention, I've observed discrepancies in the results when processing subsets of the original input that differ in sequence length. However, I'd expect the FeedForward module to operate independently of sequence length, since it applies the same weights to every token position.
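To illustrate the expectation: a position-wise MLP applies identical weights to every token, so slicing the sequence before or after the forward pass should be mathematically equivalent. The sketch below uses a plain Linear–GELU–Linear stack in float64 on CPU as a hypothetical stand-in for diffusers' FeedForward (an assumption; the real module also uses GEGLU and dropout, but remains position-wise).

```python
import torch

# Minimal position-wise MLP: the same weights act on each token independently,
# so a subset of the sequence should yield the same outputs as slicing the
# full-sequence result. float64 on CPU keeps rounding differences negligible.
torch.manual_seed(0)
mlp = torch.nn.Sequential(
    torch.nn.Linear(16, 64),
    torch.nn.GELU(),
    torch.nn.Linear(64, 16),
).double()

with torch.no_grad():
    x = torch.randn(2, 32, 16, dtype=torch.float64)
    full = mlp(x)[:, :20, :]   # run full sequence, then slice the output
    sub = mlp(x[:, :20, :])    # slice the input, then run

print(torch.allclose(full, sub))  # True
```

In float64 the comparison passes comfortably; the issue below arises because the same property is checked bitwise-sensitively in float32 on CUDA.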

Reproduction

import torch
import diffusers
from diffusers.models.attention import FeedForward

module = FeedForward(1536, dropout=0).to("cuda")

with torch.no_grad():
    inp = torch.randn(2, 2048, 1536).to("cuda")
    # Run on the full sequence, then slice the output.
    ref = module(inp)[:, :1400, :]
    # Run on the sliced input directly.
    out = module(inp[:, :1400, :])

assert torch.allclose(ref, out)

Logs

assert torch.allclose(ref, out)
AssertionError


### System Info

- 🤗 Diffusers version: 0.30.0.dev0 (also reproducible in released v0.29.0–v0.30.0)
- Platform: Linux-5.15.0-113-generic-x86_64-with-glibc2.35
- Running on a notebook?: No
- Running on Google Colab?: No
- Python version: 3.10.14
- PyTorch version (GPU?): 2.3.1+cu121 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Huggingface_hub version: 0.23.4
- Transformers version: 4.42.3
- Accelerate version: 0.32.1
- PEFT version: not installed
- Bitsandbytes version: not installed
- Safetensors version: 0.4.3
- xFormers version: not installed
- Accelerator: 4× Tesla V100-SXM2-16GB, 16384 MiB VRAM each
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: No

### Who can help?

@yiyixuxu @sayakpaul @DN6 @asomoza
Thanks for helping me!
tolgacangoz commented 3 months ago

When workloads are different, PyTorch doesn't guarantee bitwise reproducibility: different input shapes can dispatch different CUDA kernels, whose floating-point reduction order (and hence rounding) differs. See this discussion: https://discuss.pytorch.org/t/different-outputs-when-using-different-batch-size-only-on-cuda
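Given that, the practical fix for the snippet in the issue is to compare with an explicit tolerance rather than `torch.allclose`'s strict defaults. A hedged sketch (using a plain `torch.nn.Linear` as a stand-in so it runs without diffusers, and falling back to CPU when CUDA is unavailable; the tolerance values are illustrative, not prescribed by PyTorch):

```python
import torch

# Different sequence lengths can select different GEMM kernels on CUDA, so
# bitwise-identical outputs are not guaranteed; compare within a tolerance.
device = "cuda" if torch.cuda.is_available() else "cpu"
lin = torch.nn.Linear(1536, 1536).to(device)

with torch.no_grad():
    inp = torch.randn(2, 2048, 1536, device=device)
    ref = lin(inp)[:, :1400, :]
    out = lin(inp[:, :1400, :])

# Loosened tolerances absorb kernel-dependent rounding differences.
print(torch.allclose(ref, out, atol=1e-5, rtol=1e-4))
```

The outputs still agree to roughly float32 precision; only the last few bits differ, which is why the strict default `atol`/`rtol` in the original reproduction trips the assertion.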

soryxie commented 3 months ago

Thanks for helping.