Marcophono2 closed this issue 1 year ago
Uff, I don't know the details of GPU hardware well enough here, sadly. @NouamaneTazi, do you have a hunch maybe? :-)
@NouamaneTazi , do you have a hunch? :)
I've never worked with PyTorch 1.13 or CUDA 11.8 before. Do the 3 benchmarks use the same environment, @Marcophono2?
Yes, they did, @NouamaneTazi. Meanwhile I found out that the repo https://github.com/AUTOMATIC1111/stable-diffusion-webui offers technical support for the 4090. I can produce a 640x640px image at 22it/s. I used this unusual size because with 512x512 my GPU utilization is only 60%. There is no other bottleneck, so I could enlarge the size until utilization is nearly 100% without losing speed. So the performance is great, but it is a Windows-only solution and at the moment only with a web UI. Not so good for command-line-based processing on a multi-GPU platform.
I'm afraid I can't help much myself, as I don't have access to any Lovelace GPU. It seems that updating cuDNN should help speed up inference. I would recommend you follow these threads: https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/2449 and https://github.com/AUTOMATIC1111/stable-diffusion-webui/discussions/2537. The same should apply to diffusers.
@NouamaneTazi, yes, I know those threads and followed every detail. But in the end I was not able to build it in the same, or a similar, way for Linux. No problem that you don't have access to a Lovelace GPU: feel free to use my vast.ai account and book a 4090 instance there. Right at this moment I am using one. My billing account there is well funded, so you can use a 4090 instance around the clock for days. Just send me a message if you are interested and I'll send you my login data.
@NouamaneTazi I have lost so much time searching for a solution that it would be a great pleasure to pay you for your work if you are successful. (But I need it by Monday at the latest, sorry :-)
@Marcophono2 To build xformers for Lovelace, you need to modify torch/utils/cpp_extension.py to include CUDA arch "8.9"
PyTorch 1.13 regressed performance on my machine, so you may be losing performance there.
@C43H66N12O12S2 Interesting! But I think more is necessary than adding
('Lovelace', '8.9+PTX'),
and
supported_arches = ['3.5', '3.7', '5.0', '5.2', '5.3', '6.0', '6.1', '6.2',
                    '7.0', '7.2', '7.5', '8.0', '8.6', '8.9']
:) But what exactly? I installed everything again and am staying with PyTorch 1.12 now, as you recommended.
After you modify the file, set the TORCH_CUDA_ARCH_LIST="8.9" environment variable.
This is how I compile my Windows wheels.
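For reference, `TORCH_CUDA_ARCH_LIST` is a semicolon- or space-separated list of compute capabilities that `torch.utils.cpp_extension` turns into nvcc `-gencode` flags. A simplified sketch of that mapping (my own re-implementation for illustration, not PyTorch's actual code):

```python
def arch_list_to_gencode(arch_list: str) -> list[str]:
    """Map a TORCH_CUDA_ARCH_LIST-style string (e.g. "8.6;8.9+PTX")
    to nvcc -gencode flags, roughly as cpp_extension does."""
    flags = []
    for arch in arch_list.replace(" ", ";").split(";"):
        ptx = arch.endswith("+PTX")          # "+PTX" also embeds forward-compatible PTX
        num = arch.removesuffix("+PTX").replace(".", "")
        flags.append(f"-gencode=arch=compute_{num},code=sm_{num}")
        if ptx:
            flags.append(f"-gencode=arch=compute_{num},code=compute_{num}")
    return flags

print(arch_list_to_gencode("8.9"))
# ['-gencode=arch=compute_89,code=sm_89']
```

So "8.9" alone produces native sm_89 SASS only, while "8.9+PTX" additionally embeds PTX that future architectures can JIT-compile.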
@C43H66N12O12S2 Okay. But shouldn't it be the other way round? PyTorch must be recompiled with that added environment parameter, is that correct? But then the cpp_extension.py file would be overwritten.
@C43H66N12O12S2 And I also think I need CUDA 11.8 for the SD project then, or not?
No, PyTorch (and the official releases) is fine. Modifying cpp_extension.py is necessary because PyTorch hard-blocks any CUDA arch not on its list.
You need the 11.8 nvcc to compile for CUDA arch 8.9, yes. Not for inference.
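That nvcc requirement can be phrased as a simple version check; the arch-to-toolkit table below is my own summary (sm_89 support landed in CUDA 11.8), not something from this thread:

```python
# Minimum CUDA toolkit whose nvcc can emit SASS for a given compute
# capability (assumed table, not exhaustive).
MIN_CUDA_FOR_ARCH = {"8.0": "11.0", "8.6": "11.1", "8.9": "11.8", "9.0": "11.8"}

def nvcc_can_target(nvcc_version: str, arch: str) -> bool:
    """True if the given nvcc release can compile for the given arch."""
    need = MIN_CUDA_FOR_ARCH[arch]
    return tuple(map(int, nvcc_version.split("."))) >= tuple(map(int, need.split(".")))

print(nvcc_can_target("11.8", "8.9"))  # True
print(nvcc_can_target("11.7", "8.9"))  # False
```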
@C43H66N12O12S2 Okay. That sounds easier than expected. :) Can you tell me where in the environment I have to add it? Or how to add it to a command?
In Linux, TORCH_CUDA_ARCH_LIST="8.9" pip wheel -e .
inside the cloned xformers repo should work.
Great, thanks a lot, @C43H66N12O12S2! I have a good feeling that this will bring me a big step forward! :+1:
@C43H66N12O12S2 I was too optimistic. I think I did everything correctly (not really sure, of course), but I get a large error output.
My setup:
I installed the branch from @MatthieuTPHR -> https://github.com/MatthieuTPHR/diffusers/archive/refs/heads/memory_efficient_attention.zip
and I installed xformers as you mentioned. If I then start a little test program
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", revision="fp16", torch_dtype=torch.float16, use_auth_token="hf_LFWSneVmdLYPKbkIRpCrCKxxx",
).to("cuda")

with torch.inference_mode(), torch.autocast("cuda"):
    image = pipe("a small cat")
with
USE_MEMORY_EFFICIENT_ATTENTION=1 python test.py
I receive the following long error text. attention.py correctly detects that xformers is present. Any ideas what could be wrong? If I only start with
python test.py
the image is created, but at less than 10it/s. A bit weak for a 4090. I also noticed that my GPU memory is always around 22-23 GB occupied and utilization is at 99%.
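For context, the `USE_MEMORY_EFFICIENT_ATTENTION=1` switch in that branch is just an environment flag read at startup; a minimal sketch of that dispatch pattern (simplified, and not the branch's actual code):

```python
import os

def pick_attention_backend(env: dict) -> str:
    """Return which attention path would be used, mimicking (in simplified
    form) how a branch can gate on USE_MEMORY_EFFICIENT_ATTENTION=1."""
    if env.get("USE_MEMORY_EFFICIENT_ATTENTION") == "1":
        try:
            import xformers.ops  # noqa: F401  (only importable if the wheel built)
            return "xformers"
        except ImportError:
            return "xformers requested but not importable"
    return "standard"

print(pick_attention_backend(os.environ))
```

Note that a successful import does not guarantee the compiled op supports your GPU arch, which is exactly the failure mode in the traceback below: the Python module loads, but the CUDA kernel for this backend was never built.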
~/Schreibtisch/AI $ USE_MEMORY_EFFICIENT_ATTENTION=1 python test.py
xformers is present
Downloading: 100% (model, tokenizer, and config files)
Fetching 16 files: 100% | 16/16 [00:50<00:00, 3.13s/it]
0%| | 0/51 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/home/marc/Schreibtisch/AI/test.py", line 10, in <module>
image = pipe("a small cat")
File "/home/anaconda3/envs/SD/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/marc/Schreibtisch/AI/diffusers-memory_efficient_attention/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 326, in __call__
noise_pred = self.unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample
File "/home/anaconda3/envs/SD/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/marc/Schreibtisch/AI/diffusers-memory_efficient_attention/src/diffusers/models/unet_2d_condition.py", line 296, in forward
sample, res_samples = downsample_block(
File "/home/anaconda3/envs/SD/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/marc/Schreibtisch/AI/diffusers-memory_efficient_attention/src/diffusers/models/unet_2d_blocks.py", line 563, in forward
hidden_states = attn(hidden_states, context=encoder_hidden_states)
File "/home/anaconda3/envs/SD/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/marc/Schreibtisch/AI/diffusers-memory_efficient_attention/src/diffusers/models/attention.py", line 187, in forward
hidden_states = block(hidden_states, context=context)
File "/home/anaconda3/envs/SD/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/marc/Schreibtisch/AI/diffusers-memory_efficient_attention/src/diffusers/models/attention.py", line 236, in forward
hidden_states = self.attn1(self.norm1(hidden_states)) + hidden_states
File "/home/anaconda3/envs/SD/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/marc/Schreibtisch/AI/diffusers-memory_efficient_attention/src/diffusers/models/attention.py", line 275, in forward
out = xformers.ops.memory_efficient_attention(q, k, v, attn_bias=None, op=self.attention_op)
File "/home/marc/Schreibtisch/AI/xformers/xformers/ops.py", line 862, in memory_efficient_attention
return op.forward_no_grad(
File "/home/marc/Schreibtisch/AI/xformers/xformers/ops.py", line 305, in forward_no_grad
return cls.FORWARD_OPERATOR(
File "/home/anaconda3/envs/SD/lib/python3.9/site-packages/torch/_ops.py", line 143, in __call__
return self._op(*args, **kwargs or {})
NotImplementedError: Could not run 'xformers::efficient_attention_forward_cutlass' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'xformers::efficient_attention_forward_cutlass' is only available for these backends: [UNKNOWN_TENSOR_TYPE_ID, QuantizedXPU, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, SparseCPU, SparseCUDA, SparseHIP, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, SparseVE, UNKNOWN_TENSOR_TYPE_ID, NestedTensorCUDA, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID].
BackendSelect: fallthrough registered at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/core/BackendSelectFallbackKernel.cpp:3 [backend fallback]
Python: registered at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/core/PythonFallbackKernel.cpp:133 [backend fallback]
Named: registered at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/core/NamedRegistrations.cpp:7 [backend fallback]
Conjugate: registered at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/ConjugateFallback.cpp:18 [backend fallback]
Negative: registered at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/NegateFallback.cpp:18 [backend fallback]
ZeroTensor: registered at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/ZeroTensorFallback.cpp:86 [backend fallback]
ADInplaceOrView: fallthrough registered at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/core/VariableFallbackKernel.cpp:64 [backend fallback]
AutogradOther: fallthrough registered at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/core/VariableFallbackKernel.cpp:35 [backend fallback]
AutogradCPU: fallthrough registered at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/core/VariableFallbackKernel.cpp:39 [backend fallback]
AutogradCUDA: fallthrough registered at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/core/VariableFallbackKernel.cpp:47 [backend fallback]
AutogradXLA: fallthrough registered at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/core/VariableFallbackKernel.cpp:51 [backend fallback]
AutogradMPS: fallthrough registered at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/core/VariableFallbackKernel.cpp:59 [backend fallback]
AutogradXPU: fallthrough registered at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/core/VariableFallbackKernel.cpp:43 [backend fallback]
AutogradHPU: fallthrough registered at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/core/VariableFallbackKernel.cpp:68 [backend fallback]
AutogradLazy: fallthrough registered at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/core/VariableFallbackKernel.cpp:55 [backend fallback]
Tracer: registered at /opt/conda/conda-bld/pytorch_1659484806139/work/torch/csrc/autograd/TraceTypeManual.cpp:295 [backend fallback]
AutocastCPU: fallthrough registered at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/autocast_mode.cpp:481 [backend fallback]
Autocast: fallthrough registered at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/autocast_mode.cpp:324 [backend fallback]
Batched: registered at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/BatchingRegistrations.cpp:1064 [backend fallback]
VmapMode: fallthrough registered at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/VmapModeRegistrations.cpp:33 [backend fallback]
Functionalize: registered at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/FunctionalizeFallbackKernel.cpp:89 [backend fallback]
PythonTLSSnapshot: registered at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/core/PythonFallbackKernel.cpp:137 [backend fallback]
It looks like you made some errors while compiling, and the resulting xformers lacks any SASS code for 8.9.
As for performance issues with the 4090, you could try following my advice inside the thread posted earlier by Nouamane.
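One way to confirm that diagnosis is to run `cuobjdump --list-elf` on the built extension (e.g. the `_C*.so` inside the xformers install) and check whether any `sm_89` cubin is listed. A small parser sketch over illustrative output (the sample text below is made up for demonstration):

```python
import re

def sass_archs(cuobjdump_output: str) -> set[str]:
    """Extract the sm_XX architectures present in `cuobjdump --list-elf` output."""
    return set(re.findall(r"sm_(\d+)", cuobjdump_output))

# Illustrative output for a wheel built WITHOUT Lovelace support:
sample = """\
ELF file    1: _C.1.sm_80.cubin
ELF file    2: _C.2.sm_86.cubin
"""
print("89" in sass_archs(sample))  # False: no sm_89 SASS was compiled in
```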
To be honest, I do not know where I went wrong. I would really be happy if you can spot something:
~ $ nvidia-smi
Mon Oct 31 01:14:36 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.56.06 Driver Version: 520.56.06 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:2D:00.0 On | Off |
| 0% 39C P8 35W / 450W | 447MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 705 G /usr/lib/Xorg 210MiB |
| 0 N/A N/A 873 G /usr/bin/kwin_x11 46MiB |
| 0 N/A N/A 892 G /usr/bin/plasmashell 57MiB |
| 0 N/A N/A 1347 G /usr/lib/firefox/firefox 126MiB |
+-----------------------------------------------------------------------------+
~ $ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
~/Schreibtisch/AI/xformers $ TORCH_CUDA_ARCH_LIST="8.9" pip wheel -e .
Obtaining file:///home/marc/Schreibtisch/AI/xformers
Preparing metadata (setup.py) ... done
Collecting torch>=1.12
File was already downloaded /home/marc/Schreibtisch/AI/xformers/torch-1.13.0-cp39-cp39-manylinux1_x86_64.whl
Collecting numpy
File was already downloaded /home/marc/Schreibtisch/AI/xformers/numpy-1.23.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Collecting pyre-extensions==0.0.23
File was already downloaded /home/marc/Schreibtisch/AI/xformers/pyre_extensions-0.0.23-py3-none-any.whl
Collecting typing-extensions
File was already downloaded /home/marc/Schreibtisch/AI/xformers/typing_extensions-4.4.0-py3-none-any.whl
Collecting typing-inspect
File was already downloaded /home/marc/Schreibtisch/AI/xformers/typing_inspect-0.8.0-py3-none-any.whl
Collecting nvidia-cudnn-cu11==8.5.0.96
File was already downloaded /home/marc/Schreibtisch/AI/xformers/nvidia_cudnn_cu11-8.5.0.96-2-py3-none-manylinux1_x86_64.whl
Collecting nvidia-cublas-cu11==11.10.3.66
File was already downloaded /home/marc/Schreibtisch/AI/xformers/nvidia_cublas_cu11-11.10.3.66-py3-none-manylinux1_x86_64.whl
Collecting nvidia-cuda-nvrtc-cu11==11.7.99
File was already downloaded /home/marc/Schreibtisch/AI/xformers/nvidia_cuda_nvrtc_cu11-11.7.99-2-py3-none-manylinux1_x86_64.whl
Collecting nvidia-cuda-runtime-cu11==11.7.99
File was already downloaded /home/marc/Schreibtisch/AI/xformers/nvidia_cuda_runtime_cu11-11.7.99-py3-none-manylinux1_x86_64.whl
Collecting wheel
File was already downloaded /home/marc/Schreibtisch/AI/xformers/wheel-0.37.1-py2.py3-none-any.whl
Collecting setuptools
File was already downloaded /home/marc/Schreibtisch/AI/xformers/setuptools-65.5.0-py3-none-any.whl
Collecting mypy-extensions>=0.3.0
File was already downloaded /home/marc/Schreibtisch/AI/xformers/mypy_extensions-0.4.3-py2.py3-none-any.whl
Building wheels for collected packages: xformers
Building wheel for xformers (setup.py) ... done
Created wheel for xformers: filename=xformers-0.0.14.dev0-cp39-cp39-linux_x86_64.whl size=34465759 sha256=6b285b6d9a37c887a8154cc1f00f7291e13dc6eb9b926c8bca7b64cc62607eca
Stored in directory: /tmp/pip-ephem-wheel-cache-gnqnuxhl/wheels/f6/c7/73/63c154ea45fb20e7eec4f956dfb9c91be386a33afb31b7c359
Successfully built xformers
Yes, I will check again what @NouamaneTazi suggested. I thought that way was not an option because there are some Windows DLLs involved.
@C43H66N12O12S2, @NouamaneTazi DAMN!! Just replacing the cuDNN files in the torch lib directory brought a 100% speed-up mega punch!!! From 9.5 to 17.5it/s. You made an old man happy and smiling for the first time in two weeks!! I had already used this great trick successfully in the AUTOMATIC1111 webUI version but thought, for whatever reason, it wasn't possible on Linux. Now I must implement that xFormers thing and .... YEEEEAH!! :-D :-D
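For anyone wanting to reproduce the cuDNN swap: on Linux pip wheels, the cuDNN libraries PyTorch loads are bundled inside the package (typically `site-packages/torch/lib/libcudnn*`), and the trick is overwriting them with the files from a newer cuDNN release. A small helper to list what would be replaced (a sketch; exact paths vary by install):

```python
from pathlib import Path

def bundled_cudnn_libs(torch_lib_dir: str) -> list[str]:
    """List the cuDNN shared libraries shipped inside a torch install,
    i.e. the files one would overwrite with a newer cuDNN release."""
    return sorted(p.name for p in Path(torch_lib_dir).glob("libcudnn*"))

# On a real install the directory is typically:
#   os.path.join(os.path.dirname(torch.__file__), "lib")
```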
@C43H66N12O12S2, @NouamaneTazi Now at 25it/s! :-))) Still no xFormers. I only rebuilt the unet weights and added them to the pipeline. flax is similarly fast, by the way.
You can try the env variable without quotes, like this: TORCH_CUDA_ARCH_LIST=8.9 pip wheel -e .
If that fails as well, no idea.
@C43H66N12O12S2 No, it still did not work. But after a new setup I am at 28it/s, including Euler_a. Probably the PyTorch nightly (1.4.) gave an extra punch. Meanwhile I am not sure whether xFormers would really give a further improvement!? Is xFormers independent of the unet? Or is it, from a technical point of view, kind of another "version" of the unet implementation?
P.S.: Is it only my subjective impression, or does Euler (Euler_a) really produce significantly better results? I only create images with photorealistic scenes, so I cannot compare this scheduler with others in disciplines like painting or digital art.
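On the question of whether xFormers is independent of the UNet: memory-efficient attention is a drop-in replacement for the attention op inside the UNet's transformer blocks; it computes the same result, just without materializing the full attention matrix at once. A toy numpy sketch of that equivalence (simple query chunking, not xFormers' actual CUTLASS kernel):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Standard attention: materializes the full (len(q), len(k)) matrix."""
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def chunked_attention(q, k, v, chunk=4):
    """Same math, but only `chunk` rows of the attention matrix exist at
    any one time (the core idea behind memory-efficient attention)."""
    return np.concatenate([attention(q[i:i + chunk], k, v)
                           for i in range(0, len(q), chunk)])

rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, 16, 8))  # three (16, 8) arrays
print(np.allclose(attention(q, k, v), chunked_attention(q, k, v)))  # True
```

So xFormers does not change the UNet's weights or structure at all; it only swaps how the attention inside it is computed, which is why it can be toggled per pipeline.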
@Marcophono2 Can you give a breakdown of what you had to do to get this working for people lacking too much sleep?
But after a new setup I am at 28it/s including Euler_a. Probably the PyTorch Nightly (1.4.) gave an extra punch.
Sure, @XodrocSO. Aside from the fact that this repo meanwhile supports Euler too, I can simply tell you how I increased the performance of my 4090 (which is now a bit over 30it/s, for whatever reason, and 19.5it/s on a 3090). The most important thing is to update the cuDNN files. I would have to search for the description and the direct download link again, so let me ask first: are you talking about 4090 support under Linux? With anything other than a 4090 there is no need to update the cuDNN files. If you are on Windows with a 4090 you can also update the cuDNN files, but in that case the files are different ones.
@Marcophono2 Windows and 4090 basically, Thanks!
So sorry for the delay, @XodrocSO. The night before last I was clever enough to crash my Linux system after a totally useless and risky installation of a classifier into my SD environment, which overwrote a lot of packages and dependencies, so that my wonderfully optimized SD dropped from >30it/s to 1.5it/s. On a 4090. OH-MY-GOD! And of course I had no backup. Okay, a useful backup would have required a complete partition mirror. But I thought in the worst case I could simply repeat the steps I had successfully taken. Wrong! Obviously I had forgotten a lot of things: matching versions, the order of the installation steps, and when to install via cuda and when via pip. I am on Manjaro, so there are not many guides I could consult, and the Ubuntu setup does not work here. When I did some Google research I always found my own happy posts about my success, which I had destroyed in a moment of brainlessness. Anyway, after 18 hours I was able to set it up again. And of course I have documented every step now. :-)
But that's not the point. You are on Windows, so it is easier, because there are some good step-by-step how-tos in the SD webui repo. Yes, they are not for this SD here (without GUI), but I am sure they will do the necessary setup so that you can also use this SD afterwards. The point is that PyTorch still does not support Lovelace (4090/4080) in the default setup. The wheel built by @C43H66N12O12S2 is a wonderful help for Windows and injects the cuDNN libs into PyTorch. Please check the description from @sigglypuff: https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/4316#issuecomment-1304612278 Just note that he corrected one step in a follow-up post before you work through it too literally.
@Marcophono2 You say you're using Manjaro. The GNOME edition? If so, IIRC that uses zsh by default. Maybe that's why my command didn't work.
Try launching that command from bash, like this
bash
TORCH_CUDA_ARCH_LIST=8.9 pip wheel -e .
Interesting point, @C43H66N12O12S2. I have the KDE edition. I tried what you wrote but got this error output:
[...]
nvcc fatal : Failed to preprocess host compiler properties.
[5/5] c++ -MMD -MF /home/marc/Schreibtisch/AI/xformers/build/temp.linux-x86_64-cpython-310/home/marc/Schreibtisch/AI/xformers/third_party/flash-attention/csrc/flash_attn/fmha_api.o.d -pthread -B /home/anaconda3/envs/MII/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /home/anaconda3/envs/MII/include -fPIC -O2 -isystem /home/anaconda3/envs/MII/include -fPIC -I/home/marc/Schreibtisch/AI/xformers/third_party/flash-attention/csrc/flash_attn -I/home/marc/Schreibtisch/AI/xformers/third_party/flash-attention/csrc/flash_attn/src -I/home/marc/Schreibtisch/AI/xformers/third_party/cutlass/include -I/home/anaconda3/envs/MII/lib/python3.10/site-packages/torch/include -I/home/anaconda3/envs/MII/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/home/anaconda3/envs/MII/lib/python3.10/site-packages/torch/include/TH -I/home/anaconda3/envs/MII/lib/python3.10/site-packages/torch/include/THC -I/opt/cuda/include -I/home/anaconda3/envs/MII/include/python3.10 -c -c /home/marc/Schreibtisch/AI/xformers/third_party/flash-attention/csrc/flash_attn/fmha_api.cpp -o /home/marc/Schreibtisch/AI/xformers/build/temp.linux-x86_64-cpython-310/home/marc/Schreibtisch/AI/xformers/third_party/flash-attention/csrc/flash_attn/fmha_api.o -O3 -fopenmp -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C_flashattention -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
In file included from /home/marc/Schreibtisch/AI/xformers/third_party/flash-attention/csrc/flash_attn/src/fmha.h:41,
from /home/marc/Schreibtisch/AI/xformers/third_party/flash-attention/csrc/flash_attn/fmha_api.cpp:32:
/home/marc/Schreibtisch/AI/xformers/third_party/flash-attention/csrc/flash_attn/src/fmha_utils.h: In function 'void set_alpha(uint32_t&, float, Data_type)':
/home/marc/Schreibtisch/AI/xformers/third_party/flash-attention/csrc/flash_attn/src/fmha_utils.h:63:53: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
63 | alpha = reinterpret_cast<const uint32_t &>( h2 );
| ^~
/home/marc/Schreibtisch/AI/xformers/third_party/flash-attention/csrc/flash_attn/src/fmha_utils.h:68:53: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
68 | alpha = reinterpret_cast<const uint32_t &>( h2 );
| ^~
/home/marc/Schreibtisch/AI/xformers/third_party/flash-attention/csrc/flash_attn/src/fmha_utils.h:70:53: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
70 | alpha = reinterpret_cast<const uint32_t &>( norm );
| ^~~~
/home/marc/Schreibtisch/AI/xformers/third_party/flash-attention/csrc/flash_attn/fmha_api.cpp: In function 'void set_params_fprop(FMHA_fprop_params&, size_t, size_t, size_t, size_t, size_t, at::Tensor, at::Tensor, at::Tensor, void*, void*, void*, void*, void*, void*, float, float, bool)':
/home/marc/Schreibtisch/AI/xformers/third_party/flash-attention/csrc/flash_attn/fmha_api.cpp:62:11: warning: 'void* memset(void*, int, size_t)' clearing an object of non-trivial type 'struct FMHA_fprop_params'; use assignment or value-initialization instead [-Wclass-memaccess]
62 | memset(&params, 0, sizeof(params));
| ~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/marc/Schreibtisch/AI/xformers/third_party/flash-attention/csrc/flash_attn/src/fmha.h:74:8: note: 'struct FMHA_fprop_params' declared here
74 | struct FMHA_fprop_params : public Qkv_params {
| ^~~~~~~~~~~~~~~~~
/home/marc/Schreibtisch/AI/xformers/third_party/flash-attention/csrc/flash_attn/fmha_api.cpp:58:15: warning: unused variable 'acc_type' [-Wunused-variable]
58 | Data_type acc_type = DATA_TYPE_FP32;
| ^~~~~~~~
/home/marc/Schreibtisch/AI/xformers/third_party/flash-attention/csrc/flash_attn/fmha_api.cpp: In function 'std::vector<at::Tensor> mha_bwd_block(const at::Tensor&, const at::Tensor&, const at::Tensor&, const at::Tensor&, const at::Tensor&, const at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, const at::Tensor&, const at::Tensor&, const at::Tensor&, int, int, float, float, bool, c10::optional<at::Generator>)':
/home/marc/Schreibtisch/AI/xformers/third_party/flash-attention/csrc/flash_attn/fmha_api.cpp:597:10: warning: unused variable 'is_sm8x' [-Wunused-variable]
597 | bool is_sm8x = dprops->major == 8 && dprops->minor >= 0;
| ^~~~~~~
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1901, in _run_ninja_build
subprocess.run(
File "/home/anaconda3/envs/MII/lib/python3.10/subprocess.py", line 524, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<string>", line 2, in <module>
File "<pip-setuptools-caller>", line 34, in <module>
File "/home/marc/Schreibtisch/AI/xformers/setup.py", line 251, in <module>
setuptools.setup(
File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/setuptools/__init__.py", line 87, in setup
return distutils.core.setup(**attrs)
File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/setuptools/_distutils/core.py", line 185, in setup
return run_commands(dist)
File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/setuptools/_distutils/core.py", line 201, in run_commands
dist.run_commands()
File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 968, in run_commands
self.run_command(cmd)
File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/setuptools/dist.py", line 1217, in run_command
super().run_command(command)
File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 987, in run_command
cmd_obj.run()
File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/wheel/bdist_wheel.py", line 299, in run
self.run_command('build')
File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/setuptools/_distutils/cmd.py", line 319, in run_command
self.distribution.run_command(command)
File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/setuptools/dist.py", line 1217, in run_command
super().run_command(command)
File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 987, in run_command
cmd_obj.run()
File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/setuptools/_distutils/command/build.py", line 132, in run
self.run_command(cmd_name)
File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/setuptools/_distutils/cmd.py", line 319, in run_command
self.distribution.run_command(command)
File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/setuptools/dist.py", line 1217, in run_command
super().run_command(command)
File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 987, in run_command
cmd_obj.run()
File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/setuptools/command/build_ext.py", line 84, in run
_build_ext.run(self)
File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 346, in run
self.build_extensions()
File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 843, in build_extensions
build_ext.build_extensions(self)
File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 466, in build_extensions
self._build_extensions_serial()
File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 492, in _build_extensions_serial
self.build_extension(ext)
File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/setuptools/command/build_ext.py", line 246, in build_extension
_build_ext.build_extension(self, ext)
File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 547, in build_extension
objects = self.compiler.compile(
File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 658, in unix_wrap_ninja_compile
_write_ninja_file_and_compile_objects(
File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1573, in _write_ninja_file_and_compile_objects
_run_ninja_build(
File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1917, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error compiling objects for extension
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for xformers
Running setup.py clean for xformers
Failed to build xformers
ERROR: Failed to build one or more wheels
Ouch @Marcophono2 , sounds like quite the headache! Thanks for the info!
Eureka! Finally I got it!! :-) Thank you again for your valuable input, @C43H66N12O12S2! I switched back to Ubuntu and was able to compile xFormers for the sm_89 CUDA cores there. Now my 4090 is running at an impressive 42it/s! I cannot stop watching it! As I have two of them, plus a 3090 as a third card in my computer, I have no understanding of the sorrows of others here in Germany facing a cold winter while the energy supply is not guaranteed. My room is warm within a few minutes. ROFL! I must admit that the problem was probably not Manjaro but my own idiocy: I simply overlooked the fact that I have to add
pipe.enable_xformers_memory_efficient_attention()
into my Python code. Grrrr!!! But never mind! I learned a lot in the time I viewed hundreds of files to follow the phantasmal problem.
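For anyone retracing this on another machine, here is a minimal sketch of the build environment I believe is involved, based on the build flags mentioned in this thread; the `8.9` value targets Lovelace (sm_89). The pip build itself is left commented out in the sketch because it is long-running:

```shell
# Sketch only: environment for building xformers from source for Lovelace (sm_89).
# Assumes the CUDA 11.8 toolkit and a matching PyTorch are already installed.
export TORCH_CUDA_ARCH_LIST="8.9"   # "8.6" would target the 3090 (Ampere) instead
export FORCE_CUDA="1"
echo "building xformers for arch: $TORCH_CUDA_ARCH_LIST"

# The actual build (commented out here; run it once the env vars are set):
# pip install --no-clean git+https://github.com/facebookresearch/xformers#egg=xformers
```

Note that older PyTorch releases may not accept "8.9" in `TORCH_CUDA_ARCH_LIST` without the `torch/utils/cpp_extension.py` patch mentioned earlier in this thread.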
But one thing is a bit strange: while my 4090s perform at 42 it/s, my 3090 is only at 20 it/s. That does not reflect the pure hardware power; the 3090 should be at 28 it/s. At least before my final tuning, the gap between the 3090 and the 4090 was always about 50%. Or the other way round: the 3090 was at 67% of the 4090. But why is the 3090 now at less than 50%? Any ideas? For the 3090 I set up another conda environment where I compiled xFormers with the standard settings, without replacing the cudnn files.
Best regards Marc
Great work!
You should test both with replaced cuDNN (8.6.0), as it'll improve performance even with the 3090 (to a lesser degree compared with the 4090.)
However, Lovelace is simply a better architecture, with significant upgrades to the entire core (L2, Tensor cores, even the shader cores), so it wouldn't surprise me to witness the 4090 punching above its weight.
Interesting! Also cc @pcuenca here FYI
@Marcophono2 I'm having the same error:
597 | bool is_sm8x = dprops->major == 8 && dprops->minor >= 0;
But it's while trying to compile xformers for the Stable Diffusion v2. Where did you add the line?
pipe.enable_xformers_memory_efficient_attention()
Thanks for any help you can give.
@richard-schwab Directly before the pipeline creates the image. I am now on the wrong computer. If needed, I will copy you the related part later.
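In the meantime, a hedged sketch of where that call typically sits in a diffusers script; the model id, dtype, and prompt below are placeholders of my own, not taken from this thread:

```python
# Hypothetical sketch: enabling xFormers attention on a diffusers pipeline.
# The model id and dtype are assumptions; adapt them to your own setup.

def build_pipeline(model_id: str = "runwayml/stable-diffusion-v1-5"):
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        model_id, torch_dtype=torch.float16
    ).to("cuda")
    # The line in question: enable it after moving to the GPU and
    # before any image is generated.
    pipe.enable_xformers_memory_efficient_attention()
    return pipe

# Usage (requires a CUDA GPU and the diffusers/xformers packages):
# pipe = build_pipeline()
# image = pipe("an astronaut riding a horse").images[0]
```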
Maybe also @anton-l @pcuenca @NouamaneTazi if you have any tips / hints here
Hi @Marcophono2! Interesting comments. My 3090 runs at about 20 it/s too (with xFormers efficient attention enabled). It gets a bit better as you increase the batch size, up to ~26 it/s for a large batch of 24 images at once. I'm surprised that your 4090 is capable of performing inference twice as fast; I'd be really happy about that! I don't have a 4090, but we may be able to test on one soon.
Could you mention how you built for the 3090? I'm still getting "efficient_attention_forward_cutlass" errors even after trying all kinds of builds.
%env TORCH_CUDA_ARCH_LIST = "8.6"
%env FORCE_CUDA = "1"
%env CUDA_VISIBLE_DEVICES = 0
%pip install --no-clean git+https://github.com/facebookresearch/xformers#egg=xformers
Hi @Marcophono2 could you please detail the steps that you've followed to get the 17 it/s? I have a 4090 and I am at 10 it/s as well.
CUDA 12.0 Driver 525 Ubuntu 22.04
@harishprabhala 17 it/s? That was a long time ago; I have been at 42 it/s for a few weeks now. :) I also wrote that here some posts later. I will come back to you later today.
@harishprabhala 42 it/s on SD 1.5. Running on SD2.1 at 768x768px I am at 20 it/s.
@Marcophono2 I am curious as to how to got to 24 it/s without xformers. I am only interested in PyTorch performance. with voltaML (TensorRT) I am getting 84 it/s :)
@harishprabhala
with voltaML (TensorRT) I am getting 84 it/s :)
WHAT?? And I thought I was the champ here in this repository with my 42 it/s. :) So your idea is that you could improve your performance by 70%? Related to my 17 it/s vs. your 10 it/s without xformers. It seems my software is out of date. I have the 520.61.05 GeForce driver (which is still the latest one, as I installed the studio driver while you installed the game-ready driver; maybe this is the reason). Also, I did not know that CUDA 12 is already available. I still have 11.8.
What finally brought me the punch from 9.5 to 17 it/s (without xformers) was to replace the cudnn files in /home/{user}/anaconda3/envs/{envname}/lib/python3.10/site-packages/torch/lib
In my case I used a package named "cudnn-linux-x86_64-8.6.0.163_cuda11-archive", but as you have a newer CUDA version I think your cudnn version is newer too. I have 11.8.89 (nvcc -V). What is your version?
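As a hedged illustration of the library swap described above, the sketch below mimics the copy step against throwaway directories so it is safe to run as-is. The real paths are assumptions to adapt: the source is the `lib/` directory of the extracted cuDNN archive, and the destination is `~/anaconda3/envs/<env>/lib/python3.10/site-packages/torch/lib`:

```shell
# Sketch: overwrite the cuDNN libraries that PyTorch bundles with newer ones.
# Demonstrated on temporary stand-in directories; substitute the real paths.
SRC=$(mktemp -d)   # stands in for cudnn-linux-x86_64-8.6.0.163_cuda11-archive/lib
DST=$(mktemp -d)   # stands in for .../site-packages/torch/lib

# Fake files so the sketch runs; in reality these already exist.
touch "$SRC/libcudnn.so.8" "$SRC/libcudnn_ops_infer.so.8"
touch "$DST/libcudnn.so.8"              # the older copy shipped with torch

cp -f "$SRC"/libcudnn*.so.8 "$DST"/     # overwrite torch's bundled copies
ls "$DST"
```

With the real paths you would also back up the original files first, since a mismatched cuDNN can break the torch install.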
I'm also interested, as the investment in a 4090 has not paid off yet. I've successfully compiled xformers with PyTorch 1.13.1 cu117, but single-image SD 1.4 at 512x512 is 11.5 it/s.
I replaced the lib files and got up to 20 it/s, but nowhere near 40. This is for a single image.
[+] bitsandbytes version 0.35.0 installed.
[+] diffusers version 0.10.2 installed.
[+] transformers version 4.25.1 installed.
[+] xformers version 0.0.15.dev0+103e863.d20221120 installed.
[+] torch version 1.13.1+cu117 installed.
[+] torchvision version 0.14.1+cu117 installed.
+-------------------------------+----------------------+----------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA Graphics...  On   | 00000000:01:00.0 Off |                  Off |
| 30%   39C    P8    25W / 450W |   3839MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1173      G   /usr/lib/xorg/Xorg                 13MiB |
|    0   N/A  N/A      1371      G   /usr/bin/gnome-shell               12MiB |
|    0   N/A  N/A   4147066      C   python3                          3808MiB |
+-----------------------------------------------------------------------------+
@sile16 Hi Matt! I think you did not compile xformers correctly against CUDA 11.8. Can you first tell me the value you get by entering
nvcc -V
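For anyone comparing toolkit versions, the release number can be pulled out of the `nvcc -V` output mechanically. A small stdlib-only helper; the sample string mirrors the toolkit version reported earlier in this thread:

```python
import re

def cuda_version(nvcc_output: str) -> str:
    """Extract the CUDA release number (e.g. '11.8') from `nvcc -V` output."""
    m = re.search(r"release (\d+\.\d+)", nvcc_output)
    if m is None:
        raise ValueError("no CUDA release found in nvcc output")
    return m.group(1)

# Sample output matching the V11.8.89 toolkit mentioned above:
sample = (
    "nvcc: NVIDIA (R) Cuda compiler driver\n"
    "Cuda compilation tools, release 11.8, V11.8.89\n"
    "Build cuda_11.8.r11.8/compiler.31833905_0\n"
)
print(cuda_version(sample))  # -> 11.8
```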
@sile16 Replacing the lib files brings one improvement for the direct calculations. Compiling xformers against CUDA 11.8 brings a second punch. CUDA 12 is available, but I have no experience with it together with xformers yet. Tomorrow I get the last component for a second server. Then I will set up a system with Nvidia's game-ready driver, which has a higher version number and also CUDA 12.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Describe the bug
Hello!
For 10 days now, nearly round the clock, I have been trying to get my brand new and proudly owned GeForce RTX 4090 graphics card to work properly with Stable Diffusion. But finally, 10 days later at least, it is still around 50% below its potential.
In those 240 hours I changed from Ubuntu to Manjaro (and from Manjaro back to Ubuntu, via Pop!_OS back again to Ubuntu, and on to Manjaro Nightly, which contains all the Nvidia support, more or less working). Ubuntu absolutely would not let me bring together 22.10, 22.04 or 20.04 with my AMD hardware
Threadripper Pro 3955WX ASUS PRO WS WRX80E-SAGE
and my graphic card RTX 4090.
Yes, really, it was not the 4090. It was the mainboard and the CPU that made the big trouble since one of the newer Ubuntu versions. Or, the other way round: Ubuntu is the (damn) troublemaker. After about 50 re-installations I replaced the 4090 with a GeForce 2070, started from scratch and found myself again (and again) in the same position: yelling and cursing! Still the same issues.
Meanwhile, yes better now, I could bring together Manjaro with CUDA 11.8, Nvidia Driver Version: 520.56.06, Cuda compilation tools 11.8, V11.8.89, (build cuda_11.8.r11.8/compiler.31833905_0)
and used the nightly PyTorch version 1.13
Benchmark results:
with RTX 3090 (512x512) standard, fp16 12.7 it/s
with RTX 3090 (512x512) fp16, prepared unet optimization 14.9 it/s
with RTX 4090 (512x512) fp16, with or without optimization 11.5 it/s
So, at the end (of my long issue description) there is still the question: Why is SD, by all software and driver support so weak on a RTX 4090 compared to a RTX 3090?
I know that xFormers has lately been an impressive performance boost and advantage. But I excluded xFormers in my benchmarks.
Can anyone help me? I am quite frustrated by now. If someone can help me fix this missing link between SD, Nvidia support, PyTorch and my hardware, I will be generous.
Best regards Marc
Reproduction
No response
Logs
No response
System Info
Manjaro Nightly
Threadripper Pro 3955WX ASUS PRO WS WRX80E-SAGE
CUDA 11.8, Nvidia Driver Version: 520.56.06, Cuda compilation tools 11.8, V11.8.89, (build cuda_11.8.r11.8/compiler.31833905_0)
nightly PyTorch version 1.13