Marcophono2 closed this issue 1 year ago
Uff, I don't know the details of GPU hardware well enough here, sadly. @NouamaneTazi, do you have a hunch maybe? :-)
@NouamaneTazi , do you have a hunch? :)
I've never worked with PyTorch 1.13 or CUDA 11.8 before. Do the 3 benchmarks use the same environment, @Marcophono2?
Yes, they did, @NouamaneTazi. Meanwhile I found out that the repo https://github.com/AUTOMATIC1111/stable-diffusion-webui offers technical support for the 4090. I can produce a 640x640px image at 22it/s. I used this unusual size because with 512x512 my GPU utilization is only 60%. There is no other bottleneck, so I could enlarge the size until utilization is nearly 100% without losing speed. So the performance is great, but it is a Windows-only solution and at the moment only with a web UI. Not so good for command-line-based processing on a multi-GPU platform.
I'm afraid I can't help much myself, as I don't have access to any Lovelace GPU. It seems that updating cuDNN should help speed up inference. I would recommend you follow these threads: https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/2449 and https://github.com/AUTOMATIC1111/stable-diffusion-webui/discussions/2537. The same should apply to diffusers.
@NouamaneTazi, yes, I know those threads and followed every detail. But in the end I was not able to build it in the same, or a similar, way for Linux. No problem that you don't have access to a Lovelace GPU: feel free to use my vast.ai account and book a 4090 instance there. Right at this moment I am using one. My billing account there is well funded, so you can use a 4090 instance around the clock for days. Just send me a message if you are interested and I'll send you my login data.
@NouamaneTazi I have lost so much time searching for a solution that it would be a great pleasure to pay you for your work if you are successful. (But I need it by Monday at the latest, sorry :-)
@Marcophono2 To build xformers for Lovelace, you need to modify torch/utils/cpp_extension.py to include CUDA arch "8.9"
PyTorch 1.13 regressed performance on my machine, so you may be losing performance there.
@C43H66N12O12S2 Interesting! But I think more is necessary than adding
('Lovelace', '8.9+PTX'),
and
supported_arches = ['3.5', '3.7', '5.0', '5.2', '5.3', '6.0', '6.1', '6.2',
                    '7.0', '7.2', '7.5', '8.0', '8.6', '8.9']
:) But what exactly? I installed everything again and am staying with PyTorch 1.12 now, as you recommended.
After you modify the file, set the TORCH_CUDA_ARCH_LIST="8.9" environment variable.
This is how I compile my Windows wheels.
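For reference, `TORCH_CUDA_ARCH_LIST` is a semicolon- or space-separated list of compute capabilities that `torch.utils.cpp_extension` turns into nvcc `-gencode` flags. A simplified sketch of that mapping (my own re-implementation for illustration, not PyTorch's actual code):

```python
def arch_list_to_gencode(arch_list: str) -> list[str]:
    """Map a TORCH_CUDA_ARCH_LIST-style string (e.g. "8.6;8.9+PTX")
    to nvcc -gencode flags, roughly as cpp_extension does."""
    flags = []
    for arch in arch_list.replace(" ", ";").split(";"):
        ptx = arch.endswith("+PTX")          # "+PTX" also embeds forward-compatible PTX
        num = arch.removesuffix("+PTX").replace(".", "")
        flags.append(f"-gencode=arch=compute_{num},code=sm_{num}")
        if ptx:
            flags.append(f"-gencode=arch=compute_{num},code=compute_{num}")
    return flags

print(arch_list_to_gencode("8.9"))
# ['-gencode=arch=compute_89,code=sm_89']
```

So "8.9" alone produces native sm_89 SASS only, while "8.9+PTX" additionally embeds PTX that future architectures can JIT-compile.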
@C43H66N12O12S2 Okay. But shouldn't it be the other way round? PyTorch must be recompiled with that added environment parameter, is that correct? But then the cpp_extension.py file would be overwritten.
@C43H66N12O12S2 And I also think I need CUDA 11.8 for the SD project then, or not?
No, PyTorch (and the official releases) is fine. Modifying cpp_extension.py is necessary because PyTorch hard-blocks any CUDA arch not on its list.
You need the 11.8 nvcc to compile for CUDA arch 8.9, yes. Not for inference.
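That nvcc requirement can be phrased as a simple version check; the arch-to-toolkit table below is my own summary (sm_89 support landed in CUDA 11.8), not something from this thread:

```python
# Minimum CUDA toolkit whose nvcc can emit SASS for a given compute
# capability (assumed table, not exhaustive).
MIN_CUDA_FOR_ARCH = {"8.0": "11.0", "8.6": "11.1", "8.9": "11.8", "9.0": "11.8"}

def nvcc_can_target(nvcc_version: str, arch: str) -> bool:
    """True if the given nvcc release can compile for the given arch."""
    need = MIN_CUDA_FOR_ARCH[arch]
    return tuple(map(int, nvcc_version.split("."))) >= tuple(map(int, need.split(".")))

print(nvcc_can_target("11.8", "8.9"))  # True
print(nvcc_can_target("11.7", "8.9"))  # False
```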
@C43H66N12O12S2 Okay. That sounds easier than expected. :) Can you tell me where in the environment I have to add it? Or how to add it to a command?
In Linux, TORCH_CUDA_ARCH_LIST="8.9" pip wheel -e .
inside the cloned xformers repo should work.
Great, thanks a lot, @C43H66N12O12S2! I have a good feeling that this will bring me a big step forward! :+1:
@C43H66N12O12S2 I was too optimistic. I think I did everything correctly (not really sure, of course), but I get a large error output.
My setup:
I installed the branch from @MatthieuTPHR -> https://github.com/MatthieuTPHR/diffusers/archive/refs/heads/memory_efficient_attention.zip
and I installed xformers as you mentioned. If I then start a little test program
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", revision="fp16", torch_dtype=torch.float16, use_auth_token="hf_LFWSneVmdLYPKbkIRpCrCKxxx",
).to("cuda")

with torch.inference_mode(), torch.autocast("cuda"):
    image = pipe("a small cat")
with
USE_MEMORY_EFFICIENT_ATTENTION=1 python test.py
I receive the following long error text. attention.py correctly detects that xformers is present. Any ideas what could be wrong? If I only start with
python test.py
the image is created, but at less than 10it/s. A bit weak for a 4090. I also noticed that my GPU memory is always around 22-23 GB occupied and utilization is at 99%.
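For context, the `USE_MEMORY_EFFICIENT_ATTENTION=1` switch in that branch is just an environment flag read at startup; a minimal sketch of that dispatch pattern (simplified, and not the branch's actual code):

```python
import os

def pick_attention_backend(env: dict) -> str:
    """Return which attention path would be used, mimicking (in simplified
    form) how a branch can gate on USE_MEMORY_EFFICIENT_ATTENTION=1."""
    if env.get("USE_MEMORY_EFFICIENT_ATTENTION") == "1":
        try:
            import xformers.ops  # noqa: F401  (only importable if the wheel built)
            return "xformers"
        except ImportError:
            return "xformers requested but not importable"
    return "standard"

print(pick_attention_backend(os.environ))
```

Note that a successful import does not guarantee the compiled op supports your GPU arch, which is exactly the failure mode in the traceback below: the Python module loads, but the CUDA kernel for this backend was never built.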
~/Schreibtisch/AI $ USE_MEMORY_EFFICIENT_ATTENTION=1 python test.py
xformers is present
Downloading: 100% (model, tokenizer, and config files)
Fetching 16 files: 100% | 16/16 [00:50<00:00, 3.13s/it]
0%| | 0/51 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/home/marc/Schreibtisch/AI/test.py", line 10, in <module>
image = pipe("a small cat")
File "/home/anaconda3/envs/SD/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/marc/Schreibtisch/AI/diffusers-memory_efficient_attention/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 326, in __call__
noise_pred = self.unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample
File "/home/anaconda3/envs/SD/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/marc/Schreibtisch/AI/diffusers-memory_efficient_attention/src/diffusers/models/unet_2d_condition.py", line 296, in forward
sample, res_samples = downsample_block(
File "/home/anaconda3/envs/SD/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/marc/Schreibtisch/AI/diffusers-memory_efficient_attention/src/diffusers/models/unet_2d_blocks.py", line 563, in forward
hidden_states = attn(hidden_states, context=encoder_hidden_states)
File "/home/anaconda3/envs/SD/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/marc/Schreibtisch/AI/diffusers-memory_efficient_attention/src/diffusers/models/attention.py", line 187, in forward
hidden_states = block(hidden_states, context=context)
File "/home/anaconda3/envs/SD/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/marc/Schreibtisch/AI/diffusers-memory_efficient_attention/src/diffusers/models/attention.py", line 236, in forward
hidden_states = self.attn1(self.norm1(hidden_states)) + hidden_states
File "/home/anaconda3/envs/SD/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/marc/Schreibtisch/AI/diffusers-memory_efficient_attention/src/diffusers/models/attention.py", line 275, in forward
out = xformers.ops.memory_efficient_attention(q, k, v, attn_bias=None, op=self.attention_op)
File "/home/marc/Schreibtisch/AI/xformers/xformers/ops.py", line 862, in memory_efficient_attention
return op.forward_no_grad(
File "/home/marc/Schreibtisch/AI/xformers/xformers/ops.py", line 305, in forward_no_grad
return cls.FORWARD_OPERATOR(
File "/home/anaconda3/envs/SD/lib/python3.9/site-packages/torch/_ops.py", line 143, in __call__
return self._op(*args, **kwargs or {})
NotImplementedError: Could not run 'xformers::efficient_attention_forward_cutlass' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'xformers::efficient_attention_forward_cutlass' is only available for these backends: [UNKNOWN_TENSOR_TYPE_ID, QuantizedXPU, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, SparseCPU, SparseCUDA, SparseHIP, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, SparseVE, UNKNOWN_TENSOR_TYPE_ID, NestedTensorCUDA, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID].
BackendSelect: fallthrough registered at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/core/BackendSelectFallbackKernel.cpp:3 [backend fallback]
Python: registered at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/core/PythonFallbackKernel.cpp:133 [backend fallback]
Named: registered at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/core/NamedRegistrations.cpp:7 [backend fallback]
Conjugate: registered at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/ConjugateFallback.cpp:18 [backend fallback]
Negative: registered at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/NegateFallback.cpp:18 [backend fallback]
ZeroTensor: registered at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/ZeroTensorFallback.cpp:86 [backend fallback]
ADInplaceOrView: fallthrough registered at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/core/VariableFallbackKernel.cpp:64 [backend fallback]
AutogradOther: fallthrough registered at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/core/VariableFallbackKernel.cpp:35 [backend fallback]
AutogradCPU: fallthrough registered at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/core/VariableFallbackKernel.cpp:39 [backend fallback]
AutogradCUDA: fallthrough registered at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/core/VariableFallbackKernel.cpp:47 [backend fallback]
AutogradXLA: fallthrough registered at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/core/VariableFallbackKernel.cpp:51 [backend fallback]
AutogradMPS: fallthrough registered at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/core/VariableFallbackKernel.cpp:59 [backend fallback]
AutogradXPU: fallthrough registered at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/core/VariableFallbackKernel.cpp:43 [backend fallback]
AutogradHPU: fallthrough registered at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/core/VariableFallbackKernel.cpp:68 [backend fallback]
AutogradLazy: fallthrough registered at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/core/VariableFallbackKernel.cpp:55 [backend fallback]
Tracer: registered at /opt/conda/conda-bld/pytorch_1659484806139/work/torch/csrc/autograd/TraceTypeManual.cpp:295 [backend fallback]
AutocastCPU: fallthrough registered at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/autocast_mode.cpp:481 [backend fallback]
Autocast: fallthrough registered at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/autocast_mode.cpp:324 [backend fallback]
Batched: registered at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/BatchingRegistrations.cpp:1064 [backend fallback]
VmapMode: fallthrough registered at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/VmapModeRegistrations.cpp:33 [backend fallback]
Functionalize: registered at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/FunctionalizeFallbackKernel.cpp:89 [backend fallback]
PythonTLSSnapshot: registered at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/core/PythonFallbackKernel.cpp:137 [backend fallback]
It looks like you made some errors while compiling, and the resulting xformers lacks any SASS code for 8.9.
As for performance issues with the 4090, you could try following my advice inside the thread posted earlier by Nouamane.
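One way to confirm that diagnosis is to run `cuobjdump --list-elf` on the built extension (e.g. the `_C*.so` inside the xformers install) and check whether any `sm_89` cubin is listed. A small parser sketch over illustrative output (the sample text below is made up for demonstration):

```python
import re

def sass_archs(cuobjdump_output: str) -> set[str]:
    """Extract the sm_XX architectures present in `cuobjdump --list-elf` output."""
    return set(re.findall(r"sm_(\d+)", cuobjdump_output))

# Illustrative output for a wheel built WITHOUT Lovelace support:
sample = """\
ELF file    1: _C.1.sm_80.cubin
ELF file    2: _C.2.sm_86.cubin
"""
print("89" in sass_archs(sample))  # False: no sm_89 SASS was compiled in
```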
To be honest, I do not know where I went wrong. I would really be happy if you can spot something:
~ $ nvidia-smi
Mon Oct 31 01:14:36 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.56.06 Driver Version: 520.56.06 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:2D:00.0 On | Off |
| 0% 39C P8 35W / 450W | 447MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 705 G /usr/lib/Xorg 210MiB |
| 0 N/A N/A 873 G /usr/bin/kwin_x11 46MiB |
| 0 N/A N/A 892 G /usr/bin/plasmashell 57MiB |
| 0 N/A N/A 1347 G /usr/lib/firefox/firefox 126MiB |
+-----------------------------------------------------------------------------+
~ $ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
~/Schreibtisch/AI/xformers $ TORCH_CUDA_ARCH_LIST="8.9" pip wheel -e .
Obtaining file:///home/marc/Schreibtisch/AI/xformers
Preparing metadata (setup.py) ... done
Collecting torch>=1.12
File was already downloaded /home/marc/Schreibtisch/AI/xformers/torch-1.13.0-cp39-cp39-manylinux1_x86_64.whl
Collecting numpy
File was already downloaded /home/marc/Schreibtisch/AI/xformers/numpy-1.23.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Collecting pyre-extensions==0.0.23
File was already downloaded /home/marc/Schreibtisch/AI/xformers/pyre_extensions-0.0.23-py3-none-any.whl
Collecting typing-extensions
File was already downloaded /home/marc/Schreibtisch/AI/xformers/typing_extensions-4.4.0-py3-none-any.whl
Collecting typing-inspect
File was already downloaded /home/marc/Schreibtisch/AI/xformers/typing_inspect-0.8.0-py3-none-any.whl
Collecting nvidia-cudnn-cu11==8.5.0.96
File was already downloaded /home/marc/Schreibtisch/AI/xformers/nvidia_cudnn_cu11-8.5.0.96-2-py3-none-manylinux1_x86_64.whl
Collecting nvidia-cublas-cu11==11.10.3.66
File was already downloaded /home/marc/Schreibtisch/AI/xformers/nvidia_cublas_cu11-11.10.3.66-py3-none-manylinux1_x86_64.whl
Collecting nvidia-cuda-nvrtc-cu11==11.7.99
File was already downloaded /home/marc/Schreibtisch/AI/xformers/nvidia_cuda_nvrtc_cu11-11.7.99-2-py3-none-manylinux1_x86_64.whl
Collecting nvidia-cuda-runtime-cu11==11.7.99
File was already downloaded /home/marc/Schreibtisch/AI/xformers/nvidia_cuda_runtime_cu11-11.7.99-py3-none-manylinux1_x86_64.whl
Collecting wheel
File was already downloaded /home/marc/Schreibtisch/AI/xformers/wheel-0.37.1-py2.py3-none-any.whl
Collecting setuptools
File was already downloaded /home/marc/Schreibtisch/AI/xformers/setuptools-65.5.0-py3-none-any.whl
Collecting mypy-extensions>=0.3.0
File was already downloaded /home/marc/Schreibtisch/AI/xformers/mypy_extensions-0.4.3-py2.py3-none-any.whl
Building wheels for collected packages: xformers
Building wheel for xformers (setup.py) ... done
Created wheel for xformers: filename=xformers-0.0.14.dev0-cp39-cp39-linux_x86_64.whl size=34465759 sha256=6b285b6d9a37c887a8154cc1f00f7291e13dc6eb9b926c8bca7b64cc62607eca
Stored in directory: /tmp/pip-ephem-wheel-cache-gnqnuxhl/wheels/f6/c7/73/63c154ea45fb20e7eec4f956dfb9c91be386a33afb31b7c359
Successfully built xformers
Yes, I will check again what @NouamaneTazi suggested. I thought that way was not an option because there are some Windows DLLs involved.
@C43H66N12O12S2, @NouamaneTazi DAMN!! Just replacing the cuDNN files in the torch lib directory brought a 100% speed-up mega punch!!! From 9.5 to 17.5it/s. You made an old man happy and smiling for the first time in two weeks!! I had already used this great trick successfully in the AUTOMATIC1111 webUI version but thought, for whatever reason, it wasn't possible on Linux. Now I must implement that xFormers thing and .... YEEEEAH!! :-D :-D
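For anyone wanting to reproduce the cuDNN swap: on Linux pip wheels, the cuDNN libraries PyTorch loads are bundled inside the package (typically `site-packages/torch/lib/libcudnn*`), and the trick is overwriting them with the files from a newer cuDNN release. A small helper to list what would be replaced (a sketch; exact paths vary by install):

```python
from pathlib import Path

def bundled_cudnn_libs(torch_lib_dir: str) -> list[str]:
    """List the cuDNN shared libraries shipped inside a torch install,
    i.e. the files one would overwrite with a newer cuDNN release."""
    return sorted(p.name for p in Path(torch_lib_dir).glob("libcudnn*"))

# On a real install the directory is typically:
#   os.path.join(os.path.dirname(torch.__file__), "lib")
```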
@C43H66N12O12S2, @NouamaneTazi Now at 25it/s! :-))) Still no xFormers. I only rebuilt the unet weights and added them to the pipeline. flax is similarly fast, by the way.
You can try the env variable without quotes, like this: TORCH_CUDA_ARCH_LIST=8.9 pip wheel -e .
If that fails as well, no idea.
@C43H66N12O12S2 No, it still did not work. But after a new setup I am at 28it/s, including Euler_a. Probably the PyTorch nightly (1.4.) gave an extra punch. Meanwhile I am not sure whether xFormers would really give a further improvement!? Is xFormers independent of the unet? Or is it, from a technical point of view, kind of another "version" of the unet implementation?
P.S.: Is it only my subjective impression, or does Euler (Euler_a) really produce significantly better results? I only create images with photorealistic scenes, so I cannot compare this scheduler with others in disciplines like painting or digital art.
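On the question of whether xFormers is independent of the UNet: memory-efficient attention is a drop-in replacement for the attention op inside the UNet's transformer blocks; it computes the same result, just without materializing the full attention matrix at once. A toy numpy sketch of that equivalence (simple query chunking, not xFormers' actual CUTLASS kernel):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Standard attention: materializes the full (len(q), len(k)) matrix."""
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def chunked_attention(q, k, v, chunk=4):
    """Same math, but only `chunk` rows of the attention matrix exist at
    any one time (the core idea behind memory-efficient attention)."""
    return np.concatenate([attention(q[i:i + chunk], k, v)
                           for i in range(0, len(q), chunk)])

rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, 16, 8))  # three (16, 8) arrays
print(np.allclose(attention(q, k, v), chunked_attention(q, k, v)))  # True
```

So xFormers does not change the UNet's weights or structure at all; it only swaps how the attention inside it is computed, which is why it can be toggled per pipeline.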
@Marcophono2 Can you give a breakdown of what you had to do to get this working for people lacking too much sleep?
But after a new setup I am at 28it/s including Euler_a. Probably the PyTorch Nightly (1.4.) gave an extra punch.
Sure, @XodrocSO. Aside from the fact that this repo meanwhile supports Euler too, I can simply tell you how I increased the performance of my 4090 (which is now a bit over 30it/s, for whatever reason, and 19.5it/s on a 3090). The most important thing is to update the cuDNN files. I would have to search for the description and the direct download link again, so let me ask first: are you talking about 4090 support under Linux? With anything other than a 4090 there is no need to update the cuDNN files. If you are on Windows with a 4090 you can also update the cuDNN files, but in that case the files are different ones.
@Marcophono2 Windows and 4090 basically, Thanks!
So sorry for the delay, @XodrocSO. The night before last I was clever enough to crash my Linux system after a totally useless and risky installation of a classifier into my SD environment, which overwrote a lot of packages and dependencies, so that my wonderfully optimized SD dropped from >30it/s to 1.5it/s. On a 4090. OH-MY-GOD! And of course I had no backup. Okay, a useful backup would have required a complete partition mirror. But I thought in the worst case I could simply repeat the steps I had successfully taken. Wrong! Obviously I had forgotten a lot of things: matching versions, the order of the installation steps, and when to install via cuda and when via pip. I am on Manjaro, so there are not many guides I could consult, and the Ubuntu setup does not work here. When I did some Google research I always found my own happy posts about my success, which I had destroyed in a moment of brainlessness. Anyway, after 18 hours I was able to set it up again. And of course I have documented every step now. :-)
But that's not the point. You are on Windows, so it is easier, because there are some good step-by-step how-tos in the SD webui repo. Yes, they are not for this SD here (without GUI), but I am sure they will do the necessary setup so that you can also use this SD afterwards. The point is that PyTorch still does not support Lovelace (4090/4080) in the default setup. The wheel built by @C43H66N12O12S2 is a wonderful help for Windows and injects the cuDNN libs into PyTorch. Please check the description from @sigglypuff: https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/4316#issuecomment-1304612278 Just note that he corrected one step in a follow-up post before you work through it too literally.
@Marcophono2 You say you're using Manjaro. The GNOME edition? If so, IIRC that uses zsh by default. Maybe that's why my command didn't work.
Try launching that command from bash, like this
bash
TORCH_CUDA_ARCH_LIST=8.9 pip wheel -e .
Interesting point, @C43H66N12O12S2. I have the KDE edition. I tried what you wrote but got this error output:
[...]
nvcc fatal : Failed to preprocess host compiler properties.
[5/5] c++ -MMD -MF /home/marc/Schreibtisch/AI/xformers/build/temp.linux-x86_64-cpython-310/home/marc/Schreibtisch/AI/xformers/third_party/flash-attention/csrc/flash_attn/fmha_api.o.d -pthread -B /home/anaconda3/envs/MII/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /home/anaconda3/envs/MII/include -fPIC -O2 -isystem /home/anaconda3/envs/MII/include -fPIC -I/home/marc/Schreibtisch/AI/xformers/third_party/flash-attention/csrc/flash_attn -I/home/marc/Schreibtisch/AI/xformers/third_party/flash-attention/csrc/flash_attn/src -I/home/marc/Schreibtisch/AI/xformers/third_party/cutlass/include -I/home/anaconda3/envs/MII/lib/python3.10/site-packages/torch/include -I/home/anaconda3/envs/MII/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/home/anaconda3/envs/MII/lib/python3.10/site-packages/torch/include/TH -I/home/anaconda3/envs/MII/lib/python3.10/site-packages/torch/include/THC -I/opt/cuda/include -I/home/anaconda3/envs/MII/include/python3.10 -c -c /home/marc/Schreibtisch/AI/xformers/third_party/flash-attention/csrc/flash_attn/fmha_api.cpp -o /home/marc/Schreibtisch/AI/xformers/build/temp.linux-x86_64-cpython-310/home/marc/Schreibtisch/AI/xformers/third_party/flash-attention/csrc/flash_attn/fmha_api.o -O3 -fopenmp -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C_flashattention -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
In file included from /home/marc/Schreibtisch/AI/xformers/third_party/flash-attention/csrc/flash_attn/src/fmha.h:41,
from /home/marc/Schreibtisch/AI/xformers/third_party/flash-attention/csrc/flash_attn/fmha_api.cpp:32:
/home/marc/Schreibtisch/AI/xformers/third_party/flash-attention/csrc/flash_attn/src/fmha_utils.h: In function 'void set_alpha(uint32_t&, float, Data_type)':
/home/marc/Schreibtisch/AI/xformers/third_party/flash-attention/csrc/flash_attn/src/fmha_utils.h:63:53: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
63 | alpha = reinterpret_cast<const uint32_t &>( h2 );
| ^~
/home/marc/Schreibtisch/AI/xformers/third_party/flash-attention/csrc/flash_attn/src/fmha_utils.h:68:53: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
68 | alpha = reinterpret_cast<const uint32_t &>( h2 );
| ^~
/home/marc/Schreibtisch/AI/xformers/third_party/flash-attention/csrc/flash_attn/src/fmha_utils.h:70:53: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
70 | alpha = reinterpret_cast<const uint32_t &>( norm );
| ^~~~
/home/marc/Schreibtisch/AI/xformers/third_party/flash-attention/csrc/flash_attn/fmha_api.cpp: In function 'void set_params_fprop(FMHA_fprop_params&, size_t, size_t, size_t, size_t, size_t, at::Tensor, at::Tensor, at::Tensor, void*, void*, void*, void*, void*, void*, float, float, bool)':
/home/marc/Schreibtisch/AI/xformers/third_party/flash-attention/csrc/flash_attn/fmha_api.cpp:62:11: warning: 'void* memset(void*, int, size_t)' clearing an object of non-trivial type 'struct FMHA_fprop_params'; use assignment or value-initialization instead [-Wclass-memaccess]
62 | memset(&params, 0, sizeof(params));
| ~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/marc/Schreibtisch/AI/xformers/third_party/flash-attention/csrc/flash_attn/src/fmha.h:74:8: note: 'struct FMHA_fprop_params' declared here
74 | struct FMHA_fprop_params : public Qkv_params {
| ^~~~~~~~~~~~~~~~~
/home/marc/Schreibtisch/AI/xformers/third_party/flash-attention/csrc/flash_attn/fmha_api.cpp:58:15: warning: unused variable 'acc_type' [-Wunused-variable]
58 | Data_type acc_type = DATA_TYPE_FP32;
| ^~~~~~~~
/home/marc/Schreibtisch/AI/xformers/third_party/flash-attention/csrc/flash_attn/fmha_api.cpp: In function 'std::vector<at::Tensor> mha_bwd_block(const at::Tensor&, const at::Tensor&, const at::Tensor&, const at::Tensor&, const at::Tensor&, const at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, const at::Tensor&, const at::Tensor&, const at::Tensor&, int, int, float, float, bool, c10::optional<at::Generator>)':
/home/marc/Schreibtisch/AI/xformers/third_party/flash-attention/csrc/flash_attn/fmha_api.cpp:597:10: warning: unused variable 'is_sm8x' [-Wunused-variable]
597 | bool is_sm8x = dprops->major == 8 && dprops->minor >= 0;
| ^~~~~~~
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1901, in _run_ninja_build
subprocess.run(
File "/home/anaconda3/envs/MII/lib/python3.10/subprocess.py", line 524, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<string>", line 2, in <module>
File "<pip-setuptools-caller>", line 34, in <module>
File "/home/marc/Schreibtisch/AI/xformers/setup.py", line 251, in <module>
setuptools.setup(
File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/setuptools/__init__.py", line 87, in setup
return distutils.core.setup(**attrs)
File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/setuptools/_distutils/core.py", line 185, in setup
return run_commands(dist)
File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/setuptools/_distutils/core.py", line 201, in run_commands
dist.run_commands()
File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 968, in run_commands
self.run_command(cmd)
File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/setuptools/dist.py", line 1217, in run_command
super().run_command(command)
File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 987, in run_command
cmd_obj.run()
File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/wheel/bdist_wheel.py", line 299, in run
self.run_command('build')
File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/setuptools/_distutils/cmd.py", line 319, in run_command
self.distribution.run_command(command)
File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/setuptools/dist.py", line 1217, in run_command
super().run_command(command)
File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 987, in run_command
cmd_obj.run()
File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/setuptools/_distutils/command/build.py", line 132, in run
self.run_command(cmd_name)
File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/setuptools/_distutils/cmd.py", line 319, in run_command
self.distribution.run_command(command)
File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/setuptools/dist.py", line 1217, in run_command
super().run_command(command)
File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 987, in run_command
cmd_obj.run()
File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/setuptools/command/build_ext.py", line 84, in run
_build_ext.run(self)
File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 346, in run
self.build_extensions()
File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 843, in build_extensions
build_ext.build_extensions(self)
File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 466, in build_extensions
self._build_extensions_serial()
File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 492, in _build_extensions_serial
self.build_extension(ext)
File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/setuptools/command/build_ext.py", line 246, in build_extension
_build_ext.build_extension(self, ext)
File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 547, in build_extension
objects = self.compiler.compile(
File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 658, in unix_wrap_ninja_compile
_write_ninja_file_and_compile_objects(
File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1573, in _write_ninja_file_and_compile_objects
_run_ninja_build(
File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1917, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error compiling objects for extension
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for xformers
Running setup.py clean for xformers
Failed to build xformers
ERROR: Failed to build one or more wheels
Ouch @Marcophono2 , sounds like quite the headache! Thanks for the info!
Eureka! Finally I got it!! :-) Thank you again for your valuable input, @C43H66N12O12S2! I switched back to Ubuntu and was able to compile xFormers for the sm_89 CUDA cores there. Now my 4090 is running at an impressive 42it/s! I cannot stop watching it! As I have two of them, plus a 3090 as a third card in my computer, I have no understanding of the sorrows of others here in Germany facing a cold winter while the energy supply is not guaranteed. My room is warm within a few minutes. ROFL! I must admit that the problem was probably not Manjaro but my own idiocy: I simply overlooked the fact that I have to add
pipe.enable_xformers_memory_efficient_attention()
into my Python code. Grrrr!!! But never mind! I learned a lot in the time I viewed hundreds of files to follow the phantasmal problem.
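For anyone retracing this on another machine, here is a minimal sketch of the build environment I believe is involved, based on the build flags mentioned in this thread; the `8.9` value targets Lovelace (sm_89). The pip build itself is left commented out in the sketch because it is long-running:

```shell
# Sketch only: environment for building xformers from source for Lovelace (sm_89).
# Assumes the CUDA 11.8 toolkit and a matching PyTorch are already installed.
export TORCH_CUDA_ARCH_LIST="8.9"   # "8.6" would target the 3090 (Ampere) instead
export FORCE_CUDA="1"
echo "building xformers for arch: $TORCH_CUDA_ARCH_LIST"

# The actual build (commented out here; run it once the env vars are set):
# pip install --no-clean git+https://github.com/facebookresearch/xformers#egg=xformers
```

Note that older PyTorch releases may not accept "8.9" in `TORCH_CUDA_ARCH_LIST` without the `torch/utils/cpp_extension.py` patch mentioned earlier in this thread.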
But one thing is a bit strange: while my 4090s perform at 42 it/s, my 3090 is only at 20 it/s. That does not reflect the pure hardware power; the 3090 should be at 28 it/s. At least before my final tuning, the gap between the 3090 and the 4090 was always about 50%. Or the other way round: the 3090 was at 67% of the 4090. But why is the 3090 now at less than 50%? Any ideas? For the 3090 I set up another conda environment where I compiled xFormers with the standard settings, without replacing the cudnn files.
Best regards Marc
Great work!
You should test both with replaced cuDNN (8.6.0), as it'll improve performance even with the 3090 (to a lesser degree compared with the 4090.)
However, Lovelace is simply a better architecture, with significant upgrades to the entire core (L2, Tensor cores, even the shader cores), so it wouldn't surprise me to witness the 4090 punching above its weight.
Interesting! Also cc @pcuenca here FYI
@Marcophono2 I'm having the same error:
597 | bool is_sm8x = dprops->major == 8 && dprops->minor >= 0;
But it's while trying to compile xformers for the Stable Diffusion v2. Where did you add the line?
pipe.enable_xformers_memory_efficient_attention()
Thanks for any help you can give.
@richard-schwab Directly before the pipeline creates the image. I am now on the wrong computer. If needed, I will copy you the related part later.
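In the meantime, a hedged sketch of where that call typically sits in a diffusers script; the model id, dtype, and prompt below are placeholders of my own, not taken from this thread:

```python
# Hypothetical sketch: enabling xFormers attention on a diffusers pipeline.
# The model id and dtype are assumptions; adapt them to your own setup.

def build_pipeline(model_id: str = "runwayml/stable-diffusion-v1-5"):
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        model_id, torch_dtype=torch.float16
    ).to("cuda")
    # The line in question: enable it after moving to the GPU and
    # before any image is generated.
    pipe.enable_xformers_memory_efficient_attention()
    return pipe

# Usage (requires a CUDA GPU and the diffusers/xformers packages):
# pipe = build_pipeline()
# image = pipe("an astronaut riding a horse").images[0]
```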
Maybe also @anton-l @pcuenca @NouamaneTazi if you have any tips / hints here
Hi @Marcophono2! Interesting comments. My 3090 runs at about 20 it/s too (with xFormers efficient attention enabled). It gets a bit better as you increase the batch size, up to ~26 it/s for a large batch of 24 images at once. I'm surprised that your 4090 is capable of performing inference twice as fast; I'd be really happy about that! I don't have a 4090, but we may be able to test on one soon.
Could you mention how you built for the 3090? I'm still getting "efficient_attention_forward_cutlass" errors even after trying all kinds of builds.
%env TORCH_CUDA_ARCH_LIST = "8.6"
%env FORCE_CUDA = "1"
%env CUDA_VISIBLE_DEVICES = 0
%pip install --no-clean git+https://github.com/facebookresearch/xformers#egg=xformers
Hi @Marcophono2 could you please detail the steps that you've followed to get the 17 it/s? I have a 4090 and I am at 10 it/s as well.
CUDA 12.0 Driver 525 Ubuntu 22.04
@harishprabhala 17 it/s? That was a long time ago; I have been at 42 it/s for a few weeks now. :) I also wrote that here some posts later. I will come back to you later today.
@harishprabhala 42 it/s on SD 1.5. Running on SD2.1 at 768x768px I am at 20 it/s.
@Marcophono2 I am curious as to how to got to 24 it/s without xformers. I am only interested in PyTorch performance. with voltaML (TensorRT) I am getting 84 it/s :)
@harishprabhala
with voltaML (TensorRT) I am getting 84 it/s :)
WHAT?? And I thought I was the champ here in this repository with my 42 it/s. :) So your idea is that you could improve your performance by 70%? Related to my 17 it/s vs. your 10 it/s without xformers. It seems my software is out of date. I have the 520.61.05 GeForce driver (which is still the latest one, as I installed the studio driver while you installed the game-ready driver; maybe this is the reason). Also, I did not know that CUDA 12 is already available. I still have 11.8.
What finally brought me the punch from 9.5 to 17 it/s (without xformers) was to replace the cudnn files in /home/{user}/anaconda3/envs/{envname}/lib/python3.10/site-packages/torch/lib
In my case I used a package named "cudnn-linux-x86_64-8.6.0.163_cuda11-archive", but as you have a newer CUDA version I think your cudnn version is newer too. I have 11.8.89 (nvcc -V). What is your version?
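As a hedged illustration of the library swap described above, the sketch below mimics the copy step against throwaway directories so it is safe to run as-is. The real paths are assumptions to adapt: the source is the `lib/` directory of the extracted cuDNN archive, and the destination is `~/anaconda3/envs/<env>/lib/python3.10/site-packages/torch/lib`:

```shell
# Sketch: overwrite the cuDNN libraries that PyTorch bundles with newer ones.
# Demonstrated on temporary stand-in directories; substitute the real paths.
SRC=$(mktemp -d)   # stands in for cudnn-linux-x86_64-8.6.0.163_cuda11-archive/lib
DST=$(mktemp -d)   # stands in for .../site-packages/torch/lib

# Fake files so the sketch runs; in reality these already exist.
touch "$SRC/libcudnn.so.8" "$SRC/libcudnn_ops_infer.so.8"
touch "$DST/libcudnn.so.8"              # the older copy shipped with torch

cp -f "$SRC"/libcudnn*.so.8 "$DST"/     # overwrite torch's bundled copies
ls "$DST"
```

With the real paths you would also back up the original files first, since a mismatched cuDNN can break the torch install.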
I'm also interested, as the investment in a 4090 has not paid off yet. I've successfully compiled xformers with PyTorch 1.13.1 cu117, but single-image SD 1.4 at 512x512 is 11.5 it/s.
I replaced the lib files and got up to 20 it/s, but nowhere near 40. This is for a single image.
[+] bitsandbytes version 0.35.0 installed.
[+] diffusers version 0.10.2 installed.
[+] transformers version 4.25.1 installed.
[+] xformers version 0.0.15.dev0+103e863.d20221120 installed.
[+] torch version 1.13.1+cu117 installed.
[+] torchvision version 0.14.1+cu117 installed.
+-------------------------------+----------------------+----------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA Graphics...  On   | 00000000:01:00.0 Off |                  Off |
| 30%   39C    P8    25W / 450W |   3839MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1173      G   /usr/lib/xorg/Xorg                 13MiB |
|    0   N/A  N/A      1371      G   /usr/bin/gnome-shell               12MiB |
|    0   N/A  N/A   4147066      C   python3                          3808MiB |
+-----------------------------------------------------------------------------+
@sile16 Hi Matt! I think you did not compile xformers correctly against CUDA 11.8. Can you first tell me the value you get by entering
nvcc -V
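For anyone comparing toolkit versions, the release number can be pulled out of the `nvcc -V` output mechanically. A small stdlib-only helper; the sample string mirrors the toolkit version reported earlier in this thread:

```python
import re

def cuda_version(nvcc_output: str) -> str:
    """Extract the CUDA release number (e.g. '11.8') from `nvcc -V` output."""
    m = re.search(r"release (\d+\.\d+)", nvcc_output)
    if m is None:
        raise ValueError("no CUDA release found in nvcc output")
    return m.group(1)

# Sample output matching the V11.8.89 toolkit mentioned above:
sample = (
    "nvcc: NVIDIA (R) Cuda compiler driver\n"
    "Cuda compilation tools, release 11.8, V11.8.89\n"
    "Build cuda_11.8.r11.8/compiler.31833905_0\n"
)
print(cuda_version(sample))  # -> 11.8
```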
@sile16 Replacing the lib files brings one improvement for the direct calculations. Compiling xformers against CUDA 11.8 brings a second punch. CUDA 12 is available, but I have no experience with it together with xformers yet. Tomorrow I get the last component for a second server. Then I will set up a system with Nvidia's game-ready driver, which has a higher version number and also CUDA 12.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Describe the bug
Hello!
For 10 days now, nearly round the clock, I have been trying to get my brand new and proudly owned GeForce RTX 4090 graphics card to work properly with Stable Diffusion. But finally, 10 days later at least, it is still around 50% below its potential.
In those 240 hours I changed from Ubuntu to Manjaro (and from Manjaro back to Ubuntu, via Pop!_OS back again to Ubuntu, and on to Manjaro Nightly, which contains all the Nvidia support, more or less working). Ubuntu absolutely would not let me bring together 22.10, 22.04 or 20.04 with my AMD hardware
Threadripper Pro 3955WX ASUS PRO WS WRX80E-SAGE
and my graphic card RTX 4090.
Yes, really, it was not the 4090. It was the mainboard and the CPU that made the big trouble since one of the newer Ubuntu versions. Or, the other way round: Ubuntu is the (damn) troublemaker. After about 50 re-installations I replaced the 4090 with a GeForce 2070, started from scratch and found myself again (and again) in the same position: yelling and cursing! Still the same issues.
Meanwhile, yes better now, I could bring together Manjaro with CUDA 11.8, Nvidia Driver Version: 520.56.06, Cuda compilation tools 11.8, V11.8.89, (build cuda_11.8.r11.8/compiler.31833905_0)
and used the nightly PyTorch version 1.13
Benchmark results:
with RTX 3090 (512x512) standard, fp16 12.7 it/s
with RTX 3090 (512x512) fp16, prepared unet optimization 14.9 it/s
with RTX 4090 (512x512) fp16, with or without optimization 11.5 it/s
So, at the end (of my long issue description) there is still the question: Why is SD, by all software and driver support so weak on a RTX 4090 compared to a RTX 3090?
I know that xFormers has lately been an impressive performance boost and advantage. But I excluded xFormers in my benchmarks.
Can anyone help me? I am quite frustrated by now. If someone can help me fix this missing link between SD, Nvidia support, PyTorch and my hardware, I will be generous.
Best regards Marc
Reproduction
No response
Logs
No response
System Info
Manjaro Nightly
Threadripper Pro 3955WX ASUS PRO WS WRX80E-SAGE
CUDA 11.8, Nvidia Driver Version: 520.56.06, Cuda compilation tools 11.8, V11.8.89, (build cuda_11.8.r11.8/compiler.31833905_0)
nightly PyTorch version 1.13