Open sashaok123 opened 3 months ago
(Viruses on board!)
How can I ban a user or send a complaint to the administrators?
> How can I ban a user or send a complaint to the administrators?
You can report inauthentic account activity to GitHub Support by visiting the account's home page, where there is a "Block or report" option under the account avatar.
EDIT: By the way, this account has since been deleted.
https://app.any.run/tasks/abb4419a-a8cb-4707-946d-e73a9d3561bb The usual Lumma stealer... I don't know if you get notifications for every message in an issue, so: @lllyasviel bad files x.x
For those who don't know how to install xformers with CUDA 12.4, PyTorch 2.4 and Python 3.10, read this.
Link to download the .whl file: https://github.com/facebookresearch/xformers/actions/runs/10559887009
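In case it helps, a minimal sketch of the install itself, assuming a standard Forge install with its own venv and that you downloaded the Windows py3.10 / torch 2.4 / cu124 wheel artifact from that Actions run (the paths and the exact wheel filename below are only examples):

```bat
:: Activate Forge's virtual environment (adjust the path to your install)
cd /d C:\stable-diffusion-webui-forge
call venv\Scripts\activate.bat

:: Install the downloaded wheel (the exact filename depends on the artifact you grabbed)
pip install C:\Downloads\xformers-0.0.28.dev893-cp310-cp310-win_amd64.whl

:: Quick sanity check that xformers imports and reports its version
python -c "import xformers; print(xformers.__version__)"
```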
Is there a noticeable enough performance boost to want to reinstall with newer pytorch?
> Is there a noticeable enough performance boost to want to reinstall with newer pytorch?
With an RTX 3090 I'm seeing a bit of a boost; when it takes 1.5 seconds per step, every little bit helps. But it's also because of great developments in Forge. Tested with Flux GGUF Q8, 20 steps, Euler Simple, at 1024x1024, after 3-4 warm-up runs for the fastest possible time per image.
2f0555f: Queue/Shared was 31.7 sec; Async was 33.7 (didn't work well at the time).
d339600 with --disable-xformers: Queue/Shared: 28.7 sec; Async/Shared: 28.5 sec.
d339600 + xformers: Queue/Shared: 26.3 sec; Async/Shared actually isn't any faster now, at 26.6-26.8 sec.
> Is there a noticeable enough performance boost to want to reinstall with newer pytorch?
Yes!
Did some benchmarks on my RTX 3070 with Flux Q8, 28 steps, Euler, Simple, 1024x1024:
Forge with CUDA 12.1 + Pytorch 2.3.1: 3.61s/it
Forge with CUDA 12.4 + Pytorch 2.4: 3.05s/it
(15% faster)
Forge with CUDA 12.4 + Pytorch 2.4 + xformers: 2.85s/it
(21% faster)
wow
> With an RTX 3090 I'm seeing a bit of a boost; when it takes 1.5 seconds per step, every little bit helps. But it's also because of great developments in Forge. Tested with Flux GGUF Q8, 20 steps, Euler Simple, at 1024x1024, after 3-4 warm-up runs for the fastest possible time per image. 2f0555f: Queue/Shared was 31.7 sec; Async was 33.7 (didn't work well at the time). d339600 with --disable-xformers: Queue/Shared: 28.7 sec; Async/Shared: 28.5 sec. d339600 + xformers: Queue/Shared: 26.3 sec; Async/Shared actually isn't any faster now, at 26.6-26.8 sec.
Can you share your command line args? I'm getting around 2.2 s/it with the same config, using COMMANDLINE_ARGS= --xformers --skip-torch-cuda-test --cuda-stream
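For reference, those flags normally live in webui-user.bat; a minimal sketch of that file with just the quoted args set (everything else left at its defaults) might look like:

```bat
@echo off
:: webui-user.bat (sketch) - only COMMANDLINE_ARGS is changed here
set PYTHON=
set GIT=
set VENV_DIR=
set COMMANDLINE_ARGS=--xformers --skip-torch-cuda-test --cuda-stream

call webui.bat
```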
> Can you share your command line args? I'm getting around 2.2 s/it with the same config, using COMMANDLINE_ARGS= --xformers --skip-torch-cuda-test --cuda-stream
2.2 definitely seems a bit on the slow side for Q8. I use pretty much the same args usually. Out of curiosity I removed them all, so only --xformers remains; the speed was not impacted at all! Maybe it's just because of the simple generation settings? Retested on my current updated commit (stuff changes in 5 days): 1024x1024, Euler Simple, 30 steps, Queue/Shared swap. Model: flux1-dev-Q8_0, Module 1: t5-v1_1-xxl-encoder-Q8_0, Module 2: clip_l, Module 3: ae. The console reports 1.3 s/it, and after "settling" for 2-3 runs, the fastest time per image was reported at 39.6 sec.
Versions from UI bottom: version: f2.0.1v1.10.1-previous-495-g4f64f6da  •  python: 3.10.6  •  torch: 2.4.0+cu124  •  xformers: 0.0.28.dev893+cu124  •  gradio: 4.40.0  •  checkpoint: d9b5d2777c
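If you'd rather confirm those versions from the console than from the UI footer, a quick check against the venv's own interpreter (assuming the default venv layout) is:

```bat
:: Print the torch / CUDA / xformers versions that the Forge venv actually uses
venv\Scripts\python.exe -c "import torch, xformers; print(torch.__version__, torch.version.cuda, xformers.__version__)"
```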
> 2.2 definitely seems a bit on the slow side for Q8. I use pretty much the same args usually. Out of curiosity I removed them all, so only --xformers remains; the speed was not impacted at all! Maybe it's just because of the simple generation settings? Retested on my current updated commit (stuff changes in 5 days): 1024x1024, Euler Simple, 30 steps, Queue/Shared swap. Model: flux1-dev-Q8_0, Module 1: t5-v1_1-xxl-encoder-Q8_0, Module 2: clip_l, Module 3: ae. The console reports 1.3 s/it, and after "settling" for 2-3 runs, the fastest time per image was reported at 39.6 sec.

> Versions from UI bottom: version: f2.0.1v1.10.1-previous-495-g4f64f6da • python: 3.10.6 • torch: 2.4.0+cu124 • xformers: 0.0.28.dev893+cu124 • gradio: 4.40.0 • checkpoint: d9b5d2777c
When I start, I get an error like this:
pytorch version: 2.4.0+cu124
WARNING:xformers:A matching Triton is not available, some optimizations will not be enabled
Traceback (most recent call last):
  File "..forge\venv\lib\site-packages\xformers\__init__.py", line 57, in _is_triton_available
  import triton  # noqa
ModuleNotFoundError: No module named 'triton'
xformers version: 0.0.28.dev893+cu124
Set vram state to: NORMAL_VRAM
VAE dtype preferences: [torch.bfloat16, torch.float32] -> torch.bfloat16
CUDA Using Stream: True
Using xformers cross attention
Using xformers attention for VAE
With SDXL I'm getting 3.57 it/s:
night rain ancient era
Steps: 20, Sampler: DPM++ 2M SDE, Schedule type: Karras, CFG scale: 7.5, Seed: 3617511334, Size: 1024x1024, Model hash: 7b91764cf2, Model: copaxTimelessxlSDXL1_v122, Version: f2.0.1v1.10.1-previous-501-g668e87f9, Module 1: sdxl_vae_fp16_fixv2, Source Identifier: Stable Diffusion web UI
Can you confirm whether the issue is only with Flux, or is it my installation?
> When I start, I get an error like this:

> pytorch version: 2.4.0+cu124 WARNING:xformers:A matching Triton is not available, some optimizations will not be enabled Traceback (most recent call last): File "..forge\venv\lib\site-packages\xformers\__init__.py", line 57, in _is_triton_available import triton # noqa ModuleNotFoundError: No module named 'triton' xformers version: 0.0.28.dev893+cu124 Set vram state to: NORMAL_VRAM VAE dtype preferences: [torch.bfloat16, torch.float32] -> torch.bfloat16 CUDA Using Stream: True Using xformers cross attention Using xformers attention for VAE

> With SDXL I'm getting 3.57 it/s: night rain ancient era Steps: 20, Sampler: DPM++ 2M SDE, Schedule type: Karras, CFG scale: 7.5, Seed: 3617511334, Size: 1024x1024, Model hash: 7b91764cf2, Model: copaxTimelessxlSDXL1_v122, Version: f2.0.1v1.10.1-previous-501-g668e87f9, Module 1: sdxl_vae_fp16_fixv2, Source Identifier: Stable Diffusion web UI

> Can you confirm whether the issue is only with Flux, or is it my installation?
Yeah, the Triton thing is apparently only for Linux. It's not a real issue on Windows; you can ignore this message. https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/7115
I'm getting almost exactly the same speed with SDXL (it fluctuates, up to 3.6, but effectively identical to yours) and these settings, so that leaves something weird with Flux. Just to make sure, I'm using the Q8 model from here: https://huggingface.co/city96/FLUX.1-dev-gguf/tree/main
Otherwise all the versions seem identical, and we get the same startup output. Even without xformers it should be quite a bit faster. As a sanity check in such cases, I like to just git clone a fresh copy and see if there are any differences, and maybe erase the venv folder and let stuff rebuild if the fresh copy was indeed faster. It makes hunting for a specific issue less frustrating.
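A rough sketch of that sanity check (the folder name is arbitrary; the point is a clean clone with its own freshly built venv to compare against):

```bat
:: Clone a fresh copy of Forge next to the existing install
git clone https://github.com/lllyasviel/stable-diffusion-webui-forge.git forge-fresh
cd forge-fresh
:: The first launch rebuilds the venv and downloads dependencies
call webui-user.bat

:: Or, in the existing install, delete the venv so it gets rebuilt on the next launch
rmdir /s /q venv
```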
> For those who don't know how to install xformers with CUDA 12.4, PyTorch 2.4 and Python 3.10, read this.
> Link to download the .whl file: https://github.com/facebookresearch/xformers/actions/runs/10559887009
Can you explain how to install the wheel in Forge without a venv (the CUDA 12.4 / PyTorch 2.4 .zip on the main page)? I know it uses embedded Python and sets the paths via environment.bat, but I still can't get pip to work.

EDIT: I think I figured it out; it's the same as with ComfyUI's embedded Python. The embedded python.exe is in system\python\python.exe, and then you just add -m pip install after the .exe.
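Put together, from the root of the portable package, the command looks roughly like this (the wheel filename is only an example):

```bat
:: Install a downloaded xformers wheel using the embedded Python that ships with the portable Forge package
system\python\python.exe -m pip install C:\Downloads\xformers-0.0.28.dev893-cp310-cp310-win_amd64.whl
```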
You can laugh at me now.
I asked the xformers developers for a precompiled version of xFormers that is compatible with CUDA 12.4 and PyTorch 2.4: https://github.com/facebookresearch/xformers/issues/1079
They have now published precompiled wheels for CUDA 12.4 and PyTorch 2.4: https://github.com/facebookresearch/xformers/actions/runs/10559887009
Now you can fully add xformers to a fresh Forge install.