Open sashaok123 opened 3 months ago
(Viruses on board!)
How can I ban a user or send a complaint to the administrators?
> How can I ban a user or send a complaint to the administrators?
You can report inauthentic account activity to GitHub Support by visiting the account's home page, where there is a "Block or report" option under the account avatar.
EDIT: By the way, this account has since been deleted.
https://app.any.run/tasks/abb4419a-a8cb-4707-946d-e73a9d3561bb The usual Lumma stealer... I don't know if you get notifications for every message in an issue, so: @lllyasviel bad files x.x
For those who don't know how to install xformers with CUDA 12.4, PyTorch 2.4 and Python 3.10, read this.
Link to download the .whl file: https://github.com/facebookresearch/xformers/actions/runs/10559887009
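In case it helps, a minimal sketch of the install itself, assuming a standard Forge install with its own venv and that you downloaded the Windows py3.10 / torch 2.4 / cu124 wheel artifact from that Actions run (the paths and the exact wheel filename below are only examples):

```bat
:: Activate Forge's virtual environment (adjust the path to your install)
cd /d C:\stable-diffusion-webui-forge
call venv\Scripts\activate.bat

:: Install the downloaded wheel (the exact filename depends on the artifact you grabbed)
pip install C:\Downloads\xformers-0.0.28.dev893-cp310-cp310-win_amd64.whl

:: Quick sanity check that xformers imports and reports its version
python -c "import xformers; print(xformers.__version__)"
```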
Is there a noticeable enough performance boost to want to reinstall with newer pytorch?
> Is there a noticeable enough performance boost to want to reinstall with newer pytorch?
With an RTX 3090 I'm seeing a bit of a boost; when it takes 1.5 seconds per step, every little bit helps. But it's also because of great developments in Forge. Tested with Flux GGUF Q8, 20 steps, Euler Simple, at 1024x1024, after 3-4 warm-up runs for the fastest possible time per image.
2f0555f: Queue/Shared was 31.7 sec; Async was 33.7 (didn't work well at the time).
d339600 with --disable-xformers: Queue/Shared: 28.7 sec; Async/Shared: 28.5 sec.
d339600 + xformers: Queue/Shared: 26.3 sec; Async/Shared actually isn't any faster now, at 26.6-26.8 sec.
> Is there a noticeable enough performance boost to want to reinstall with newer pytorch?
Yes!
Did some benchmarks on my RTX 3070 with Flux Q8, 28 steps, Euler, Simple, 1024x1024:
Forge with CUDA 12.1 + Pytorch 2.3.1: 3.61s/it
Forge with CUDA 12.4 + Pytorch 2.4: 3.05s/it
(15% faster)
Forge with CUDA 12.4 + Pytorch 2.4 + xformers: 2.85s/it
(21% faster)
wow
> With an RTX 3090 I'm seeing a bit of a boost; when it takes 1.5 seconds per step, every little bit helps. But it's also because of great developments in Forge. Tested with Flux GGUF Q8, 20 steps, Euler Simple, at 1024x1024, after 3-4 warm-up runs for the fastest possible time per image. 2f0555f: Queue/Shared was 31.7 sec; Async was 33.7 (didn't work well at the time). d339600 with --disable-xformers: Queue/Shared: 28.7 sec; Async/Shared: 28.5 sec. d339600 + xformers: Queue/Shared: 26.3 sec; Async/Shared actually isn't any faster now, at 26.6-26.8 sec.
Can you share your command line args? I'm getting around 2.2 s/it with the same config, using COMMANDLINE_ARGS= --xformers --skip-torch-cuda-test --cuda-stream
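For reference, those flags normally live in webui-user.bat; a minimal sketch of that file with just the quoted args set (everything else left at its defaults) might look like:

```bat
@echo off
:: webui-user.bat (sketch) - only COMMANDLINE_ARGS is changed here
set PYTHON=
set GIT=
set VENV_DIR=
set COMMANDLINE_ARGS=--xformers --skip-torch-cuda-test --cuda-stream

call webui.bat
```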
> Can you share your command line args? I'm getting around 2.2 s/it with the same config, using COMMANDLINE_ARGS= --xformers --skip-torch-cuda-test --cuda-stream
2.2 definitely seems a bit on the slow side for Q8. I use pretty much the same args usually. Out of curiosity I removed them all, so only --xformers remains; the speed was not impacted at all! Maybe it's just because of the simple generation settings? Retested on my current updated commit (stuff changes in 5 days): 1024x1024, Euler Simple, 30 steps, Queue/Shared swap. Model: flux1-dev-Q8_0, Module 1: t5-v1_1-xxl-encoder-Q8_0, Module 2: clip_l, Module 3: ae. The console reports 1.3 s/it, and after "settling" for 2-3 runs, the fastest time per image was reported at 39.6 sec.
Versions from UI bottom: version: f2.0.1v1.10.1-previous-495-g4f64f6da  •  python: 3.10.6  •  torch: 2.4.0+cu124  •  xformers: 0.0.28.dev893+cu124  •  gradio: 4.40.0  •  checkpoint: d9b5d2777c
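If you'd rather confirm those versions from the console than from the UI footer, a quick check against the venv's own interpreter (assuming the default venv layout) is:

```bat
:: Print the torch / CUDA / xformers versions that the Forge venv actually uses
venv\Scripts\python.exe -c "import torch, xformers; print(torch.__version__, torch.version.cuda, xformers.__version__)"
```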
> 2.2 definitely seems a bit on the slow side for Q8. I use pretty much the same args usually. Out of curiosity I removed them all, so only --xformers remains; the speed was not impacted at all! Maybe it's just because of the simple generation settings? Retested on my current updated commit (stuff changes in 5 days): 1024x1024, Euler Simple, 30 steps, Queue/Shared swap. Model: flux1-dev-Q8_0, Module 1: t5-v1_1-xxl-encoder-Q8_0, Module 2: clip_l, Module 3: ae. The console reports 1.3 s/it, and after "settling" for 2-3 runs, the fastest time per image was reported at 39.6 sec.

> Versions from UI bottom: version: f2.0.1v1.10.1-previous-495-g4f64f6da • python: 3.10.6 • torch: 2.4.0+cu124 • xformers: 0.0.28.dev893+cu124 • gradio: 4.40.0 • checkpoint: d9b5d2777c
When I start, I get an error like this:
pytorch version: 2.4.0+cu124
WARNING:xformers:A matching Triton is not available, some optimizations will not be enabled
Traceback (most recent call last):
  File "..forge\venv\lib\site-packages\xformers\__init__.py", line 57, in _is_triton_available
  import triton  # noqa
ModuleNotFoundError: No module named 'triton'
xformers version: 0.0.28.dev893+cu124
Set vram state to: NORMAL_VRAM
VAE dtype preferences: [torch.bfloat16, torch.float32] -> torch.bfloat16
CUDA Using Stream: True
Using xformers cross attention
Using xformers attention for VAE
With SDXL I'm getting 3.57 it/s:
night rain ancient era
Steps: 20, Sampler: DPM++ 2M SDE, Schedule type: Karras, CFG scale: 7.5, Seed: 3617511334, Size: 1024x1024, Model hash: 7b91764cf2, Model: copaxTimelessxlSDXL1_v122, Version: f2.0.1v1.10.1-previous-501-g668e87f9, Module 1: sdxl_vae_fp16_fixv2, Source Identifier: Stable Diffusion web UI
Can you confirm whether the issue is only with Flux, or is it my installation?
> When I start, I get an error like this:

> pytorch version: 2.4.0+cu124 WARNING:xformers:A matching Triton is not available, some optimizations will not be enabled Traceback (most recent call last): File "..forge\venv\lib\site-packages\xformers\__init__.py", line 57, in _is_triton_available import triton # noqa ModuleNotFoundError: No module named 'triton' xformers version: 0.0.28.dev893+cu124 Set vram state to: NORMAL_VRAM VAE dtype preferences: [torch.bfloat16, torch.float32] -> torch.bfloat16 CUDA Using Stream: True Using xformers cross attention Using xformers attention for VAE

> With SDXL I'm getting 3.57 it/s: night rain ancient era Steps: 20, Sampler: DPM++ 2M SDE, Schedule type: Karras, CFG scale: 7.5, Seed: 3617511334, Size: 1024x1024, Model hash: 7b91764cf2, Model: copaxTimelessxlSDXL1_v122, Version: f2.0.1v1.10.1-previous-501-g668e87f9, Module 1: sdxl_vae_fp16_fixv2, Source Identifier: Stable Diffusion web UI

> Can you confirm whether the issue is only with Flux, or is it my installation?
Yeah, the Triton thing is apparently only for Linux. It's not a real issue on Windows; you can ignore this message. https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/7115
I'm getting almost exactly the same speed with SDXL (it fluctuates, up to 3.6, but effectively identical to yours) and these settings, so that leaves something weird with Flux. Just to make sure, I'm using the Q8 model from here: https://huggingface.co/city96/FLUX.1-dev-gguf/tree/main
Otherwise all the versions seem identical, and we get the same startup output. Even without xformers it should be quite a bit faster. As a sanity check in such cases, I like to just git clone a fresh copy and see if there are any differences, and maybe erase the venv folder and let stuff rebuild if the fresh copy was indeed faster. It makes hunting for a specific issue less frustrating.
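A rough sketch of that sanity check (the folder name is arbitrary; the point is a clean clone with its own freshly built venv to compare against):

```bat
:: Clone a fresh copy of Forge next to the existing install
git clone https://github.com/lllyasviel/stable-diffusion-webui-forge.git forge-fresh
cd forge-fresh
:: The first launch rebuilds the venv and downloads dependencies
call webui-user.bat

:: Or, in the existing install, delete the venv so it gets rebuilt on the next launch
rmdir /s /q venv
```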
> For those who don't know how to install xformers with CUDA 12.4, PyTorch 2.4 and Python 3.10, read this.
> Link to download the .whl file: https://github.com/facebookresearch/xformers/actions/runs/10559887009
Can you explain how to install the wheel in Forge without a venv (the CUDA 12.4 / PyTorch 2.4 .zip on the main page)? I know it uses embedded Python and sets the paths via environment.bat, but I still can't get pip to work.

EDIT: I think I figured it out; it's the same as with ComfyUI's embedded Python. The embedded python.exe is in system\python\python.exe, and then you just add -m pip install after the .exe.
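Put together, from the root of the portable package, the command looks roughly like this (the wheel filename is only an example):

```bat
:: Install a downloaded xformers wheel using the embedded Python that ships with the portable Forge package
system\python\python.exe -m pip install C:\Downloads\xformers-0.0.28.dev893-cp310-cp310-win_amd64.whl
```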
You can laugh at me now.
I asked the xformers developers for a precompiled version of xFormers that is compatible with CUDA 12.4 and PyTorch 2.4: https://github.com/facebookresearch/xformers/issues/1079
They have now published precompiled wheels for CUDA 12.4 and PyTorch 2.4: https://github.com/facebookresearch/xformers/actions/runs/10559887009
Now you can fully add xformers to a fresh Forge install.