patientx / ComfyUI-Zluda

The most powerful and modular stable diffusion GUI, api and backend with a graph/nodes interface. Now ZLUDA enhanced for better AMD GPU performance.
GNU General Public License v3.0

Increase speed #22

Open cyber827 opened 2 weeks ago

cyber827 commented 2 weeks ago

Feature Idea

Found this comment by @Exploder98 suggesting removing bfloat16, which increased my speed by 50%: modify

supported_inference_dtypes = [torch.bfloat16, torch.float16, torch.float32]

to

supported_inference_dtypes = [torch.float16, torch.float32]

in https://github.com/comfyanonymous/ComfyUI/blob/7df42b9a2364bae6822fbd9e9fa10cea2e319ba3/comfy/supported_models.py#L645
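
For anyone who would rather not edit the file (a later update may overwrite the change), here is a minimal sketch of the same idea applied at runtime. It assumes the attribute lives on the `Flux` class in `comfy.supported_models`, as the linked line suggests; this is an illustration, not part of the repo:

```python
# Hypothetical runtime patch (illustration only, not part of ComfyUI-Zluda).
# Run before any model is loaded, e.g. from a small local wrapper around main.py.
# Assumes comfy.supported_models.Flux exposes supported_inference_dtypes as shown above.
import torch
import comfy.supported_models as supported_models

# Drop bfloat16 so fp16 is preferred on GPUs where bf16 is slow under ZLUDA.
supported_models.Flux.supported_inference_dtypes = [torch.float16, torch.float32]
```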

Additionally, optimization through PyTorch TunableOp could be tried. It did not work for me, but others confirmed it worked; maybe a script could be created for it.

Existing Solutions

No response

Other

https://github.com/city96/ComfyUI-GGUF/issues/48#issuecomment-2308413117

patientx commented 2 weeks ago

Going to check when I have time, thanks.

patientx commented 2 weeks ago

Ok, removing bfloat16 from the Flux model support really gave a 2x speedup; it now defaults to fp16. I am working on the other one.

pw405 commented 2 weeks ago

Oh cool, I wondered about that in my [Reddit post](https://www.reddit.com/r/FluxAI/comments/1eztuch/flux_on_amd_gpus_rdna3_wzluda/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button)!

I ran the update and did some testing, and confirmed the float type is no longer bfloat16 (screenshot attached).

The speed is the same though: about 2 seconds per it. (7900 XTX, 32 GB RAM, Windows 10, Radeon 24.8.1 driver.)

What sampler/scheduler are you seeing the speed increase with?

patientx commented 2 weeks ago

Euler, simple.

cyber827 commented 2 weeks ago

> Oh cool, I wondered about that in my [Reddit post](https://www.reddit.com/r/FluxAI/comments/1eztuch/flux_on_amd_gpus_rdna3_wzluda/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button)!
>
> I ran the update and did some testing, and confirmed the float type is no longer bfloat16 (screenshot attached).
>
> The speed is the same though: about 2 seconds per it. (7900 XTX, 32 GB RAM, Windows 10, Radeon 24.8.1 driver.)
>
> What sampler/scheduler are you seeing the speed increase with?

Euler / simple. Try using --force-fp32 or --force-fp16, and if there is no improvement then --use-split-cross-attention.

> Ok, removing bfloat16 from the Flux model support really gave a 2x speedup; it now defaults to fp16. I am working on the other one.

I got TunableOp working by putting this in start.bat:

set PYTORCH_TUNABLEOP_ENABLED=1
set PYTORCH_TUNABLEOP_VERBOSE=1
set PYTORCH_TUNABLEOP_HIPBLASLT_ENABLED=0

The .csv file is created, but the process can't write to it. Either it does not support Windows/ZLUDA and needs to be tricked into thinking it is running on ROCm, or, as someone said in the comments, main.py needs to be run directly, which gives me an error.
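
As a rough illustration of the "script" idea from the opening post, a hypothetical standalone warm-up could set the same variables before importing torch, run a few GEMMs so TunableOp has something to tune, and let the CSV be written on exit. Whether this behaves under ZLUDA on Windows is exactly what is being debugged here, so treat it as an experiment, not a fix:

```python
# tune_gemm.py -- hypothetical standalone TunableOp warm-up (not part of the repo).
# Set the variables before importing torch so TunableOp picks them up.
import os

os.environ["PYTORCH_TUNABLEOP_ENABLED"] = "1"
os.environ["PYTORCH_TUNABLEOP_VERBOSE"] = "1"
os.environ["PYTORCH_TUNABLEOP_HIPBLASLT_ENABLED"] = "0"

import torch

# Run a few fp16 matmuls so TunableOp has something to tune
# (sizes loosely based on the shapes in the logs below; each new
# GEMM shape triggers its own tuning pass).
a = torch.randn(3072, 4096, dtype=torch.float16, device="cuda")
b = torch.randn(4096, 64, dtype=torch.float16, device="cuda")
for _ in range(3):
    torch.matmul(a, b)
torch.cuda.synchronize()

# The results CSV is normally written when the process exits; newer PyTorch
# builds (2.3+) also expose a Python API that can force the write explicitly.
try:
    import torch.cuda.tunable as tunable
    tunable.write_file()
except (ImportError, AttributeError):
    pass  # older builds only write the file at process exit
```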

patientx commented 2 weeks ago

I can run ComfyUI without putting zluda in front of python ... in the batch file, and this way it works.

But let's try this: update comfy. There is a new batch file which enables this TunableOp. Try with SD 1.5, for example, 3 consecutive runs, then exit with CTRL-C in the cmd window, and you can see it is writing to the csv file, although I haven't tried with zluda in front, to be honest.

In the end, it is working, BUT when I tried it with SD 1.5, SDXL and also Flux, there wasn't much difference. Maybe it needs more testing. Maybe I am / we are already doing other speed-up tricks that do what this intends to do.

EDIT: Nope, it doesn't work with zluda in front ... You can try adding the zluda folder to the Windows PATH: type "env" while the start menu is open and it will show a shortcut to "Edit the system environment variables"; click "Environment Variables", look for "Path" in the bottom section, and add the zluda folder to the system path (the one inside the ComfyUI-Zluda folder should work), for example "D:\ComfyUI-Zluda\zluda"; restart the system to be sure. Then remove the ".\zluda\zluda.exe -- " part so the line just starts with "%PYTHON% main.py %COMMANDLINE_ARGS%" in "start-tunableop-novram.bat".

cyber827 commented 2 weeks ago

Added the zluda path. It runs with both:

.\zluda\zluda.exe -- %PYTHON% main.py %COMMANDLINE_ARGS%

and

%PYTHON% main.py %COMMANDLINE_ARGS%

But I get a similar error as when running it from start.bat, and tunableop_results0.csv remains empty:

error with .\zluda\zluda.exe -- %PYTHON% main.py %COMMANDLINE_ARGS%

reading tuning results from tunableop_results0.csv
could not open tunableop_results0.csv for reading tuning results
missing op_signature, returning null ResultEntry

finding fastest for GemmTunableOp_Half_TN(tn_3072_4096_64) out of 1 candidates
├──verify numerics: atol=1e-05, rtol=1e-05
├──tuning using warmup iters 1 [0.4353 ms] and tuning iters 68 [29.6004 ms] instance id=0, GemmTunableOp_Half_TN(tn_3072_4096_64) Default
├──found better instance id=0. 0.156974ms. Default
└──found fastest for GemmTunableOp_Half_TN(tn_3072_4096_64) Default
GemmTunableOp_Half_TN(tn_3072_4096_64) -> Default,0.156974
missing params_signature, returning null ResultEntry

error with %PYTHON% main.py %COMMANDLINE_ARGS%

reading tuning results from tunableop_results0.csv
key="PT_VERSION" is not provided for validation.
results validator check failed
missing op_signature, returning null ResultEntry

finding fastest for GemmTunableOp_Half_TN(tn_3072_4096_64) out of 1 candidates
├──verify numerics: atol=1e-05, rtol=1e-05
├──tuning using warmup iters 1 [0.281467 ms] and tuning iters 100 [28.1467 ms] instance id=0, GemmTunableOp_Half_TN(tn_3072_4096_64) Default
├──found better instance id=0. 0.169283ms. Default
└──found fastest for GemmTunableOp_Half_TN(tn_3072_4096_64) Default
GemmTunableOp_Half_TN(tn_3072_4096_64) -> Default,0.169283
missing params_signature, returning null ResultEntry

patientx commented 2 weeks ago

Are these files specific to the GPU, the model, or what? Maybe we can exchange them, at least for the same models? I have an RX 6600.

cyber827 commented 2 weeks ago

Seems to be specific to each run, similar to the zluda.db file. From the PyTorch TunableOp docs (linked below):

The first time any TunableOp is invoked, the internal database of tuned operations will be prepared by attempting to read the results from the given file. The default filename is 'tunableop_results.csv'. To support tuning when multiple GPUs are used across multiple processes, the GPU device ordinal is automatically inserted into the filename to avoid multiple processes overwriting the same file.

If tuning is enabled and new tunings are discovered during the course of your workload, it will also write out to this same filename with all tunings, both the ones it read in at startup as well as the new ones found at runtime. This can be used, for example, to build up a tunings file across many workloads by reusing the same file. The output file is automatically created when the application terminates. This behavior can be controlled by the C++ and Python APIs but not the environment variables.

Assuming you specified a filename, you'll end up with a CSV file with contents like so:

Validator,PT_VERSION,2.2.0
Validator,ROCM_VERSION,6.0.0.0-12969-1544e39
Validator,HIPBLASLT_VERSION,0.6.0-a9c5cc7
Validator,ROCBLAS_VERSION,4.0.0-72e57364-dirty
GemmTunableOp_float_NT,nt_25088_4096_64,1219,1.262
GemmTunableOp_float_NT,nt_4096_4096_64,1216,0.033
Note the "Validator" lines. If you change a library verison, or ROCm version, or PyTorch version, TunableOp will detect this and reject the tunings file because the prior tunings are likely affected by other software changes.

The remaining lines are the tuned solutions for each TunableOp encountered during your execution. Each line consists of 4 comma-separated fields: operator name, operator parameters, solution name, and average execution time. The execution time is an optional field. The CSV file can be edited, but with caution. For example, the solution name (field 3) can be changed to "Default" and it will fall back to the original PyTorch untuned implementation. Or, in the case of ROCm's hipBLAS or hipBLASLt libraries, if you know the specific solution index you can override the solution that TunableOp selected by replacing the value. The operator name and parameters (fields 1 and 2) are internally named and should not be modified. In the case of GemmTunableOp, field 1 indicates the datatype and whether the inputs are transposed (T) or not (N) and field 2 indicates the M, N, K input shapes.

There is an option to enable verbose output but it is only recommended for debugging purposes. This will produce a lot of diagnostic messages but may be useful to see if TunableOp is being used at all. Otherwise, TunableOp is completely silent, besides file output, unless there is a warning or error during its use.

https://github.com/pytorch/pytorch/tree/main/aten/src/ATen/cuda/tunable#environment-variable-interface

> Maybe we can exchange them, at least for the same models? I have an RX 6600.

I have an RX 6700, but Validator,PT_VERSION and Validator,ROCM_VERSION should be the same or similar, since they come from when ComfyUI-Zluda was installed. Could you open tunableop_results0.csv and copy-paste these lines:

Validator,PT_VERSION,
Validator,ROCM_VERSION,
Validator,HIPBLASLT_VERSION,
Validator,ROCBLAS_VERSION,

I'll try to edit the .csv file manually and run the .bat again
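
A hypothetical little helper (not part of the repo) for that comparison could just dump the Validator rows from the results file next to the local torch version, following the CSV layout quoted from the docs above:

```python
# check_tunableop_csv.py -- hypothetical helper to inspect the Validator lines
# of a TunableOp results file (illustration only).
import csv
import torch

def read_validators(path="tunableop_results0.csv"):
    """Collect the Validator rows (e.g. PT_VERSION, ROCM_VERSION) from the CSV."""
    validators = {}
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if len(row) >= 3 and row[0] == "Validator":
                validators[row[1]] = row[2]
    return validators

if __name__ == "__main__":
    for key, value in read_validators().items():
        print(f"{key}: {value}")
    # Print the local torch version for a manual comparison against PT_VERSION;
    # per the docs quoted above, a validator mismatch makes TunableOp reject the file.
    print("local torch.__version__:", torch.__version__)
```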

patientx commented 2 weeks ago

No problem sharing the whole file. Only the PT version is in there, but it is working this way.

[tunableop_results0.csv](https://github.com/user-attachments/files/16819972/tunableop_results0.csv)

pw405 commented 2 weeks ago

Early results, haven't tested much yet!

After the most recent update, generation time went from ~2 seconds/it to ~45 seconds/it when using the t5XXL_FP16 CLIP.

Using the FP8 CLIP, I'm seeing about 8 seconds/it. Odd. Still using the same torch.float16 data type.

Let me tinker a bit more and I'll see if I can get some more conclusive info.

Valekbest commented 2 weeks ago

> supported_inference_dtypes = [torch.float16, torch.float32]

It works; my 16 s/it changed to 8 s/it.

After the last update everything started working much slower. I collected some information; some change in the main branch caused a slowdown.

My settings: Euler / simple, 2 steps, 1216x832 (h/w), flux1-schnell-fp8.safetensors. My hardware: RX 6750 XT 12 GB, core 2752 MHz / VRAM 2238 MHz.

After checking out aeab6d1370ff2a0b1cd740db5fd18f667bc1cb18:

normal start: 8.13-8.23 s/it

tunableop start: 8.00-8.23 s/it after the 1st generation; tunableop_results0.csv is created but stays empty after image generation

After checking out 51af2440efd178f2b9c2dc3dc1bba6992542a8dc: unreally slow checkpoint loading (>3 min); on VAE decode I get RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR; sampler generation >200 s/it

pw405 commented 2 weeks ago

I'm having huge increases in execution time too!! I backed up the last two logs to see if I could cobble together anything helpful. Thankfully I had one from early in the morning that was working.

(I'm in Central time, US (UTC-5) FYI for Timestamp reference)

In my case, PyTorch cross attention was used by default in the newer release.

Previously, it was using sub-quadratic optimization for cross attention.

I added --use-quad-cross-attention to my command-line args and it's back to running the FP16 CLIP at about 2 seconds/it!!


Valekbest commented 2 weeks ago

Has anyone managed to launch SUPIR?

cyber827 commented 2 weeks ago

> No problem sharing the whole file. Only the PT version is in there, but it is working this way.
>
> [tunableop_results0.csv](https://github.com/user-attachments/files/16819972/tunableop_results0.csv)

Thanks, it did pick up the config from the .csv, and I managed in the end to generate the .csv too by running the cmd as admin. Unfortunately I did not notice more than a 1 s/it improvement, which could be random too. I'll keep it on for now, since VRAM consumption is decreased a bit with it.

> supported_inference_dtypes = [torch.float16, torch.float32]
>
> It works; my 16 s/it changed to 8 s/it.
>
> After the last update everything started working much slower. I collected some information; some change in the main branch caused a slowdown.
>
> My settings: Euler / simple, 2 steps, 1216x832 (h/w), flux1-schnell-fp8.safetensors. My hardware: RX 6750 XT 12 GB, core 2752 MHz / VRAM 2238 MHz.
>
> After checking out aeab6d1:
>
> normal start: 8.13-8.23 s/it
>
> tunableop start: 8.00-8.23 s/it after the 1st generation; tunableop_results0.csv is created but stays empty after image generation
>
> After checking out 51af244: unreally slow checkpoint loading (>3 min); on VAE decode I get RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR; sampler generation >200 s/it

I get an increase in generation time whenever the main branch is updated by @comfyanonymous; usually restarting the PC fixes it. It might have something to do with zluda.db or the pagefile.

> I'm having huge increases in execution time too!! I backed up the last two logs to see if I could cobble together anything helpful. Thankfully I had one from early in the morning that was working.
>
> (I'm in Central time, US (UTC-5) FYI for timestamp reference)
>
> In my case, PyTorch cross attention was used by default in the newer release.
>
> Previously, it was using sub-quadratic optimization for cross attention.
>
> I added --use-quad-cross-attention to my command-line args and it's back to running the FP16 CLIP at about 2 seconds/it!!

Have you tried using --force-fp32 or a GGUF unet? https://huggingface.co/city96/FLUX.1-dev-gguf/tree/main

patientx commented 2 weeks ago

Main ComfyUI keeps changing by the hour. I add the changes, test with just one generation, then apply them, so if there is a problem it is mostly caused by main; if it is a huge one, it is usually fixed very quickly. As for ZLUDA and small changes, I try to keep track of them.

Usually, restarting after an update, especially one which changes one of the main py files, works best.

I myself only have an RX 6600 with 8 GB VRAM and 16 GB system RAM, so when I say there is an X% speed change or that Y gives OOM problems, that could also be because of my system. With models such as Flux we are already well into dangerous open-sea territory :) at least with GPUs similar to mine.

There are two things I can suggest that might improve speed and/or memory:

1-) Using Q4 GGUF versions of both schnell and dev works great. For CLIP, use dual CLIP with clip_l first and t5xxl_fp8_e4m3fn.safetensors second. There are also GGUF versions of the T5 clips, but that didn't have much impact on memory or speed, at least on my PC.

2-) I just found out about this model, https://civitai.com/models/645943?modelVersionId=722828 , which is somehow faster than the combo in the first part. There are also other model variants there which, when used with the dual-CLIP combo from part 1, seem to be a bit better than the standard or GGUF models.

For reference, before the fp16 change I was getting around 35-40 seconds/it on my setup with both schnell and dev. After that, speed increased twofold, to 20 sec/it. Now the model I linked in the second part somehow does even better and gives me around 16 sec/it. That is almost as fast as I was getting one year ago with SDXL on the same system (there was only DirectML back then).

greedy-n5q commented 2 weeks ago

> 2-) I just found out about this model, https://civitai.com/models/645943?modelVersionId=722828 , which is somehow faster than the combo in the first part. There are also other model variants there which, when used with the dual-CLIP combo from part 1, seem to be a bit better than the standard or GGUF models.

Can you share your workflow? I get 8-9 it/s with GGUF and 16 it/s with this model on my 6800 XT.

patientx commented 2 weeks ago

> 2-) I just found out about this model, https://civitai.com/models/645943?modelVersionId=722828 , which is somehow faster than the combo in the first part. There are also other model variants there which, when used with the dual-CLIP combo from part 1, seem to be a bit better than the standard or GGUF models.
>
> Can you share your workflow? I get 8-9 it/s with GGUF and 16 it/s with this model on my 6800 XT.

If those values were reversed, i.e. sec/it, that seems about the right speed you should be getting compared to an 8 GB 6600, imo. I'm using standard workflows. The only thing I do differently is use --novram as a cmdline toggle; this is usually a bit better for me regarding OOM.

edit: https://pastebin.com/tzqCDSHZ You can change the model and step count; the unchained schnell model is also very good at just 4 steps.

weiping317 commented 2 weeks ago

> I'm having huge increases in execution time too!! I backed up the last two logs to see if I could cobble together anything helpful. Thankfully I had one from early in the morning that was working.
>
> (I'm in Central time, US (UTC-5) FYI for timestamp reference)
>
> In my case, PyTorch cross attention was used by default in the newer release.
>
> Previously, it was using sub-quadratic optimization for cross attention.
>
> I added --use-quad-cross-attention to my command-line args and it's back to running the FP16 CLIP at about 2 seconds/it!!

I tried and it worked. Thanks a lot.