patientx / ComfyUI-Zluda

The most powerful and modular stable diffusion GUI, API and backend with a graph/nodes interface. Now ZLUDA-enhanced for better AMD GPU performance.
GNU General Public License v3.0

Very Slow Performance #39

Closed GoudaCouda closed 2 weeks ago

GoudaCouda commented 2 weeks ago

Your question

I am running with 40 GB of RAM and 12 GB of VRAM on a 6700 XT. I am getting around 9 s/it, and it takes around 3 minutes to generate an image on Flux Q3. Are these normal speeds when using ZLUDA? I am using quad cross with euler simple. Are there any optimizations?

Logs

2024-11-07 13:29:56.514 [Debug] [ComfyUI-1/STDERR] Unloading models for lowram load.
2024-11-07 13:29:56.803 [Debug] [ComfyUI-1/STDERR] 0 models unloaded.
2024-11-07 13:29:56.806 [Debug] [ComfyUI-1/STDERR] 
2024-11-07 13:30:05.870 [Debug] [ComfyUI-1/STDERR]   0%|          | 0/20 [00:00<?, ?it/s]
2024-11-07 13:30:14.734 [Debug] [ComfyUI-1/STDERR]   5%|▌         | 1/20 [00:09<02:52,  9.06s/it]
2024-11-07 13:30:23.634 [Debug] [ComfyUI-1/STDERR]  10%|█         | 2/20 [00:17<02:41,  8.95s/it]
2024-11-07 13:30:32.533 [Debug] [ComfyUI-1/STDERR]  15%|█▌        | 3/20 [00:26<02:31,  8.93s/it]
2024-11-07 13:30:41.486 [Debug] [ComfyUI-1/STDERR]  20%|██        | 4/20 [00:35<02:22,  8.91s/it]
2024-11-07 13:30:50.363 [Debug] [ComfyUI-1/STDERR]  25%|██▌       | 5/20 [00:44<02:13,  8.93s/it]
2024-11-07 13:30:59.213 [Debug] [ComfyUI-1/STDERR]  30%|███       | 6/20 [00:53<02:04,  8.91s/it]
2024-11-07 13:31:08.077 [Debug] [ComfyUI-1/STDERR]  35%|███▌      | 7/20 [01:02<01:55,  8.89s/it]
2024-11-07 13:31:16.938 [Debug] [ComfyUI-1/STDERR]  40%|████      | 8/20 [01:11<01:46,  8.88s/it]
2024-11-07 13:31:25.812 [Debug] [ComfyUI-1/STDERR]  45%|████▌     | 9/20 [01:20<01:37,  8.88s/it]
2024-11-07 13:31:34.715 [Debug] [ComfyUI-1/STDERR]  50%|█████     | 10/20 [01:29<01:28,  8.87s/it]
2024-11-07 13:31:43.599 [Debug] [ComfyUI-1/STDERR]  55%|█████▌    | 11/20 [01:37<01:19,  8.88s/it]
2024-11-07 13:31:52.470 [Debug] [ComfyUI-1/STDERR]  60%|██████    | 12/20 [01:46<01:11,  8.88s/it]
2024-11-07 13:32:01.336 [Debug] [ComfyUI-1/STDERR]  65%|██████▌   | 13/20 [01:55<01:02,  8.88s/it]
2024-11-07 13:32:10.211 [Debug] [ComfyUI-1/STDERR]  70%|███████   | 14/20 [02:04<00:53,  8.88s/it]
2024-11-07 13:32:19.108 [Debug] [ComfyUI-1/STDERR]  75%|███████▌  | 15/20 [02:13<00:44,  8.88s/it]
2024-11-07 13:32:28.038 [Debug] [ComfyUI-1/STDERR]  80%|████████  | 16/20 [02:22<00:35,  8.88s/it]
2024-11-07 13:32:36.976 [Debug] [ComfyUI-1/STDERR]  85%|████████▌ | 17/20 [02:31<00:26,  8.90s/it]
2024-11-07 13:32:45.845 [Debug] [ComfyUI-1/STDERR]  90%|█████████ | 18/20 [02:40<00:17,  8.91s/it]
2024-11-07 13:32:54.700 [Debug] [ComfyUI-1/STDERR]  95%|█████████▌| 19/20 [02:49<00:08,  8.90s/it]
2024-11-07 13:32:54.700 [Debug] [ComfyUI-1/STDERR] 100%|██████████| 20/20 [02:57<00:00,  8.88s/it]
2024-11-07 13:32:54.700 [Debug] [ComfyUI-1/STDERR] 100%|██████████| 20/20 [02:57<00:00,  8.89s/it]
2024-11-07 13:32:54.701 [Debug] [ComfyUI-1/STDERR] Requested to load AutoencodingEngine
2024-11-07 13:32:54.701 [Debug] [ComfyUI-1/STDERR] Loading 1 new model
2024-11-07 13:32:54.838 [Debug] [ComfyUI-1/STDERR] loaded completely 0.0 159.87335777282715 True
2024-11-07 13:33:06.308 [Debug] [ComfyUI-1/STDERR] Prompt executed in 189.92 seconds


patientx commented 2 weeks ago

Normal. I get around 11-12 s/it with an RX 6600. Try this one: https://civitai.com/models/645943?modelVersionId=768009 It is a hybrid model of dev and schnell and can give good results even at 8 steps or less. You can also use LoRAs with it, although I suggest merging the LoRAs into the model: if you use a LoRA in a workflow the overall speed drops, so merge your best LoRAs into your model and keep that merged model in use. The model goes into the unet folder under models; use the same DualCLIP loader, and use the UNET loader instead of the GGUF loader. I get the same speeds whether I use GGUF or these fp8 models, even with my 8 GB of VRAM.
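
For anyone unfamiliar with baking a LoRA into a checkpoint, here is a minimal sketch of the idea, assuming the common low-rank layout where each LoRA pair adds `scale * (up @ down)` to a base weight. The file names and key-naming scheme below are hypothetical; real Flux and GGUF checkpoints use different key layouts, and in practice a merge node or merge script does this mapping for you.

```python
# Minimal sketch of baking a LoRA into a base model, assuming a simple
# "<key>.lora_down.weight" / "<key>.lora_up.weight" naming scheme.
# File names and key layout are hypothetical, for illustration only.
import torch
from safetensors.torch import load_file, save_file

base = load_file("flux-hybrid.safetensors")    # hypothetical base model
lora = load_file("my-style-lora.safetensors")  # hypothetical LoRA file
scale = 1.0                                    # LoRA strength to bake in

for key in list(base.keys()):
    down = lora.get(f"{key}.lora_down.weight")  # shape (rank, in_features)
    up = lora.get(f"{key}.lora_up.weight")      # shape (out_features, rank)
    if down is None or up is None:
        continue
    # Bake the low-rank update into the weight: W' = W + scale * (up @ down)
    merged = base[key].float() + scale * (up.float() @ down.float())
    base[key] = merged.to(base[key].dtype)

save_file(base, "flux-hybrid-merged.safetensors")
```

Once merged, the LoRA costs nothing at inference time, which is why keeping the merged model in use avoids the per-step slowdown of applying the LoRA in the workflow.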
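And a rough sketch of how the loader side looks in ComfyUI's API prompt format: `UNETLoader` and `DualCLIPLoader` are the built-in node class names, but the model and CLIP file names here are placeholders, and the fragment would still need the usual sampler and VAE nodes to run.

```python
# A fragment, not a complete workflow: the merged model is loaded from
# models/unet with UNETLoader, while the text encoders stay on the same
# DualCLIPLoader as in the GGUF workflow. File names are placeholders.
import json

prompt_fragment = {
    "1": {
        "class_type": "UNETLoader",      # replaces the GGUF loader node
        "inputs": {
            "unet_name": "flux-hybrid-merged.safetensors",  # in models/unet
            "weight_dtype": "fp8_e4m3fn",
        },
    },
    "2": {
        "class_type": "DualCLIPLoader",  # unchanged from the GGUF setup
        "inputs": {
            "clip_name1": "t5xxl_fp8_e4m3fn.safetensors",
            "clip_name2": "clip_l.safetensors",
            "type": "flux",
        },
    },
    # ... KSampler, VAE, and decode nodes as in the existing workflow
}

print(json.dumps(prompt_fragment, indent=2))
```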