VeteranXT opened this issue 3 months ago
I recently got it working with ZLUDA and ran into the same problem.
Me too. The progress bar rushes to 100% in a couple of seconds, but after that it takes its normal time for decoding.
I temporarily fixed it by enabling the refiner or ADetailer.
I already had ADetailer enabled by default when the problem occurred.
The problem persists in my case.
I came here for this. I was shocked: 20+ it/s, but the VAE step takes 3 times as long as it used to.
The VAE is fine; it's a bug that shows an unrealistic progress speed.
Confirming the problem. I conducted a series of experiments.
Since only some users are affected, it seems to be related to ZLUDA. I'm using a Ryzen 9 6900HX Rembrandt APU, Windows 11, Python 3.10.11, and HIP SDK 6.1.2 + ROCmLibs for gfx1035 with ZLUDA.
I'm using an older version of the HIP SDK. I never updated HIP/ROCmLibs.
With ControlNet enabled, the preview is okay.
Interestingly, I noticed the UniPC scheduler does not show the issue. Certain extensions also cause the timing to become realistic again.
From the CPU's point of view, the GPU is an I/O device. The CPU asks the GPU to execute work, but it does not actually know when that work has finished, so it has to synchronize with the GPU's state. For performance reasons it does not synchronize on every call, which means the programmer has to synchronize explicitly to get correct results back from the GPU. Fortunately, torch does this synchronization for us: it synchronizes, for example, when we print a tensor, detach a tensor from the GPU, and so on. In your case, for some reason, the synchronization isn't happening during generation (at each sampler step). However, to convert the final latent into an image, a synchronization has to happen at least at the last tensor detachment. At that point the GPU still has a large backlog of queued work and very little of it is done, so that last synchronization takes a very long time.

For now, the reason why the synchronization fails is unknown; I haven't tried to track it down yet. It may be a bug in AMD Comgr or in ZLUDA itself. It seems to appear suddenly and can disappear at any time, so I can't give you a cause or a clear solution at this moment.
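To illustrate the effect (this is a minimal sketch, not code from the webui or ZLUDA; `run_step` is a hypothetical stand-in for one sampler step): torch queues GPU kernels asynchronously, so timing only the Python loop measures how fast work is submitted, while adding `torch.cuda.synchronize()` measures the work actually done.

```python
import time

import torch

def run_step(x, w):
    # Stand-in for one sampler step: a large matmul queued asynchronously on the GPU.
    return x @ w

device = "cuda"  # ZLUDA presents the AMD GPU to torch as a CUDA device
x = torch.randn(4096, 4096, device=device, dtype=torch.float16)
w = torch.randn(4096, 4096, device=device, dtype=torch.float16)

# Naive timing: the loop finishes as soon as the kernels are *queued*,
# which is what produces an unrealistically high it/s figure.
t0 = time.perf_counter()
for _ in range(20):
    y = run_step(x, w)
print(f"queued in {time.perf_counter() - t0:.3f}s (misleading)")

# Synchronized timing: drain the GPU queue before and after the loop,
# so the measured time reflects the work actually performed.
torch.cuda.synchronize()
t0 = time.perf_counter()
for _ in range(20):
    y = run_step(x, w)
torch.cuda.synchronize()
print(f"finished in {time.perf_counter() - t0:.3f}s (realistic)")
```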
Thanks for the explanation.
I think it is a VAE problem with ZLUDA or AMD ROCm.
I use ComfyUI. After I switched from DirectML to ZLUDA, the sampler works 2-3x faster, but the VAE is very slow.
ComfyUI can show the time used by each node.
I also added some debug info in the sd.py vae.decode function and found that the slow stage is the VAE, not the sampler.
RT: VAE decode memory_used=3853910016 free_memory=4493349376 batch_number=1 vae_dtype=torch.float16
For a 768x1152 VAE decode, DirectML takes 0.5-1s, but ZLUDA takes 6-8s.
Then I added more tests and found that ZLUDA's VAE speed is strongly related to image resolution. Here are some ZLUDA VAE decode results:
- 512x512: 0.5-1s
- 768x768: 3-4s
- 960x960: 6-8s
- 1024x1024: 13-15s
DirectML takes about 1s regardless of resolution.
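For reference, this is the kind of measurement I mean: a sketch of timing a single VAE decode with explicit synchronization so the queued GPU work can't hide in a later step. `timed_vae_decode` and `vae` are hypothetical here; I'm assuming a diffusers-style `vae.decode(latents)` call, and the actual ComfyUI internals in sd.py differ, but the idea is the same.

```python
import time

import torch

def timed_vae_decode(vae, latents):
    # Drain any previously queued GPU work so it isn't counted against the VAE.
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    with torch.no_grad():
        image = vae.decode(latents)
    # Wait for the decode kernels to actually finish before reading the clock.
    torch.cuda.synchronize()
    return image, time.perf_counter() - t0

# Hypothetical usage: SD latents have 4 channels at 1/8 of the output resolution,
# so a 1024x1024 image corresponds to a 1x4x128x128 latent.
# for res in (512, 768, 960, 1024):
#     latents = torch.randn(1, 4, res // 8, res // 8, device="cuda", dtype=torch.float16)
#     _, seconds = timed_vae_decode(vae, latents)
#     print(f"{res}x{res}: {seconds:.2f}s")
```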
Checklist
What happened?
Generation appears to run very fast, but it really doesn't; it takes the usual time. For example, txt2img took about 7 seconds on older commits. Now it shows 23 it/s instead of 2-3 it/s, but the VAE decode takes the same time as before. I did a fresh install to check, and generation appears about 10x faster, but the VAE behaves the same for some reason. A 1576x1576 image shows as done in 20 seconds, but then it stalls at the last step while the VAE is still decoding. Also, Live Preview can't show anything because of this "speed".
Steps to reproduce the problem
Run any prompt and watch the console.
What should have happened?
It should show the actual speed.
What browsers do you use to access the UI?
Firefox.
Sysinfo
sysinfo-2024-07-27-00-00.json
Console logs
Additional information
No response