lshqqytiger / stable-diffusion-webui-amdgpu

Stable Diffusion web UI
GNU Affero General Public License v3.0

[Bug]: Total Progress is much faster than usual. Shows 10x speed but it's not. AMD ZLUDA #502

Open VeteranXT opened 3 months ago

VeteranXT commented 3 months ago

What happened?

Generation appears to run very fast, but it really does not. It runs at the usual speed. For example, txt2img takes 7 seconds on old commits. Now the console shows 23 it/s instead of 2-3 it/s, but the VAE decode takes the same time as before. I did a fresh install to check, and the sampling steps appear to finish 10x faster, but the VAE takes the same time for some reason. A 1576x1576 image shows as done in 20 seconds, but the progress stops at the last step while the VAE is still decoding. Also, Live Preview can't show anything because of the "speed".

Steps to reproduce the problem

Run any prompt and watch the console.

What should have happened?

Show actual speed.

What browsers do you use to access the UI?

Firefox.

Sysinfo

sysinfo-2024-07-27-00-00.json

Console logs

No errors

Additional information

No response

nicodem09 commented 3 months ago

I recently got it working with ZLUDA and have the same problem.

Morpheus-79 commented 3 months ago

Me too. The progress bar rushes to 100% in a couple of seconds. But after that it takes its normal time for decoding.

nicodem09 commented 3 months ago

I temporarily fixed it by enabling refiner or adetailer

Morpheus-79 commented 3 months ago

I already had adetailer enabled by default when the problem occurred.

nicodem09 commented 3 months ago

The problem persists in my case.

Elise96nl commented 3 months ago

I came here for this. I was shocked: 20+ it/s, but the VAE process takes 3 times as long as it used to.

VeteranXT commented 3 months ago

> I came here for this. I was shocked: 20+ it/s, but the VAE process takes 3 times as long as it used to.

The VAE is fine. It's a bug that shows an unrealistic progress speed.

Kargim commented 3 months ago

Confirming the problem. Conducted a series of experiments.

  1. I was on commit “371f53e...0bde866”. The preview window works without problems, and the speed and console display are fine.
  2. I tried installing commit “61aa844...67fdead”, which comes before the upgrade to 1.10. It has an ONNX error, but that is solved by applying “--skip-ort”. The preview window works without problems, and the speed and console display are fine.
  3. If I update to commit “67fdead...235a1ff” (version 1.10) or higher, the preview window breaks immediately, and the problems with speed and console display appear. =(

Radeon RX 5500 XT 8 GB, Windows 10, Python 3.10.11, HIP SDK 5.7.1 + ROCmLibs for old cards, ZLUDA

Morpheus-79 commented 2 months ago

Since only some users are affected: it seems to be related to ZLUDA. I'm using a Ryzen 9 6900HX Rembrandt APU, Windows 11, Python 3.10.11, HIP SDK 6.1.2 + ROCmLibs for gfx1035 with ZLUDA.

VeteranXT commented 2 months ago

I'm using an older version of the HIP SDK. I never updated HIP/ROCmLibs.

VeteranXT commented 2 months ago

With ControlNet enabled, the preview is okay.

ride5k commented 2 months ago

Interestingly, I noticed that the UniPC scheduler does not show the issue. Certain extensions also cause the reported timing to become realistic.

lshqqytiger commented 4 weeks ago

From the CPU's point of view, the GPU is an I/O device. When the CPU asks the GPU to execute something, the CPU does not automatically know whether the requested task has finished yet. The CPU therefore has to synchronize with the state of the GPU. It does not synchronize on every call, for performance reasons, which means the programmer has to synchronize explicitly in order to get correct results from the GPU.

Fortunately, torch does this synchronization work for us. It synchronizes, for example, when we print a tensor or detach a tensor from the GPU. In your case, for some reason, the synchronization did not happen during generation (in each sampler step). However, converting the final latent to an image requires synchronization at the last tensor detachment at the latest. By the time that synchronization happens, the GPU has many tasks queued and has finished very few of them, so "the last" synchronization takes a really long time.

For now, the reason why synchronization fails is unknown, and I haven't tried to find it yet. It may be a bug in AMD Comgr or in ZLUDA itself. It seems to appear and disappear unpredictably, so I can't give you a cause or a clear solution at this moment.
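
The effect described above can be imitated without a GPU. The following is only a toy sketch using the Python standard library: `FakeDevice` and `run_sampler` are made-up names for illustration, not part of torch, ZLUDA, or the web UI. A background thread plays the role of the GPU, `launch()` returns immediately like an asynchronous kernel launch, and `synchronize()` blocks until the queue is drained.

```python
import queue
import threading
import time


class FakeDevice:
    """Toy stand-in for a GPU: runs queued tasks on a background thread."""

    def __init__(self, seconds_per_task):
        self.seconds_per_task = seconds_per_task
        self._tasks = queue.Queue()
        worker = threading.Thread(target=self._run, daemon=True)
        worker.start()

    def _run(self):
        while True:
            self._tasks.get()
            time.sleep(self.seconds_per_task)  # pretend to compute one step
            self._tasks.task_done()

    def launch(self):
        """Enqueue one task and return immediately (asynchronous launch)."""
        self._tasks.put(None)

    def synchronize(self):
        """Block until every queued task has finished."""
        self._tasks.join()


def run_sampler(device, steps, sync_each_step):
    """Return (seconds spent in the step loop, seconds spent in the final sync)."""
    start = time.perf_counter()
    for _ in range(steps):
        device.launch()
        if sync_each_step:
            device.synchronize()
    loop_time = time.perf_counter() - start

    start = time.perf_counter()
    device.synchronize()  # stands in for the last tensor detachment
    final_sync_time = time.perf_counter() - start
    return loop_time, final_sync_time


if __name__ == "__main__":
    for sync_each_step in (True, False):
        loop_time, final_sync_time = run_sampler(FakeDevice(0.02), 20, sync_each_step)
        print(f"sync_each_step={sync_each_step}: "
              f"apparent speed {20 / loop_time:.0f} it/s, "
              f"final sync {final_sync_time:.2f} s")
```

With per-step synchronization, the loop time reflects the real work and the final synchronization is nearly free. Without it, the loop finishes almost instantly (an inflated it/s figure) and the whole cost shows up in the final synchronization, which matches the "progress bar rushes to 100%, then waits" symptom reported in this thread.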

VeteranXT commented 4 weeks ago

Thanks for the explanation.

roytan883 commented 3 weeks ago

I think it is a VAE problem with ZLUDA or AMD ROCm. I use ComfyUI. After I switched from DirectML to ZLUDA, the sampler runs 2-3x faster, but the VAE is very slow. ComfyUI can show the time used by each node. I also added some debug output to the vae.decode function in sd.py and found that the slow part is the VAE stage, not the sampler.

RT: VAE decode memory_used=3853910016 free_memory=4493349376 batch_number=1 vae_dtype=torch.float16

For a 768x1152 VAE decode, DirectML takes 0.5-1s, but ZLUDA takes 6-8s.
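
One caveat when timing stages like this: if GPU work is queued asynchronously, as lshqqytiger describes earlier in this thread, a plain wall-clock timer around the decode call may also include sampler work that had not finished yet. Below is a minimal sketch of a timing helper that waits for the device before starting and before stopping the clock. The name `timed` and its parameters are hypothetical, and the `time.sleep` calls are placeholders for the real stages. In a PyTorch setup, `torch.cuda.synchronize` could presumably be passed as `sync`, assuming the ZLUDA device is exposed to torch as a CUDA device.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results, sync=None):
    """Store wall-clock seconds for the enclosed block in results[label].

    sync is an optional zero-argument callable that waits for the device.
    It is called once before the clock starts and once before it stops.
    """
    if sync is not None:
        sync()  # drain earlier queued work so it is not billed to this block
    start = time.perf_counter()
    try:
        yield
    finally:
        if sync is not None:
            sync()  # wait for this block's own work to actually finish
        results[label] = time.perf_counter() - start


if __name__ == "__main__":
    timings = {}
    with timed("sampler", timings):
        time.sleep(0.05)  # placeholder for the sampling loop
    with timed("vae_decode", timings):
        time.sleep(0.02)  # placeholder for the VAE decode call
    for name, seconds in timings.items():
        print(f"{name}: {seconds:.3f} s")
```

If the decode time measured this way is still 6-8s, the slowness is in the VAE itself. If it drops, part of the time was leftover sampler work being attributed to the decode.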

Then I ran more tests and found that the ZLUDA VAE decode time depends strongly on image resolution. Here are some ZLUDA VAE decode results:

512x512: 0.5-1s
768x768: 3-4s
960x960: 6-8s
1024x1024: 13-15s

DirectML takes about 1s regardless of the resolution.
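
To put the ZLUDA numbers above in perspective, here is a small sketch that divides each reported decode time by the pixel count. Taking the midpoint of each reported range is an assumption made only for this illustration, and `seconds_per_megapixel` is a made-up helper name.

```python
# ZLUDA VAE decode times reported above, as (low, high) seconds per resolution.
reported = {
    (512, 512): (0.5, 1.0),
    (768, 768): (3.0, 4.0),
    (960, 960): (6.0, 8.0),
    (1024, 1024): (13.0, 15.0),
}


def seconds_per_megapixel(reported_times):
    """Map each resolution to (midpoint decode time) / (megapixels)."""
    rates = {}
    for (width, height), (low, high) in reported_times.items():
        megapixels = width * height / 1_000_000
        midpoint = (low + high) / 2  # assumption: use the middle of the range
        rates[(width, height)] = midpoint / megapixels
    return rates


if __name__ == "__main__":
    for (width, height), rate in seconds_per_megapixel(reported).items():
        print(f"{width}x{height}: {rate:.1f} s per megapixel")
```

The cost per megapixel rises from about 2.9 s at 512x512 to about 13.4 s at 1024x1024. In other words, 4x the pixels takes roughly 19x the time at the midpoints, so the decode time grows faster than linearly with image area.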