Closed gsgoldma closed 1 year ago
@gsgoldma - are you by chance using WSL2? I had that problem but after changing some of the CUDA settings and doing the latest git pull
I no longer have the problem. What CUDA settings are you using?
11.8 cuda, no wsl2
Have same trouble
Do either of you have the CUDA option "Enable cuDNN benchmark feature" enabled? Try disabling that or the "Use channels last as torch memory format" option.
Yes, and that worked- thank you! it reduced the time of the initialization before hand too!
Confirm, it helped! Thanks!
do note that the "Enable cuDNN benchmark" feature is disabled by default for a reason.
what it does is tell CUDA to try all the different options for optimizing math operations before selecting the best one, so it's totally expected that the initial execution will have a delay (while CUDA is internally benchmarking itself).
is there anything else to be done in this issue?
Is it possible to run the benchmark once programmatically, record the results, and reuse them afterwards? Does it make sense to look for the best option on the same hardware every time?
that is a totally valid request and i have that conversation open with the torch team :) it's not something that can be done at the app level.
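For readers wondering what the request would look like in principle, here is a minimal sketch of a "benchmark once, record, reuse" cache in plain Python. Everything in it is an assumption for illustration: `pick_fastest`, `load_or_benchmark`, the JSON file, and the `"my-gpu"` key are hypothetical and do not correspond to any torch or cuDNN API. As said above, the real algorithm selection happens inside torch and cannot be hooked like this from the app.

```python
import json
import os
import tempfile
import time

def pick_fastest(candidates, arg):
    """Time each candidate once and return the name of the fastest."""
    timings = {}
    for name, fn in candidates.items():
        start = time.perf_counter()
        fn(arg)
        timings[name] = time.perf_counter() - start
    return min(timings, key=timings.get)

def load_or_benchmark(cache_path, key, candidates, arg):
    """Return (winner_name, benchmark_ran), benchmarking only if key is not on disk."""
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            cache = json.load(f)
    ran = key not in cache
    if ran:
        cache[key] = pick_fastest(candidates, arg)
        with open(cache_path, "w") as f:
            json.dump(cache, f)
    return cache[key], ran

def demo():
    # Two hypothetical implementations of the same computation.
    candidates = {
        "loop": lambda n: sum(i * i for i in range(n)),
        "closed_form": lambda n: (n - 1) * n * (2 * n - 1) // 6,
    }
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "tuning.json")
        first = load_or_benchmark(path, "my-gpu", candidates, 10_000)
        second = load_or_benchmark(path, "my-gpu", candidates, 10_000)
    return first, second
```

In this sketch the second call skips the benchmark because the result was recorded on disk. A real version would also need a key that changes with the hardware, driver, and library versions, so stale results are not reused.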
And it does work well once it's completed: I went from 1.66 it/s to over 2 it/s. What's weird is that it generates the image first and then freezes, instead of freezing before it makes it.
depending on circumstances, it will run the benchmark before/during/after, since they all trigger different torch operations. it also depends on which sampler you use. for example, unipc triggers the denoiser at the end only, so if the benchmark is optimizing ops inside the denoiser, it will appear slow/stuck near the end only.
btw, i'll close the issue as root cause has been found, but feel free to post updates and i'll reopen if needed.
Issue Description
This only occurs with the first image that the program makes. All the subsequent ones immediately load into the UI.
This is not the initialization at the beginning, but after the image shows 100% completion in the console, it may take 20-30 seconds for it to update onto the WebUI.
Version Platform Description
Windows 10