nod-ai / SHARK-Studio

SHARK Studio -- Web UI for SHARK+IREE High Performance Machine Learning Distribution

[8GB 5700 XT] As of build 628 commit #4fac46f, inference performance has decreased severely #1238

Open · GTD-Carthage opened this issue 1 year ago

GTD-Carthage commented 1 year ago

Tested in the WebUI with a custom model using newly compiled flatbuffers, as well as with a cleared and rebuilt virtual environment.

On 512x768, performance has dipped from 2 it/s to around 5 s/it.
On 512x512 (tuning applied), performance has dipped from 5 it/s to around 3 s/it.

Flatbuffers from days prior seem to have worked just fine (I used one before switching between txt2img and img2img, which triggered a recompile). The performance metrics above are from testing on both txt2img and img2img.

Args passed on PowerShell: --local_tank_cache="D:\StableDiffusionNODAI\local_tank" --enable_stack_trace --no-progress_bar --vulkan_large_heap_block_size="0" --no-used_tuned --import_mlir --attention_slicing="auto" --custom_vae="D:\StableDiffusionNODAI\inference\models\KENSHI01_baked.ckpt" --device_allocator="caching"
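For reference, a full launch command with those flags might look like the sketch below. The entry point path is an assumption on my part (I launch from a repository clone); substitute whatever script you actually use.

```powershell
# Hypothetical full invocation from a repository clone -- the entry point
# (apps/stable_diffusion/web/index.py) is an assumption; the flags are the ones listed above.
python apps/stable_diffusion/web/index.py `
  --local_tank_cache="D:\StableDiffusionNODAI\local_tank" `
  --enable_stack_trace `
  --no-progress_bar `
  --vulkan_large_heap_block_size="0" `
  --no-used_tuned `
  --import_mlir `
  --attention_slicing="auto" `
  --custom_vae="D:\StableDiffusionNODAI\inference\models\KENSHI01_baked.ckpt" `
  --device_allocator="caching"
```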

Also, as an extra side note, tuning is still applied to 512x512 compiles despite the --no-used_tuned flag. Since this hasn't really been an impediment, though, I haven't reported on it separately.

monorimet commented 1 year ago

Hi, thanks so much for taking the time to report this. Have you always included the --vulkan_large_heap_block_size="0" flag? This has a sweet spot for VRAM consumption vs. performance gain. Can you try --vulkan_large_heap_block_size="1610612736"?
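(For reference, 1610612736 bytes is exactly 1.5 GiB; a quick sanity check of the conversion, nothing SHARK-specific:)

```powershell
# 1.5 GiB in bytes; PowerShell's GB suffix is binary (1GB = 1024^3 bytes).
1.5 * 1GB                  # 1610612736
1.5 * [math]::Pow(2, 30)   # same value
```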

Also, were you able to run any versions before 628 without memory issues? We have seen some issues with 8 GB RDNA2 devices since build 593.

GTD-Carthage commented 1 year ago

Yes, actually, I've been using the size="0" flag for quite a while now, and I'd like to think that's possibly why I evaded the issues brought up in 593. :)

I was also among those experimenting with the heap size in a recent discussion, but I found that assigning a larger value didn't seem to grant any significant change in performance while more VRAM got reserved. However, enabling --device_allocator="caching" definitely did (e.g. 1.5 it/s on 512x768 rose to 2.0 it/s), so I ended up keeping a heap size of 0 to leave some VRAM for other apps.
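To spell out the two configurations I compared (only the relevant flags shown; the variable names are just for illustration):

```powershell
# Larger heap block, default allocator: more VRAM reserved, no significant speedup for me.
$largerHeap = '--vulkan_large_heap_block_size="1610612736"'

# Zero heap block size plus the caching allocator: what I settled on
# (1.5 it/s -> 2.0 it/s on 512x768).
$keptConfig = '--vulkan_large_heap_block_size="0" --device_allocator="caching"'
```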

Just now, I tried assigning 1.5G to the heap size, but unfortunately this did not seem to help and in fact seems to have made inference speed even worse (the reference 3 s/it somehow went up to 4 s/it). :(

I can also confirm that compiled flatbuffers from several days prior still work fine (particularly one compiled on March 19, before 628), so I think the issue is more likely in the way flatbuffers have been compiled since then.

PS - Just in case, I have been working with the repository clone rather than the executable :)

GTD-Carthage commented 1 year ago

Just wanted to report that after some more testing, playing around with flags, and recompiling flatbuffers, I've been able to narrow down the issue: apparently having attention slicing turned on was responsible for the overall drop in performance. Flatbuffers compiled with attention slicing on had poor inference speed, while ones compiled without it performed fine even with the flag still passed. :D
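For anyone running into the same thing, the fix for me amounted to recompiling the flatbuffers without attention slicing. A minimal sketch of the launch I use now (the entry point path and the "none" value for --attention_slicing are assumptions; if your build doesn't accept "none", dropping the flag entirely should behave the same):

```powershell
# Relaunch without attention slicing so the model recompiles without it.
# Entry point and --attention_slicing="none" are assumptions -- adjust to your setup;
# the other flags and paths are the same ones from my original command above.
python apps/stable_diffusion/web/index.py `
  --local_tank_cache="D:\StableDiffusionNODAI\local_tank" `
  --vulkan_large_heap_block_size="0" `
  --device_allocator="caching" `
  --attention_slicing="none" `
  --custom_vae="D:\StableDiffusionNODAI\inference\models\KENSHI01_baked.ckpt"
```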