ljleb / sd-webui-freeu

a1111 implementation of https://github.com/ChenyangSi/FreeU
MIT License

Optimizations #47

Closed: drhead closed this 3 months ago

drhead commented 3 months ago

Currently, this implementation of FreeU uses blocking operations for each block of the unet. This is terrible for performance: it forces a device sync for every single unet block, which leaves the GPU idle for substantial periods of time.
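Roughly, the problematic pattern looks like this (a minimal sketch; the function name and shapes are illustrative, not the extension's actual code):

```python
import torch

def fourier_filter_blocking(x_freq, threshold, scale):
    # Illustrative only: the mask is allocated on the CPU, then copied
    # to the GPU for every unet block that applies the skip filter.
    B, C, H, W = x_freq.shape
    mask = torch.ones((B, C, H, W))  # CPU tensor
    crow, ccol = H // 2, W // 2
    mask[..., crow - threshold:crow + threshold,
              ccol - threshold:ccol + threshold] = scale
    # The default (non_blocking=False) host-to-device copy makes the
    # host wait for the transfer, draining the async dispatch queue.
    return x_freq * mask.to(x_freq.device)
```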

I changed the operations that create the mask so the tensor is created directly on the device, and changed the slice multiply to a plain multiply. This cuts the plugin's performance impact by about 60% (from 6.15 it/s to 6.85 it/s, against a baseline of 7.3 it/s). GPU usage during generation also increased from 91% to 99-100% (with other optimizations elsewhere in A1111 that remove its blocking operations).
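The change amounts to something like the following (again a sketch that mirrors the description above, not necessarily the exact diff):

```python
import torch

def fourier_filter_on_device(x_freq, threshold, scale):
    # Mask is built directly on the same device as x_freq, so nothing
    # has to cross the host-device boundary during sampling.
    _, _, H, W = x_freq.shape
    mask = torch.ones((H, W), device=x_freq.device)
    crow, ccol = H // 2, W // 2
    mask[crow - threshold:crow + threshold,
         ccol - threshold:ccol + threshold] = scale
    # Plain broadcasted multiply over the full tensor instead of an
    # in-place slice multiply; everything stays queued asynchronously.
    return x_freq * mask
```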

This could probably be optimized even further, but at least it no longer blocks torch dispatch. The impact may not be as noticeable until other optimization work on A1111 is merged (see https://github.com/lllyasviel/stable-diffusion-webui-forge/discussions/716).

ljleb commented 3 months ago

Thanks for the contribution.