Currently, this implementation of FreeU uses blocking operations for each block of the unet. This is terrible for performance: it forces a device sync for every single unet block, leaving the GPU idle for substantial periods of time.
I changed the mask-creation operations to build the tensor directly on the device, and replaced the slice multiply with a plain multiply. The plugin now has about 60% less performance impact (from 6.15 it/s to 6.85 it/s, against a baseline of 7.3 it/s), and GPU usage during generation rose from 91% to 99-100% (measured with other optimizations elsewhere in A1111 that remove the blocking operations there).
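For context, here is a minimal sketch of the kind of change described above. The function names (`fourier_filter`, `scale_backbone`), the channel split, and the exact slicing are illustrative assumptions, not the plugin's actual code:

```python
import torch
import torch.fft as fft

def fourier_filter(x: torch.Tensor, threshold: int, scale: float) -> torch.Tensor:
    """FreeU-style low-frequency scaling. Illustrative sketch only."""
    x_freq = fft.fftshift(fft.fftn(x, dim=(-2, -1)), dim=(-2, -1))
    B, C, H, W = x_freq.shape
    crow, ccol = H // 2, W // 2

    # Before: mask allocated on the host, then transferred to the GPU.
    # The host-side allocation + copy stalls the dispatch queue on every
    # unet block:
    #   mask = torch.ones((B, C, H, W))
    #   mask[..., crow-threshold:crow+threshold, ccol-threshold:ccol+threshold] = scale
    #   x_freq = x_freq * mask.to(x.device)

    # After: mask created directly on the device, so the fill kernels queue
    # asynchronously and no host round-trip is needed.
    mask = torch.ones((B, C, H, W), device=x.device, dtype=x_freq.real.dtype)
    mask[..., crow - threshold:crow + threshold, ccol - threshold:ccol + threshold] = scale
    x_freq = x_freq * mask

    x_filtered = fft.ifftn(fft.ifftshift(x_freq, dim=(-2, -1)), dim=(-2, -1))
    return x_filtered.real.to(x.dtype)

def scale_backbone(h: torch.Tensor, b: float) -> torch.Tensor:
    """Scale the first half of the backbone channels. Illustrative sketch only."""
    C = h.shape[1]
    # Before: in-place slice multiply on the backbone features:
    #   h[:, :C // 2] = h[:, :C // 2] * b
    # After: one plain multiply against a per-channel scale vector built
    # on-device, which broadcasts over batch and spatial dims.
    ch_scale = torch.ones(C, device=h.device, dtype=h.dtype)
    ch_scale[: C // 2] = b
    return h * ch_scale.view(1, -1, 1, 1)
```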
This could probably be optimized even further, but at least it no longer blocks torch dispatch. The improvement may not show its full impact until other optimization work on A1111 gets merged (see https://github.com/lllyasviel/stable-diffusion-webui-forge/discussions/716).