hatchetProject/QuEST

QuEST: Efficient Finetuning for Low-bit Diffusion Models

Can you provide inference time data for Stable Diffusion (w4a4 vs full precision fp32) on GPU/CPU? #8

Open · opened 2 months ago by badhri-intel

badhri-intel commented 2 months ago

The paper states that W4A4 quantization can theoretically yield an 8x inference speedup. Could you confirm whether this holds for Stable Diffusion, or share what speedup (inference latency) you actually observed? Thanks!
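
For reference, this is roughly how I would measure the end-to-end FP32 baseline latency. This is a minimal sketch, assuming the `diffusers` `StableDiffusionPipeline` with the `runwayml/stable-diffusion-v1-5` checkpoint on a CUDA GPU; the prompt and run counts are arbitrary. Note that the W4A4 side can't be timed this way from simulated (fake) quantization, since that still executes at full precision under the hood; a realized speedup would need actual low-bit GEMM kernels.

```python
import torch
from diffusers import StableDiffusionPipeline

def benchmark_pipeline(pipe, prompt, n_warmup=2, n_runs=5, steps=50):
    """Return average latency in ms per image for the given pipeline."""
    # Warm-up runs so compilation/caching overhead is excluded from timing.
    for _ in range(n_warmup):
        pipe(prompt, num_inference_steps=steps)
    torch.cuda.synchronize()

    # CUDA events give accurate GPU-side timing across the timed runs.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(n_runs):
        pipe(prompt, num_inference_steps=steps)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / n_runs  # elapsed_time is in ms

# FP32 baseline; model id and prompt are placeholders for illustration.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float32
).to("cuda")
latency_fp32 = benchmark_pipeline(pipe, "a photo of an astronaut riding a horse")
print(f"FP32 latency: {latency_fp32:.1f} ms/image")
```

The same harness could time a W4A4 build of the model for a like-for-like comparison, provided the quantized layers dispatch to real INT4 kernels rather than quantize-dequantize simulation.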