huggingface / optimum-quanto

A pytorch quantization backend for optimum
Apache License 2.0

Why is the quantized net slower? #184

Closed theguardsgod closed 2 weeks ago

theguardsgod commented 2 months ago

batch_size: 1, torch_dtype: fp32, unet_dtype: int8 in 3.754 seconds. Memory: 5.240GB.

batch_size: 1, torch_dtype: fp32, unet_dtype: None in 3.378 seconds. Memory: 6.073GB.

I'm using the example code for stable diffusion, but inference is slower for the quantized int8 version (I've also tested the speed on my own model, and quantization results in higher VRAM usage and slower inference). Why is that the case?
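For context, a minimal sketch of the quantization flow the stable diffusion example follows, using the public optimum-quanto API (`quantize`, `freeze`, `qint8`); the model id and generation settings here are illustrative assumptions, not the exact benchmark configuration:

```python
# Sketch: quantize only the UNet of a diffusers pipeline to int8 weights.
# Assumes diffusers and optimum-quanto are installed and a CUDA GPU is available.
import torch
from diffusers import StableDiffusionPipeline
from optimum.quanto import freeze, qint8, quantize

# Illustrative checkpoint; the issue's benchmark may use a different one.
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float32
).to("cuda")

# Quantize the UNet weights to int8, then freeze to replace the float
# weights with their quantized counterparts before inference.
quantize(pipe.unet, weights=qint8)
freeze(pipe.unet)

image = pipe(
    "a photo of an astronaut riding a horse", num_inference_steps=30
).images[0]
```

With weight-only int8 quantization, weights are typically dequantized back to the compute dtype at matmul time, so per-step latency can be higher than the fp32 baseline even though the stored weights are smaller.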

canamika27 commented 1 month ago

same observation

github-actions[bot] commented 3 weeks ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] commented 2 weeks ago

This issue was closed because it has been stalled for 5 days with no activity.