thu-nics / ViDiT-Q

ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation
https://a-suozhang.xyz/viditq.github.io/

Why does W8A8 run slower and use more GPU memory than fp16? #3

Leo-yang-1020 opened 1 month ago

Leo-yang-1020 commented 1 month ago

When trying to reproduce your code, we find that when running inference with the default fp16, the peak memory is about 9800 MB:

[screenshot: fp16 run, peak memory ~9800 MB]

But when running inference with W8A8 (after PTQ), the peak memory is about 9900 MB:

[screenshot: W8A8 run, peak memory ~9900 MB]

The inference speed is also much slower than fp16. Is this reasonable, or did I do something wrong?
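For reference, the peak numbers above can be measured with PyTorch's built-in memory statistics; a minimal sketch, where `run_inference` and `model` are hypothetical stand-ins for the repo's sampling entry point:

```python
import torch

torch.cuda.reset_peak_memory_stats()
with torch.no_grad():
    run_inference(model)  # hypothetical stand-in for the sampling call
peak_mb = torch.cuda.max_memory_allocated() / (1024 ** 2)
print(f"peak GPU memory: {peak_mb:.0f} MB")
```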

A-suozhang commented 1 month ago

Thank you for your interest in our work. We currently offer the code for "software quantization simulation." For actual hardware resource savings, it is essential to employ the INT CUDA kernel. We are actively working on this CUDA kernel implementation.
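To make "simulation" concrete, here is a minimal sketch of per-tensor symmetric fake quantization (quantize-dequantize). It is an illustrative example, not the repo's actual quantizer: values are snapped onto an 8-bit grid, but the tensor stays in FP16, so neither memory nor compute changes:

```python
import torch

def fake_quantize(x: torch.Tensor, n_bits: int = 8) -> torch.Tensor:
    """Quantize-dequantize: rounds x onto an n-bit grid but keeps its dtype."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.abs().max() / qmax                       # per-tensor symmetric scale
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    return (q * scale).to(x.dtype)                     # back to FP16 storage

w = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")
w_sim = fake_quantize(w)  # same dtype and memory footprint as w
```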

Leo-yang-1020 commented 1 month ago

Thanks for your reply!

Leo-yang-1020 commented 1 month ago


But I still wonder why the GPU memory didn't decrease? It was the same as fp16. According to your theory and paper, memory can be reduced by 2.4×, and from my perspective, the CUDA kernel implementation only affects the inference speed, not the memory. I tried W6A6, and it shows the same peak memory.

A-suozhang commented 1 month ago

In our current Python simulation code, the data format remains in FP16 to facilitate FP16 computations, resulting in a memory cost comparable to that of FP16.

The memory expense is composed of two components: "static," which includes the model weight parameters stored on the GPU, and "dynamic," referring to the activations stored during the computation of the current layer.
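As a rough illustration of the static component (the parameter count and numbers below are hypothetical, not measurements from this repo):

```python
# Hypothetical 700M-parameter model; back-of-the-envelope static memory.
n_params = 700e6
fp16_static_gb = n_params * 2 / 1e9  # 2 bytes per weight in FP16 -> ~1.4 GB
int8_static_gb = n_params * 1 / 1e9  # 1 byte per weight if truly packed as INT8 -> ~0.7 GB
sim_static_gb  = fp16_static_gb      # simulation keeps FP16 storage -> no saving
print(fp16_static_gb, int8_static_gb, sim_static_gb)
```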

Without a low-bit CUDA kernel, the activations must be in FP16 for FP16 computations. While it is possible to store the model weights in a low-bit format (a feature not yet implemented in our current code), these weights would need to be upcast to FP16 for the computation process.
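A minimal sketch of that upcast path, assuming plain PyTorch (the int8 weight and its scale are illustrative):

```python
import torch

# Weights stored in INT8 would save static memory, but without an INT8
# CUDA kernel they must be upcast to FP16 before the matmul, so the
# compute path (and activation memory) remains FP16.
w_int8 = torch.randint(-128, 128, (4096, 4096), dtype=torch.int8, device="cuda")
scale = torch.tensor(0.01, dtype=torch.float16, device="cuda")  # per-tensor scale

x = torch.randn(1, 4096, dtype=torch.float16, device="cuda")
w_fp16 = w_int8.to(torch.float16) * scale  # transient FP16 copy of the weights
y = x @ w_fp16.t()                         # runs as an FP16 GEMM, not an INT8 kernel
```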

Leo-yang-1020 commented 1 month ago

Thanks for your reply! Hope everything goes well with the new feature.