leimao / PyTorch-Static-Quantization

PyTorch Static Quantization Example
https://leimao.github.io/blog/PyTorch-Static-Quantization/
MIT License

Question about the quantization #2

Open SangbumChoi opened 2 years ago

SangbumChoi commented 2 years ago

Hi, thanks for the great code.

It works well without any modification.

However, the results I get seem curious. My GPU is a Tesla V100-SXM2-32GB and my CPU is an Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz. The OS is Linux.

First trial

FP32 CPU Inference Latency: 3.54 ms / sample
FP32 CUDA Inference Latency: 3.92 ms / sample
INT8 CPU Inference Latency: 11.76 ms / sample
INT8 JIT CPU Inference Latency: 4.50 ms / sample

Second trial

FP32 CPU Inference Latency: 3.70 ms / sample
FP32 CUDA Inference Latency: 3.87 ms / sample
INT8 CPU Inference Latency: 9.38 ms / sample
INT8 JIT CPU Inference Latency: 6.60 ms / sample

Third trial

FP32 CPU Inference Latency: 3.88 ms / sample
FP32 CUDA Inference Latency: 3.92 ms / sample
INT8 CPU Inference Latency: 19.98 ms / sample
INT8 JIT CPU Inference Latency: 4.65 ms / sample

Those are the results I got from running your code. I expected the INT8 models to be much faster than the FP32 ones. Do you have any explanation or idea about the situation above?

aj2563 commented 2 years ago

I am also seeing similar results. Is there a simple explanation for this?

zhanghongjie101 commented 1 year ago

I think they are affected by the multithreading of torch; you can compare by setting torch.set_num_threads(1). With that I got:

FP32 CPU Inference Latency: 6.59 ms / sample
INT8 CPU Inference Latency: 3.03 ms / sample
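For reference, a minimal sketch of how that comparison might look. This is not the repo's exact benchmarking helper; the input size (1, 3, 32, 32), warm-up count, and iteration count are assumptions. The key point is to pin PyTorch to a single intra-op thread before timing either CPU model, so the FP32 and INT8 numbers are measured under the same conditions.

```python
import time
import torch

# Pin intra-op parallelism to one thread before any CPU timing,
# so the FP32 and INT8 models are benchmarked under the same conditions.
torch.set_num_threads(1)

def measure_cpu_latency_ms(model, input_size=(1, 3, 32, 32), num_samples=100):
    """Average wall-clock latency in ms per sample for a CPU model (assumed helper)."""
    model.eval()
    x = torch.rand(input_size)
    with torch.no_grad():
        for _ in range(10):          # warm-up iterations
            _ = model(x)
        start = time.time()
        for _ in range(num_samples):
            _ = model(x)
        elapsed = time.time() - start
    return elapsed / num_samples * 1000
```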

Apisteftos commented 1 year ago

I ran the code without any modification or calibration and got these results on my RTX 1070:

FP32 evaluation accuracy: 0.781
INT8 evaluation accuracy: 0.779
FP32 CPU Inference Latency: 6.41 ms / sample
FP32 CUDA Inference Latency: 3.01 ms / sample
INT8 CPU Inference Latency: 2.67 ms / sample
INT8 JIT CPU Inference Latency: 0.91 ms / sample

jiunyen-ching commented 3 months ago

> I think they are affected by the multithreading of torch; you can compare by setting torch.set_num_threads(1). With that I got:
> FP32 CPU Inference Latency: 6.59 ms / sample
> INT8 CPU Inference Latency: 3.03 ms / sample

Seems weird that the latency issue is fixed with this since both inferences are performed on the CPU. Setting any number of threads should affect them equally...
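One way to probe this is to sweep the thread count and time both models at each setting; the FP32 and INT8 CPU kernels do not necessarily scale the same way with threads, so the effect need not be symmetric. Below is a self-contained sketch using a tiny stand-in model rather than the repo's modified ResNet18; the layer sizes, random calibration data, and the fbgemm backend (x86 only) are all assumptions.

```python
import time
import torch
import torch.nn as nn
import torch.quantization as tq

class TinyConvNet(nn.Module):
    # Tiny stand-in model, only to illustrate the thread-scaling check.
    def __init__(self):
        super().__init__()
        self.quant = tq.QuantStub()
        self.conv = nn.Conv2d(3, 32, 3, padding=1)
        self.relu = nn.ReLU()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(32, 10)
        self.dequant = tq.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.pool(self.relu(self.conv(x)))
        x = self.fc(torch.flatten(x, 1))
        return self.dequant(x)

def time_model_ms(model, x, num_iters=200):
    # Simple wall-clock timing helper: warm up, then average per sample.
    model.eval()
    with torch.no_grad():
        for _ in range(20):
            _ = model(x)
        start = time.time()
        for _ in range(num_iters):
            _ = model(x)
    return (time.time() - start) / num_iters * 1000

fp32_model = TinyConvNet().eval()

# Minimal static quantization flow: attach a qconfig, calibrate, convert.
int8_model = TinyConvNet().eval()
int8_model.qconfig = tq.get_default_qconfig("fbgemm")  # assumes an x86 CPU
tq.prepare(int8_model, inplace=True)
with torch.no_grad():
    int8_model(torch.rand(8, 3, 32, 32))  # calibration pass with random data
tq.convert(int8_model, inplace=True)

x = torch.rand(1, 3, 32, 32)
for n in (1, 2, 4, 8):
    torch.set_num_threads(n)
    print(f"threads={n}  fp32={time_model_ms(fp32_model, x):.2f} ms  "
          f"int8={time_model_ms(int8_model, x):.2f} ms")
```

If the two columns diverge as the thread count grows, that would support the multithreading explanation: the FP32 and quantized kernels simply parallelize differently on this hardware.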