open-mmlab / mmdeploy

OpenMMLab Model Deployment Framework
https://mmdeploy.readthedocs.io/en/latest/
Apache License 2.0

When configuring the TensorRT backend for int8, both int8 and fp16 are enabled #52

Closed tehkillerbee closed 2 years ago

tehkillerbee commented 2 years ago

Hello,

While looking through the code for deploying a model using TensorRT, I noticed that in the base config for tensorrt-int8, we set both FP16 and INT8 to true. Is this intentional, or is it just a typo? I am not sure what happens when both BuilderFlags are enabled, as I have never tested it myself.

https://github.com/open-mmlab/mmdeploy/blob/03ae26c91cd088806f9ab22eae4a05650404e062/configs/_base_/backends/tensorrt-int8.py#L3
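For context, the relevant part of that base config looks roughly like this (paraphrased from the linked file; field names and the workspace size may differ between mmdeploy versions):

```python
# configs/_base_/backends/tensorrt-int8.py (paraphrased, not verbatim)
backend_config = dict(
    type='tensorrt',
    common_config=dict(
        fp16_mode=True,               # maps to the FP16 builder flag
        int8_mode=True,               # maps to the INT8 builder flag
        max_workspace_size=1 << 30))  # workspace size may vary by version
```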

grimoire commented 2 years ago

It is OK to enable both of them. TensorRT will choose a higher-precision kernel if it results in overall lower runtime, or if no low-precision implementation exists. Read this for more details.
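If you want to see what this amounts to with the raw TensorRT Python API, here is a minimal sketch (network construction and the INT8 calibrator are omitted for brevity):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()

# Enable both precisions. The kernel autotuner is then free to pick an
# int8, fp16, or fp32 implementation per layer, whichever runs fastest.
config.set_flag(trt.BuilderFlag.FP16)
config.set_flag(trt.BuilderFlag.INT8)
# config.int8_calibrator = my_calibrator  # an INT8 calibrator is still required
```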

tehkillerbee commented 2 years ago

@grimoire That makes sense. However, NVIDIA states "...There are three precision flags: FP16, INT8, and TF32, and they may be enabled independently...", so are you really supposed to enable more than one at the same time? Well, if it works, I guess it is fine. In any case, I will test this further and see what happens on my Jetson AGX Xavier.

Slightly off-topic. The link you sent states that "...TensorRT will still choose a higher-precision kernel if it results in overall lower runtime...". However, if I enable FP16 on a GPU architecture that does not support it natively (e.g. my Quadro P2000), a warning is given:

[TRT] [W] Half2 support requested on hardware without native FP16 support, performance will be negatively affected.

Naturally, the default FP32 would have been the fastest, but FP16 is still used instead. So does TensorRT actually pick the fastest kernel in this case?
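As a possible workaround (just a sketch, I have not tested it), one could query the builder for native support before requesting the flags:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()

# Only request reduced precision when the GPU has a native fast path for it,
# which should avoid the "Half2 support requested on hardware without native
# FP16 support" warning on cards like the Quadro P2000.
if builder.platform_has_fast_fp16:
    config.set_flag(trt.BuilderFlag.FP16)
if builder.platform_has_fast_int8:
    config.set_flag(trt.BuilderFlag.INT8)
```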

grimoire commented 2 years ago

In my experiments, enabling both flags is slightly faster than enabling int8 or fp16 alone, so I guess TensorRT does some precision-related optimization.

And I guess TensorRT falls back to fp16 when a layer does not support int8 (regardless of the device). That would explain why performance is poor on devices without native fp16 support.

tehkillerbee commented 2 years ago

@grimoire I see, that is good to know. In that case, I think we can close this issue.