Closed: tehkillerbee closed this issue 2 years ago
It is OK to enable both flags. TensorRT will choose a higher-precision kernel if it results in overall lower runtime, or if no low-precision implementation exists. Read this for more details.
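For concreteness, here is a minimal sketch of what enabling both flags looks like with the TensorRT Python API. The toy single-layer network and the hard-coded dynamic ranges are placeholders of my own, just so the example builds without a calibration dataset; a real deployment would use a parsed model and an INT8 calibrator instead:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))

# Toy single-layer network, just to have something to build.
x = network.add_input("x", trt.float32, (1, 3, 224, 224))
relu = network.add_activation(x, trt.ActivationType.RELU)
network.mark_output(relu.get_output(0))

config = builder.create_builder_config()
# Both flags can be set at the same time; TensorRT then chooses, per layer,
# whichever available kernel (INT8, FP16, or FP32) is fastest overall.
config.set_flag(trt.BuilderFlag.FP16)
config.set_flag(trt.BuilderFlag.INT8)

# INT8 needs a calibrator or explicit dynamic ranges; placeholder ranges
# are set here so the sketch builds without calibration data.
x.dynamic_range = (-1.0, 1.0)
relu.get_output(0).dynamic_range = (-1.0, 1.0)

serialized_engine = builder.build_serialized_network(network, config)
```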
@grimoire That makes sense. However, NVIDIA states "..There are three precision flags: FP16, INT8, and TF32, and they may be enabled independently..", so are you really supposed to enable more than one at the same time? Well, if it works, I guess it is fine. In any case, I will test this further and see what happens on my Jetson AGX Xavier.
Slightly off-topic: the link you sent states that "..TensorRT will still choose a higher-precision kernel if it results in overall lower runtime..." However, if I enable FP16 on a GPU architecture that does not natively support it (e.g. my Quadro P2000), a warning is given:
[TRT] [W] Half2 support requested on hardware without native FP16 support, performance will be negatively affected.
Naturally, the default FP32 would have been the fastest, yet FP16 is still used instead. So does TensorRT actually pick the fastest kernel in this case?
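One way to sidestep that warning is to only request reduced precision when the builder reports native hardware support. A small sketch using the `platform_has_fast_fp16` / `platform_has_fast_int8` queries from the TensorRT Python API:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()

# Only request reduced precision where the GPU supports it natively,
# which avoids the "Half2 support requested..." warning on cards
# like the Quadro P2000.
if builder.platform_has_fast_fp16:
    config.set_flag(trt.BuilderFlag.FP16)
if builder.platform_has_fast_int8:
    config.set_flag(trt.BuilderFlag.INT8)
```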
According to my experiment, enabling both flags is slightly faster than enabling INT8 or FP16 alone, so I guess TensorRT does perform some per-layer optimization across precisions.
And I guess TensorRT falls back to FP16 when a layer has no INT8 implementation (regardless of the device). That would explain the poor performance on devices without native FP16 support.
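If you want to verify which precision TensorRT actually chose per layer, the engine inspector can report it. A sketch assuming TensorRT >= 8.2 (where `create_engine_inspector` is available) and a `serialized_engine` produced by an earlier build:

```python
import tensorrt as trt

# `serialized_engine` is assumed to be the output of
# builder.build_serialized_network(...); for per-layer details the engine
# must have been built with
# config.profiling_verbosity = trt.ProfilingVerbosity.DETAILED.
runtime = trt.Runtime(trt.Logger(trt.Logger.WARNING))
engine = runtime.deserialize_cuda_engine(serialized_engine)

# The inspector reports, per layer, which kernel/precision was chosen,
# so you can see where INT8 fell back to FP16 or FP32.
inspector = engine.create_engine_inspector()
print(inspector.get_engine_information(trt.LayerInformationFormat.JSON))
```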
@grimoire I see, that is good to know. In that case, I think we can close this issue.
Hello,
While looking through the code for deploying a model with TensorRT, I noticed that the base config for tensorrt-int8 sets both FP16 and INT8 to True. Is this intentional? I figured it might just be a typo. I am not sure what happens when both builder flags are enabled, as I have never tested it myself.
https://github.com/open-mmlab/mmdeploy/blob/03ae26c91cd088806f9ab22eae4a05650404e062/configs/_base_/backends/tensorrt-int8.py#L3
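For reference, the linked base config boils down to roughly the following (paraphrased from memory of the file; see the link above for the exact contents):

```python
# configs/_base_/backends/tensorrt-int8.py (paraphrased, may differ slightly)
_base_ = ['./tensorrt.py']

backend_config = dict(
    common_config=dict(fp16_mode=True, int8_mode=True))
```

So both modes are enabled on purpose, which matches the behaviour discussed in the replies above.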