triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html

Are FP8 models supported in Triton? #7678

Open jayakommuru opened 1 month ago

jayakommuru commented 1 month ago

We have an encoder-based model that is currently deployed in FP16 in production, and we want to reduce the latency further.

Does Triton support FP8? In the datatypes documentation here: https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/model_configuration.html#datatypes I don't see FP8 listed among the datatypes.

We are using the trtexec CLI to convert ONNX to a TRT engine file. I see an option --fp8 to generate FP8 serialized engine files. Can anyone confirm whether we can deploy FP8 models in Triton?
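
For reference, a rough sketch of the conversion command we are considering (model.onnx and model.plan are placeholder paths; keeping --fp16 alongside --fp8 is an assumption so that layers without FP8 support can fall back to FP16):

```
# Sketch only: build a TensorRT engine with FP8 enabled from an ONNX model.
# model.onnx / model.plan are placeholder paths.
trtexec --onnx=model.onnx \
        --fp8 --fp16 \
        --saveEngine=model.plan
```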

jayakommuru commented 1 month ago

@oandreeva-nv can you help with this ^^ ?

oandreeva-nv commented 1 month ago

Hi @jayakommuru, let me verify it. I'll get back to you.

oandreeva-nv commented 1 month ago

The TRT backend does not support FP8 I/O for the TRT engine. However, weights and internal tensors can be FP8.
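
As a minimal sketch (the model name, tensor names, and dims below are placeholders, not taken from your model), the model configuration for such an engine would keep the I/O tensors in FP16 while the serialized engine itself carries the FP8 weights:

```
# Hypothetical config.pbtxt for a TensorRT engine with FP8 weights/internal tensors.
# I/O tensors stay FP16 because the TRT backend does not expose FP8 as an I/O datatype.
name: "encoder_fp8"          # placeholder model name
platform: "tensorrt_plan"
max_batch_size: 8
input [
  {
    name: "INPUT__0"         # placeholder tensor name
    data_type: TYPE_FP16
    dims: [ -1, 768 ]        # placeholder dims
  }
]
output [
  {
    name: "OUTPUT__0"        # placeholder tensor name
    data_type: TYPE_FP16
    dims: [ -1, 768 ]        # placeholder dims
  }
]
```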

jayakommuru commented 1 month ago

@oandreeva-nv OK. Can there be any throughput/performance benefit from running an FP8 TRT engine with FP16 I/O? Which Triton datatype should be used with an FP8 TRT engine in the TRT backend?

jayakommuru commented 1 month ago

@oandreeva-nv can you confirm whether using FP16 Triton I/O datatypes with an FP8 TRT engine gives any benefit? Thanks

oandreeva-nv commented 1 month ago

Hi @jayakommuru, we have a perf_analyzer tool that can help you analyze the performance of your model.
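
For example, something along these lines (the model names and the concurrency range are placeholders) would let you compare your FP16 baseline against the FP8 engine under the same load:

```
# Sketch: measure latency/throughput for both deployed variants and compare.
# "encoder_fp16" and "encoder_fp8" are placeholder model names.
perf_analyzer -m encoder_fp16 --concurrency-range 1:8 --percentile=95
perf_analyzer -m encoder_fp8  --concurrency-range 1:8 --percentile=95
```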

jayakommuru commented 1 month ago

@oandreeva-nv Sure, will explore perf_analyzer. Any idea whether to use the FP32 or FP16 Triton I/O datatype for TensorRT FP8 models?