zw0610 closed this issue 4 years ago
This issue may be related: https://github.com/NVIDIA/TensorRT/issues/65
/cc @yyyt1994
You need to file this enhancement request against TensorRT. Triton just uses the APIs provided by TensorRT to load and execute the models. If TensorRT implements the functionality that you are requesting then Triton will take advantage of it.
Closing. Please link the related TensorRT issue here when you file it.
I found out that this is something TensorRT has already implemented as 'Refit': https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#refitting-engine-c
@deadeyegoodwin would you mind taking a look at the TensorRT refit documentation and reopening this issue?
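For reference, the refit flow in the TensorRT Python API looks roughly like the sketch below. This is not tested against a specific TensorRT version; the engine must have been built with `trt.BuilderFlag.REFIT`, and the checkpoint-name-to-layer mapping in `role_for` is an assumption about one possible naming convention, since real layer names depend on how the network was constructed.

```python
# Sketch of refitting a TensorRT engine with new weights instead of rebuilding.
# Assumes: a TensorRT install, a GPU, and an engine built with the REFIT flag.
# The name-to-(layer, role) mapping below is hypothetical.

def role_for(param_name):
    """Map a checkpoint parameter name to (layer name, WeightsRole name).

    Pure helper so the mapping can be reasoned about separately:
    '<layer>.weight' -> ('<layer>', 'KERNEL'),
    '<layer>.bias'   -> ('<layer>', 'BIAS').
    """
    layer, _, kind = param_name.rpartition(".")
    return layer, {"weight": "KERNEL", "bias": "BIAS"}[kind]


def refit_engine(engine, new_weights):
    """Refit `engine` in place from `new_weights` ({param_name: numpy array})."""
    import tensorrt as trt  # imported lazily; requires TensorRT

    logger = trt.Logger(trt.Logger.WARNING)
    refitter = trt.Refitter(engine, logger)
    for name, array in new_weights.items():
        layer, role = role_for(name)
        refitter.set_weights(layer, getattr(trt.WeightsRole, role), array)
    # Every weight marked refittable must be supplied before this succeeds.
    return refitter.refit_cuda_engine()
```

Refitting skips the builder's kernel-selection and optimization passes entirely, which is exactly the redundant cost described in this issue.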
Is your feature request related to a problem? Please describe.
We update model versions quite frequently, so separating the weights and the graph from the plan file would save a lot of time on building the engine (exporting the plan file). Every time a new version is generated, building the engine takes a considerable amount of time because many optimization passes are applied to it. This work is redundant, since the new version does not change the graph at all; it only updates the weights of the model.
Describe the solution you'd like
As far as I understand, the engine-building phase takes most of the time because it searches for and applies optimizations to the graph. With weights separated from the graph as described above, we could skip that phase when only the weights change, and serve the updated model with Triton Inference Server.
Describe alternatives you've considered
If the bottleneck when updating model versions really is the engine-building phase, then dumping the weights directly, without rebuilding, seems necessary. For Triton Inference Server, we could simply extend the Plan file format to a Plan+Wts format: if Triton observes both a plan file and a wts file, it reads the weights from the wts file; otherwise it reads the weights from the plan file as it does today.
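The plan-or-plan+wts dispatch could be as simple as the sketch below. The `model.plan` and `model.wts` file names and the dispatch function itself are assumptions for illustration, not existing Triton behavior.

```python
import os

# Hypothetical dispatch for the proposed Plan+Wts layout: if a sidecar wts
# file sits next to the plan file in a model version directory, refit the
# deserialized engine with its weights; otherwise use the weights embedded
# in the plan file, as Triton does today.

def weight_source(version_dir):
    """Return 'wts' if a sidecar weights file should be used, else 'plan'."""
    plan = os.path.join(version_dir, "model.plan")
    wts = os.path.join(version_dir, "model.wts")
    if not os.path.exists(plan):
        raise FileNotFoundError(plan)
    return "wts" if os.path.exists(wts) else "plan"
```

Updating a version would then mean overwriting only the small wts file, leaving the expensive-to-build plan file untouched.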
Additional context
We are doing reinforcement learning, which involves frequent updates to model versions.