triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Separate weights from plan file for TensorRT backend #2036

Closed: zw0610 closed this issue 4 years ago

zw0610 commented 4 years ago

Is your feature request related to a problem? Please describe.

As we update model versions quite frequently, separating the weights and the graph from the plan file would save a lot of time on engine building (exporting the plan file). Every time a new version is generated, we have to spend a considerable amount of time building the engine, because many optimization tricks are applied during the build. This work is redundant: the new version does not change the graph at all, it only updates the weights of the model.

Describe the solution you'd like

This solution may require changes on both the Triton and TensorRT sides.

  • On the Triton side, for the TensorRT backend, instead of supporting only the single-plan-file format, we suggest an additional multi-file format: one file for the optimized engine, one file for the weights, and possibly other files for additional information (a hypothetical layout sketch follows this list).
  • On the TensorRT side, after the first version, for which we still need to build the engine from both the graph and the weights, we could simply dump the weights to a wts file and replace that file in the repository that the Triton server watches.
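
For illustration only, such a multi-file version directory in the Triton model repository might look like this (file names are hypothetical; today the TensorRT backend expects only a single plan file):

```
my_model/
  config.pbtxt
  1/
    model.plan   # optimized engine (graph + tactic selections)
    model.wts    # weights only; replaced for each new version (hypothetical)
```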

As far as I understand, the engine-building phase takes most of the time because TensorRT searches for and applies optimizations to the graph. With the approach described above, we could save most of that time when only the weights change, and still serve the model with Triton Inference Server.
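
To make the cost concrete, here is a minimal sketch of a typical engine build with the TensorRT Python API (file names are placeholders, and exact API names may differ across TensorRT versions). The tactic search inside `build_serialized_network()` is the expensive step that we would like to avoid repeating for weight-only updates:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

# Parse the model graph + weights (placeholder file name).
with open("model.onnx", "rb") as f:
    parser.parse(f.read())

config = builder.create_builder_config()
# This is the slow part: TensorRT profiles kernels and searches for the
# best tactics for the target GPU. Today it is repeated on every version
# bump even when only the weights have changed.
plan = builder.build_serialized_network(network, config)

with open("model.plan", "wb") as f:
    f.write(plan)
```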

Describe alternatives you've considered

If the bottleneck when updating model versions really does lie in the engine-building phase, dumping the weights directly without rebuilding seems necessary. For Triton Inference Server, we could simply extend the plan-file format to a plan-plus-wts format: if Triton observes both a plan file and a wts file, it reads the weights from the wts file; otherwise it reads the weights from the plan file as before.

Additional context

We are doing reinforcement learning, which involves very frequent updates to model versions.

zw0610 commented 4 years ago

This issue may be related: https://github.com/NVIDIA/TensorRT/issues/65

zw0610 commented 4 years ago

/cc @yyyt1994

deadeyegoodwin commented 4 years ago

You need to file this enhancement request against TensorRT. Triton just uses the APIs provided by TensorRT to load and execute models. If TensorRT implements the functionality you are requesting, Triton will take advantage of it.

deadeyegoodwin commented 4 years ago

Closing. Please link the related TensorRT issue here when you file it.

zw0610 commented 4 years ago

I found out that this is something TensorRT has already implemented as 'Refit': https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#refitting-engine-c

@deadeyegoodwin would you mind taking a look at the document of TensorRT refitting and reopening this issue?
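
For reference, a rough sketch of the workflow we have in mind using the documented refit API (Python shown for brevity; `load_new_weights()` is a hypothetical helper that maps layer names to new weight arrays, and the engine must have been built with the REFIT flag):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)

# When building the first version, mark the engine as refittable:
#   config.set_flag(trt.BuilderFlag.REFIT)
# (see the build sketch earlier in this issue).

# For every later version, skip the rebuild: deserialize the existing
# engine and swap in the updated weights.
runtime = trt.Runtime(logger)
with open("model.plan", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

refitter = trt.Refitter(engine, logger)
# load_new_weights() is a hypothetical helper returning
# {layer_name: numpy array}; numpy arrays convert to TensorRT weights.
for layer_name, kernel in load_new_weights("model.wts").items():
    refitter.set_weights(layer_name, trt.WeightsRole.KERNEL, kernel)

# refit_cuda_engine() returns False if any required weights are still
# missing (get_missing() lists them); on success the engine runs with the
# new weights without repeating the optimization search.
assert refitter.refit_cuda_engine()
```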