triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Variable-length, row-based, TF Example and TF ExampleListWithContext support #1660

Closed jeisinge closed 3 years ago

jeisinge commented 4 years ago

Is your feature request related to a problem? Please describe.
Our inference requests are a bit different from traditional image inference requests. In particular,

  1. Our examples have text features of variable length.
  2. We infer on hundreds of examples with each example sharing a large number of the same features.

This leads to a couple of complexities.

  1. Due to the variable length of our text features, we would have to add extra padding and pre-processing logic to our inference requests.
  2. By duplicating the common features for every example, on the client and server, we use more memory, use more bandwidth, and spend more time parsing.

Describe the solution you'd like
TF-Serving provides this type of support in the form of a row-based input. This comes in two forms:

For Estimator SavedModels, TensorFlow allows exporting with a parsing serving input receiver that takes serialized Example protos as input. This allows requests to carry row-based data. Because it is row-based, the rows can be of variable length, so text fields of different sizes can use this format efficiently in terms of both bandwidth and parsing.
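As a rough sketch of what we mean (the feature names and shapes below are invented for illustration, not our real model), exporting an Estimator with a parsing serving input receiver looks roughly like this:

```python
import tensorflow as tf

# Hypothetical feature spec: one variable-length text feature plus one
# fixed-size numeric feature. Names/shapes are illustrative only.
feature_spec = {
    "query_tokens": tf.io.VarLenFeature(tf.string),
    "doc_length": tf.io.FixedLenFeature([1], tf.int64),
}

# The parsing receiver accepts serialized tf.Example protos and parses them
# server-side, so clients can send row-based, variable-length data.
serving_input_receiver_fn = (
    tf.estimator.export.build_parsing_serving_input_receiver_fn(feature_spec)
)

# `estimator` is assumed to be an already-trained tf.estimator.Estimator.
# estimator.export_saved_model("/tmp/export", serving_input_receiver_fn)
```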

Further, TensorFlow Serving allows for providing common features via ExampleListWithContext. This splits the input into a context and a list of items (examples). On the server, the context is broadcast/repeated to each item before inference is run.
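For reference, a TF Serving classify request with shared context features looks something like the following sketch (the model name, URL, and feature names are placeholders):

```python
import requests

# Shared features go in "context"; per-item features go in "examples".
# TF Serving broadcasts the context to every example before inference.
request_body = {
    "signature_name": "serving_default",
    "context": {"user_id": ["u123"], "session_score": [0.7]},
    "examples": [
        {"query_tokens": ["red", "shoes"]},
        {"query_tokens": ["running", "shoes", "sale"]},
    ],
}

# Hypothetical endpoint; the :classify REST API accepts context + examples.
resp = requests.post(
    "http://localhost:8501/v1/models/ranker:classify", json=request_body
)
print(resp.json())
```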

Describe alternatives you've considered
For the variable-length text tensors, since tensor inputs must be rectangular, we would have to choose the largest text size for this tensor and pad every example up to it, as sketched below. This would increase the memory and bandwidth cost for both the client and server, and we would have to add extra pre-processing code to handle the padded input.
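To make the trade-off concrete, the client-side padding we would need looks roughly like this (purely illustrative; the pad token and data are made up):

```python
# Pad variable-length token lists to a rectangular batch on the client.
PAD = ""

def pad_batch(token_lists):
    max_len = max(len(tokens) for tokens in token_lists)
    return [tokens + [PAD] * (max_len - len(tokens)) for tokens in token_lists]

batch = pad_batch([["red", "shoes"], ["running", "shoes", "on", "sale"]])
# Every row is now length 4; the server must strip or ignore the padding.
```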

Additional context
If there are alternative solutions, please enlighten us!

mcf330 commented 3 years ago

Same issue for me. Can anyone provide a workaround?

dzier commented 3 years ago

More recent versions of Triton support variable-length tensors. Please see https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md#inputs-and-outputs for more information.
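For anyone landing here: if an input is declared with a variable dimension (dims: [-1]) in the model configuration, the Python client can send a different concrete shape on each request. A minimal sketch, assuming placeholder model, input, and output names:

```python
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# "TEXT" is assumed to be declared with a variable dimension (dims: [-1])
# in the model's config.pbtxt; the concrete length is set per request.
tokens = np.array([["red", "shoes", "on", "sale"]], dtype=object)
text_input = httpclient.InferInput("TEXT", list(tokens.shape), "BYTES")
text_input.set_data_from_numpy(tokens)

# "my_model" and the "SCORES" output name are placeholders.
result = client.infer(model_name="my_model", inputs=[text_input])
print(result.as_numpy("SCORES"))
```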