triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Response caching GPU tensors #7140

Open rahchuenmonroe opened 5 months ago

rahchuenmonroe commented 5 months ago

According to your docs, only input tensors located in CPU memory will be hashable for accessing the cache. And only responses with all output tensors located in CPU memory will be eligible for caching.

Does this mean that if a model runs on GPU, its responses cannot be cached since the output tensors are on GPU? If that's the case, I think it would be great if tensors located on GPU could also be cached, since a lot of the models running on Triton run on GPU.

rmccorm4 commented 5 months ago

Hi @rahchuenmonroe,

This applies to input/output tensors within Triton core, before and after model execution in the backend. If you are communicating with Triton over the network (HTTP/gRPC), then all request and response tensors will be on CPU when going through Triton by default.

So, long story short: if you're talking to Triton over the network without using shared memory (and therefore passing tensors through CPU memory), you can likely cache the responses even if they come from a model running on GPU. This covers the large majority of use cases. A sketch of that path is below.
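For illustration, here is a minimal sketch of that common path. It assumes the server was started with a response cache enabled (e.g. `tritonserver --cache-config local,size=1048576`) and that the model's config.pbtxt contains `response_cache { enable: true }`; the model name "my_gpu_model" and the tensor names "INPUT0"/"OUTPUT0" are placeholders, not a real model.

```python
# Sketch: plain HTTP inference with CPU tensors, which is cache-eligible
# even if the model itself executes on GPU. Assumes a hypothetical model
# "my_gpu_model" with one FP32 input "INPUT0" and one output "OUTPUT0",
# and a server started with the response cache enabled.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Input data is an ordinary NumPy (CPU) array sent over the network.
input_data = np.random.rand(1, 16).astype(np.float32)
infer_input = httpclient.InferInput("INPUT0", list(input_data.shape), "FP32")
infer_input.set_data_from_numpy(input_data)

# Send the same request twice; the second identical request can be served
# from the response cache because both inputs and outputs pass through CPU
# memory inside Triton core.
for _ in range(2):
    result = client.infer(model_name="my_gpu_model", inputs=[infer_input])
    print(result.as_numpy("OUTPUT0")[:5])
```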

If you are using Triton in-process or using CUDA shared memory and passing Triton tensors that are already on GPU, then caching of those tensors is not currently supported.
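By contrast, here is a rough sketch of a request whose output is written directly into a registered CUDA shared memory region (again with placeholder model, tensor, and region names, and a placeholder byte size). Because the response tensor stays in GPU memory inside Triton, it is not eligible for the response cache.

```python
# Sketch: output delivered into CUDA shared memory (GPU), which is NOT
# cache-eligible. Model/tensor/region names and sizes are placeholders.
import numpy as np
import tritonclient.http as httpclient
import tritonclient.utils.cuda_shared_memory as cudashm

client = httpclient.InferenceServerClient(url="localhost:8000")

byte_size = 4 * 16  # 16 FP32 elements; must match the real output size

# Create a CUDA shared memory region on GPU 0 and register it with the server.
shm_handle = cudashm.create_shared_memory_region("output_region", byte_size, 0)
client.register_cuda_shared_memory(
    "output_region", cudashm.get_raw_handle(shm_handle), 0, byte_size
)

# Ordinary CPU input, but the output is requested into the GPU region.
input_data = np.random.rand(1, 16).astype(np.float32)
infer_input = httpclient.InferInput("INPUT0", list(input_data.shape), "FP32")
infer_input.set_data_from_numpy(input_data)

requested_output = httpclient.InferRequestedOutput("OUTPUT0", binary_data=True)
requested_output.set_shared_memory("output_region", byte_size)

# The response tensor lives in CUDA shared memory (GPU memory), so this
# response is not eligible for the response cache.
result = client.infer(
    model_name="my_gpu_model", inputs=[infer_input], outputs=[requested_output]
)
output = cudashm.get_contents_as_numpy(shm_handle, np.float32, [1, 16])
print(output)

# Clean up the registration and the shared memory region.
client.unregister_cuda_shared_memory("output_region")
cudashm.destroy_shared_memory_region(shm_handle)
```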