triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Response caching GPU tensors #7140

Open rahchuenmonroe opened 5 months ago

rahchuenmonroe commented 5 months ago

According to your docs, only input tensors located in CPU memory will be hashable for accessing the cache. And only responses with all output tensors located in CPU memory will be eligible for caching.

Does this mean that if a model runs on GPU, its responses cannot be cached since the output tensors are on GPU? If that's the case, I think it would be great if tensors located on GPU could also be cached, since a lot of the models running on Triton run on GPU.

rmccorm4 commented 5 months ago

Hi @rahchuenmonroe,

This applies to input/output tensors within Triton core, before and after model execution in the backend. If you are communicating with Triton over the network (HTTP/gRPC), then all request and response tensors will be on CPU when going through Triton by default.

So, long story short: if you're talking to Triton over the network without using shared memory (and therefore passing tensors through CPU memory), you can likely cache the responses even if they come from a model running on GPU. This covers the large majority of use cases. A sketch of that path is below.
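For illustration, here is a minimal sketch of that common path. It assumes the server was started with a response cache enabled (e.g. `tritonserver --cache-config local,size=1048576`) and that the model's config.pbtxt contains `response_cache { enable: true }`; the model name "my_gpu_model" and the tensor names "INPUT0"/"OUTPUT0" are placeholders, not a real model.

```python
# Sketch: plain HTTP inference with CPU tensors, which is cache-eligible
# even if the model itself executes on GPU. Assumes a hypothetical model
# "my_gpu_model" with one FP32 input "INPUT0" and one output "OUTPUT0",
# and a server started with the response cache enabled.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Input data is an ordinary NumPy (CPU) array sent over the network.
input_data = np.random.rand(1, 16).astype(np.float32)
infer_input = httpclient.InferInput("INPUT0", list(input_data.shape), "FP32")
infer_input.set_data_from_numpy(input_data)

# Send the same request twice; the second identical request can be served
# from the response cache because both inputs and outputs pass through CPU
# memory inside Triton core.
for _ in range(2):
    result = client.infer(model_name="my_gpu_model", inputs=[infer_input])
    print(result.as_numpy("OUTPUT0")[:5])
```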

If you are using Triton in-process or using CUDA shared memory and passing Triton tensors that are already on GPU, then caching of those tensors is not currently supported.
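By contrast, here is a rough sketch of a request whose output is written directly into a registered CUDA shared memory region (again with placeholder model, tensor, and region names, and a placeholder byte size). Because the response tensor stays in GPU memory inside Triton, it is not eligible for the response cache.

```python
# Sketch: output delivered into CUDA shared memory (GPU), which is NOT
# cache-eligible. Model/tensor/region names and sizes are placeholders.
import numpy as np
import tritonclient.http as httpclient
import tritonclient.utils.cuda_shared_memory as cudashm

client = httpclient.InferenceServerClient(url="localhost:8000")

byte_size = 4 * 16  # 16 FP32 elements; must match the real output size

# Create a CUDA shared memory region on GPU 0 and register it with the server.
shm_handle = cudashm.create_shared_memory_region("output_region", byte_size, 0)
client.register_cuda_shared_memory(
    "output_region", cudashm.get_raw_handle(shm_handle), 0, byte_size
)

# Ordinary CPU input, but the output is requested into the GPU region.
input_data = np.random.rand(1, 16).astype(np.float32)
infer_input = httpclient.InferInput("INPUT0", list(input_data.shape), "FP32")
infer_input.set_data_from_numpy(input_data)

requested_output = httpclient.InferRequestedOutput("OUTPUT0", binary_data=True)
requested_output.set_shared_memory("output_region", byte_size)

# The response tensor lives in CUDA shared memory (GPU memory), so this
# response is not eligible for the response cache.
result = client.infer(
    model_name="my_gpu_model", inputs=[infer_input], outputs=[requested_output]
)
output = cudashm.get_contents_as_numpy(shm_handle, np.float32, [1, 16])
print(output)

# Clean up the registration and the shared memory region.
client.unregister_cuda_shared_memory("output_region")
cudashm.destroy_shared_memory_region(shm_handle)
```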