triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Can ensemble models cache? #6643

Open · haiminh2001 opened this issue 11 months ago

haiminh2001 commented 11 months ago

Description
Caching is not working with ensemble models.

Triton Information
23.07

Are you using the Triton container or did you build it yourself? Triton container

To Reproduce Steps to reproduce the behavior.

I enabled the response cache for an ensemble model and repeated the same request multiple times, but no cache lookups were performed.

I expect ensembles to support caching: if an ensemble contains, say, 10 models and each model has its own cache, then every request triggers 10 cache lookups, whereas if the ensemble itself were cached, a single lookup would be enough.
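For reference, this is roughly how the cache was enabled (a sketch; the repository path and cache size below are placeholders, and the ensemble's scheduling and I/O sections are omitted):

```
# config.pbtxt of the ensemble (ensemble_scheduling and inputs/outputs omitted)
platform: "ensemble"
response_cache {
  enable: true
}
```

with the server started with a cache implementation configured, e.g.:

```
tritonserver --model-repository=/models --cache-config local,size=1048576
```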

kthui commented 11 months ago

@rmccorm4 do you know if this is expected, i.e. that ensemble models do not support caching?

rmccorm4 commented 11 months ago

@kthui is correct. Top-level requests to ensembles do not currently support caching, but the composing models within the ensemble may be cached individually if supported by that model. I added a note to the docs to clarify this: https://github.com/triton-inference-server/server/pull/6648.
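As a rough sketch of the interim workaround (the composing model named here is just illustrative), the cache can be enabled in a composing model's config.pbtxt rather than in the ensemble's:

```
# config.pbtxt of a composing model inside the ensemble (e.g. "model_a")
response_cache {
  enable: true
}
```

Provided the server was started with `--cache-config`, repeated identical requests should then register as cache hits for that composing model, which can be checked in its per-model statistics or cache metrics, even though the top-level ensemble request itself is not cached.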

We do have an open feature request to add caching support to ensembles; it just hasn't been prioritized yet.

ref: DLIS-4626