Describe the problem you're trying to solve
Build a proof of concept (PoC): a generic inference container that uses Triton as the inference engine and can download and use a ModelKit as efficiently as possible.
Describe the solution you'd like
Generic Container or Base Container:
The solution can be either a generic container driven by ModelKit metadata, or a base container that is custom-built for a specific ModelKit.
By default, the ModelKit's artifacts should not be baked into the container. Instead, they should be downloaded by the container's entrypoint or by an init container.
Model Download Options:
As an alternative, the model can be baked into an init container that copies it into place at startup.
Streaming Models:
Explore ways to stream models directly into GPU memory when using Triton.
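As a sketch of the default (non-baked) flow, the entrypoint below unpacks a ModelKit into a local directory and then hands off to Triton. This is illustrative only: the exact `kit unpack` flags and the `MODELKIT_REF` environment variable are assumptions and should be checked against the installed `kit` CLI.

```python
#!/usr/bin/env python3
"""Entrypoint sketch: fetch a ModelKit, then exec Triton.

Assumes the KitOps `kit` CLI is on PATH; the `-d` flag and the
MODELKIT_REF environment variable are illustrative assumptions.
"""
import os
import subprocess


def unpack_command(modelkit_ref: str, model_repo: str = "/models") -> list[str]:
    # `kit unpack <ref> -d <dir>` is assumed to extract the ModelKit's
    # artifacts into the directory Triton will use as its model repository.
    return ["kit", "unpack", modelkit_ref, "-d", model_repo]


def triton_command(model_repo: str = "/models") -> list[str]:
    # Launch Triton against the unpacked model repository.
    return ["tritonserver", f"--model-repository={model_repo}"]


if __name__ == "__main__":
    ref = os.environ["MODELKIT_REF"]  # e.g. "ghcr.io/org/modelkit:tag" (illustrative)
    subprocess.run(unpack_command(ref), check=True)
    # Replace this process with Triton so the container runs the server directly.
    os.execvp("tritonserver", triton_command())
```

The same `unpack` step could instead run in an init container that shares the `/models` volume with the Triton container.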
Describe alternatives you've considered
Baking Artifacts into the Container:
Baking the artifacts directly into the container was considered, but this approach lacks flexibility and leads to larger images.
External Model Storage:
Host the models in external storage and mount them at runtime. This adds operational complexity and potential latency.
On-Demand Model Fetching:
Fetch models on demand during inference requests. This could introduce latency on the first request.
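If on-demand fetching is explored despite the latency concern, Triton's explicit model-control mode (`--model-control-mode=explicit`) exposes an HTTP endpoint for loading a model after its files have been placed in the repository. A minimal sketch, assuming Triton's HTTP port is reachable at `base` and the model files were already fetched (e.g. from a ModelKit):

```python
"""Sketch: trigger an on-demand model load via Triton's repository API.

Requires Triton started with --model-control-mode=explicit; the base URL
and model name used here are illustrative.
"""
import urllib.request


def load_model_url(base: str, model: str) -> str:
    # Triton's HTTP/REST endpoint for loading a model in explicit mode.
    return f"{base}/v2/repository/models/{model}/load"


def load_model(base: str, model: str) -> None:
    # POST asks Triton to (re)load the named model from its repository;
    # the model's files must already be present on disk.
    req = urllib.request.Request(load_model_url(base, model), data=b"{}", method="POST")
    with urllib.request.urlopen(req) as resp:
        resp.read()
```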
Additional context
The goal is to achieve efficient and flexible model management within the inference container.
Consider potential performance implications of different model loading strategies, especially with respect to Triton's capabilities.
Ensure compatibility with existing KitOps and ModelKit infrastructure and suggest improvements.