Feature request

I suggest better educating developers on how to download and optimize the model at build time (in the container image or in a volume), so that the `text-generation-launcher` command starts serving as fast as possible.
Motivation
By default, when running TGI with Docker, the container downloads the model on the fly and spends a long time optimizing it.
The quicktour recommends using a local volume, which is great, but this isn't really compatible with autoscaled cloud environments, where container startup has to be as fast as possible.
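One possible direction (a sketch, not an official recipe) is to bake the weights into the image at build time. The base image tag, the example model ID, and the `text-generation-server download-weights` step below are assumptions to verify against the TGI version you use:

```dockerfile
# Sketch: pre-download model weights during `docker build` so that
# text-generation-launcher serves immediately at container start.
FROM ghcr.io/huggingface/text-generation-inference:1.4

# Cache location the launcher reads from at runtime
# (assumed to match the image's default volume layout)
ENV HUGGINGFACE_HUB_CACHE=/data

# Example model ID; fetches the weights into the image layer
RUN text-generation-server download-weights mistralai/Mistral-7B-Instruct-v0.2

# The base image's entrypoint is text-generation-launcher;
# pass the same model ID so nothing is downloaded at runtime
CMD ["--model-id", "mistralai/Mistral-7B-Instruct-v0.2"]
```

Building such an image is slow and the result is large, but autoscaled replicas would then skip the download (and any weight conversion) at startup, which is exactly the cost this issue is about.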
Your contribution
As I explore this area, I will share my findings in this issue.