huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

Need instructions for how to optimize for production serving (fast startup) #1636

Closed: steren closed this issue 5 months ago

steren commented 6 months ago

Feature request

I suggest better documenting how developers can download and optimize the model at build time (into the container image or a volume) so that `text-generation-launcher` starts serving as fast as possible.
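As one possible approach, the weights could be baked into the image at build time. A minimal sketch, assuming the `1.4` image tag and `mistralai/Mistral-7B-v0.1` as an illustrative model (both are placeholders, not from this discussion):

```Dockerfile
# Sketch: pre-download the weights at image build time so the container
# starts serving without first hitting the Hub. Model id and tag are
# illustrative placeholders.
FROM ghcr.io/huggingface/text-generation-inference:1.4

# Download the weights (converting to safetensors where needed) into the
# image's default cache directory (/data), using the same helper the
# launcher would otherwise invoke at startup.
RUN text-generation-server download-weights mistralai/Mistral-7B-v0.1

# The base image's entrypoint is text-generation-launcher; passing the
# same model id makes it reuse the cached weights instead of downloading.
CMD ["--model-id", "mistralai/Mistral-7B-v0.1"]
```

Note the trade-off: the resulting image carries multi-gigabyte weights, so whether this actually shortens cold starts depends on how quickly the platform can pull or stream large images.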

Motivation

By default, when running TGI with Docker, the container downloads the model on the fly and then spends a long time optimizing it. The quicktour recommends mounting a local volume, which is great, but that isn't really compatible with autoscaled cloud environments, where container startup has to be as fast as possible.
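For reference, the quicktour's volume-based invocation looks roughly like this (model id and image tag are illustrative):

```shell
model=mistralai/Mistral-7B-v0.1
volume=$PWD/data   # cached weights survive container restarts

docker run --gpus all --shm-size 1g -p 8080:80 \
    -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:1.4 \
    --model-id $model
```

This avoids re-downloading across restarts on the same host, but an autoscaler spinning up a fresh node starts with an empty volume, which is exactly the cold-start problem described above.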

Your contribution

As I explore this area, I will share my findings in this issue.

github-actions[bot] commented 5 months ago

This issue is stale because it has been open for 30 days with no activity. Remove the stale label or comment, or this will be closed in 5 days.