Feature request

I suggest better educating developers on how to download and optimize the model at build time (in the container image or in a volume), so that the `text-generation-launcher` command starts serving as fast as possible.
Motivation
By default, when running TGI with Docker, the container downloads the model on the fly and spends a long time optimizing it.
The quicktour recommends using a local volume, which is great, but this isn't really compatible with autoscaled cloud environments, where container startup has to be as fast as possible.
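One possible direction (a sketch, not an official recipe) is to bake the weights into the image at build time. The base image tag, the example model ID, and the `text-generation-server download-weights` step below are assumptions to verify against the TGI version you use:

```dockerfile
# Sketch: pre-download model weights during `docker build` so that
# text-generation-launcher serves immediately at container start.
FROM ghcr.io/huggingface/text-generation-inference:1.4

# Cache location the launcher reads from at runtime
# (assumed to match the image's default volume layout)
ENV HUGGINGFACE_HUB_CACHE=/data

# Example model ID; fetches the weights into the image layer
RUN text-generation-server download-weights mistralai/Mistral-7B-Instruct-v0.2

# The base image's entrypoint is text-generation-launcher;
# pass the same model ID so nothing is downloaded at runtime
CMD ["--model-id", "mistralai/Mistral-7B-Instruct-v0.2"]
```

Building such an image is slow and the result is large, but autoscaled replicas would then skip the download (and any weight conversion) at startup, which is exactly the cost this issue is about.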
Your contribution
As I explore this area, I will share my findings in this issue.