What does this PR do?
When launching a TGI instance with a non-neuron model as a parameter, the model needs to be exported from cached neuron artifacts during container startup.
Before this change, the export did not minimize CPU memory usage, which made this kind of "on-the-fly" export impossible on the smaller ml.inf2.xlarge instances.
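To illustrate the idea (this is a hedged, self-contained sketch, not the actual optimum-neuron export code, and all function and variable names are hypothetical): peak host memory during an export drops when checkpoint shards are materialized and converted one at a time instead of loading the whole checkpoint before converting.

```python
# Illustrative sketch of why streaming the export minimizes CPU memory.
# Plain dicts stand in for weight shards; "size" is the number of entries.

def export_eager(shards):
    """Load every shard into memory, then convert: peak = whole checkpoint."""
    loaded = [dict(shard) for shard in shards]      # all shards resident at once
    peak = sum(len(s) for s in loaded)
    converted = {k: v for s in loaded for k, v in s.items()}
    return converted, peak

def export_streaming(shards):
    """Load, convert, and free one shard at a time: peak = one shard."""
    converted = {}
    peak = 0
    for shard in shards:
        loaded = dict(shard)                        # only this shard resident
        peak = max(peak, len(loaded))
        converted.update(loaded)
        del loaded                                  # release before next shard
    return converted, peak

shards = [
    {"layer.0.w": 0, "layer.0.b": 1},
    {"layer.1.w": 2, "layer.1.b": 3},
]
full, eager_peak = export_eager(shards)
same, stream_peak = export_streaming(shards)
print(eager_peak, stream_peak)  # 4 2: same output, half the peak residency
```

Both paths produce the same converted weights; only the peak residency differs, which is what makes the export fit in the host memory of an ml.inf2.xlarge.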