What does this PR do?
When launching a TGI instance with a non-neuron model as a parameter, the model needs to be exported from cached neuron artifacts during container startup.
Before this change, the export did not minimize CPU memory usage, which made this kind of "on-the-fly" export impossible on the smaller ml.inf2.xlarge instances.
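To illustrate the idea (this is a hedged, self-contained sketch, not the actual optimum-neuron export code, and all function and variable names are hypothetical): peak host memory during an export drops when checkpoint shards are materialized and converted one at a time instead of loading the whole checkpoint before converting.

```python
# Illustrative sketch of why streaming the export minimizes CPU memory.
# Plain dicts stand in for weight shards; "size" is the number of entries.

def export_eager(shards):
    """Load every shard into memory, then convert: peak = whole checkpoint."""
    loaded = [dict(shard) for shard in shards]      # all shards resident at once
    peak = sum(len(s) for s in loaded)
    converted = {k: v for s in loaded for k, v in s.items()}
    return converted, peak

def export_streaming(shards):
    """Load, convert, and free one shard at a time: peak = one shard."""
    converted = {}
    peak = 0
    for shard in shards:
        loaded = dict(shard)                        # only this shard resident
        peak = max(peak, len(loaded))
        converted.update(loaded)
        del loaded                                  # release before next shard
    return converted, peak

shards = [
    {"layer.0.w": 0, "layer.0.b": 1},
    {"layer.1.w": 2, "layer.1.b": 3},
]
full, eager_peak = export_eager(shards)
same, stream_peak = export_streaming(shards)
print(eager_peak, stream_peak)  # 4 2: same output, half the peak residency
```

Both paths produce the same converted weights; only the peak residency differs, which is what makes the export fit in the host memory of an ml.inf2.xlarge.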