huggingface / optimum-neuron

Easy, fast and very cheap training and inference on AWS Trainium and Inferentia chips.
Apache License 2.0

TGI benchmark with llmperf #564

Closed dacorvo closed 2 months ago

dacorvo commented 2 months ago

What does this PR do?

This PR adds scripts to benchmark TGI deployments that run several TGI servers on the same host behind a load balancer, achieving data parallelism (a sketch of the idea follows).
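For context, the data-parallel setup can be pictured as one front-end port that round-robins requests over several independent TGI servers. The sketch below only illustrates that idea and is not the PR's code (the actual scripts may rely on a dedicated load balancer such as nginx); the ports and the stdlib proxy are assumptions.

```python
# Minimal round-robin load-balancer sketch (stdlib only): fan requests out
# over several TGI servers running on the same host. Ports and the route
# used by the client are assumptions for illustration.
import itertools
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical backends: one TGI server per data-parallel replica (DP3).
BACKENDS = itertools.cycle([
    "http://127.0.0.1:8081",
    "http://127.0.0.1:8082",
    "http://127.0.0.1:8083",
])

class RoundRobinProxy(BaseHTTPRequestHandler):
    def do_POST(self):
        # Pick the next backend and forward the request body unchanged.
        backend = next(BACKENDS)
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        req = urllib.request.Request(
            backend + self.path,
            data=body,
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        try:
            with urllib.request.urlopen(req) as resp:
                payload = resp.read()
                self.send_response(resp.status)
                self.send_header("Content-Type", "application/json")
                self.send_header("Content-Length", str(len(payload)))
                self.end_headers()
                self.wfile.write(payload)
        except Exception as exc:  # surface backend failures to the client
            self.send_error(502, str(exc))

if __name__ == "__main__":
    # Clients (e.g. llmperf) target port 8080; requests are spread over the replicas.
    ThreadingHTTPServer(("0.0.0.0", 8080), RoundRobinProxy).serve_forever()
```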

The test client is llmperf.
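As a usage illustration, a benchmark run against the load-balanced endpoint might look like the following. This is a hypothetical invocation, not the PR's script: the flag names reflect my reading of llmperf's token benchmark and should be checked against the llmperf README, and the endpoint URL, model id, and token counts are placeholders rather than the values used in the PR.

```python
# Hypothetical llmperf run pointed at the single load-balancer endpoint,
# so the measured throughput covers all data-parallel replicas together.
import os
import subprocess

env = dict(
    os.environ,
    # Assumed: TGI's OpenAI-compatible route exposed behind the load balancer.
    OPENAI_API_BASE="http://127.0.0.1:8080/v1",
    OPENAI_API_KEY="none",  # TGI does not validate the key
)

subprocess.run(
    [
        "python", "token_benchmark_ray.py",   # llmperf's token benchmark script
        "--model", "meta-llama/Llama-2-7b-chat-hf",  # example model id
        "--llm-api", "openai",
        "--mean-input-tokens", "550",
        "--mean-output-tokens", "150",
        "--num-concurrent-requests", "24",
        "--max-num-completed-requests", "200",
        "--results-dir", "results",
    ],
    env=env,
    check=True,
)
```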

It also includes results for Llama 7B and Mistral v2 deployed on an inf2.48xlarge in a DP3 TP8 configuration, i.e. three TGI server replicas, each sharding its model across 8 NeuronCores, which together use all 24 NeuronCores of the instance.