huggingface / optimum-neuron

Easy, fast and very cheap training and inference on AWS Trainium and Inferentia chips.
Apache License 2.0

[Inference] Fix inference latency issue when weights/neff are separated #584

Closed. JingyaHuang closed this 1 month ago.

JingyaHuang commented 2 months ago

What does this PR do?

As reported in #576, inference latency is heavily impacted when the weights and the NEFF are not inlined. This is because the weights are not automatically loaded onto the Neuron devices, and without that we suffer a huge host-device communication overhead.

This PR patches the issue by loading the weights onto the Neuron devices when the model is traced with separated weights.
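For context, here is a minimal sketch of that pattern with the `torch_neuronx` API (the toy model and shapes are illustrative; this is not the exact code in the PR):

```python
import torch
import torch_neuronx

# A toy stand-in for any torch.nn.Module.
model = torch.nn.Sequential(torch.nn.Linear(128, 128), torch.nn.ReLU())
example_inputs = torch.rand(1, 128)

# Trace with the weights kept separate from the NEFF (not inlined).
traced = torch_neuronx.trace(model, example_inputs, inline_weights_to_neff=False)

# Without this step the weights stay on the host, and every forward pass
# pays the host-to-device transfer cost. Moving the trace onto a Neuron
# device keeps the weights resident on-device.
torch_neuronx.move_trace_to_device(traced, 0)

output = traced(example_inputs)
```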

Caveat: the current data parallel API doesn't handle the case where the weights and the NEFF are not inlined, so we add the class WeightSeparatedDataParallel here as a temporary workaround. Support for this case is expected in Neuron SDK 2.20, and by then this class will be removed from Optimum Neuron. Note also that, in a few small quick experiments, non-inlined models still show about 1.5x the latency of inlined models. A usage sketch of the workaround follows below.
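A hypothetical usage sketch, assuming WeightSeparatedDataParallel mirrors the `torch_neuronx.DataParallel` interface (the class name comes from this PR, but the import path and the toy model are assumptions):

```python
import torch
import torch_neuronx
from optimum.neuron.utils import WeightSeparatedDataParallel  # import path is an assumption

model = torch.nn.Sequential(torch.nn.Linear(128, 128), torch.nn.ReLU())
example_inputs = torch.rand(1, 128)

# Trace with separated (non-inlined) weights, as above.
traced = torch_neuronx.trace(model, example_inputs, inline_weights_to_neff=False)

# Drop-in analogue of torch_neuronx.DataParallel for the weight-separated
# case: it loads the separated weights onto each NeuronCore before
# replicating the model and splitting the input batch along dim 0.
model_parallel = WeightSeparatedDataParallel(traced)
outputs = model_parallel(torch.rand(8, 128))
```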

Before submitting

HuggingFaceDocBuilderDev commented 2 months ago

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.