What does this PR do?
As reported in #576, inference latency is heavily impacted when the weights and the NEFF are not inlined: the weights are not automatically loaded onto the Neuron devices, so every inference call pays a large host-device communication overhead.
This PR patches that.
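For reference, here is a minimal sketch of how a weights/NEFF-separated model is produced; the checkpoint is arbitrary, and the `inline_weights_to_neff` argument name is an assumption that may differ across Optimum Neuron versions:

```python
# Sketch only: export a model with its weights kept separate from the NEFF.
# `inline_weights_to_neff` is assumed to be the relevant export argument.
from optimum.neuron import NeuronModelForSequenceClassification

model = NeuronModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english",  # any supported checkpoint
    export=True,
    inline_weights_to_neff=False,  # keep the weights out of the NEFF
    batch_size=1,
    sequence_length=128,
)
# Before this patch, calling such a model paid a host-device transfer
# cost on every forward pass because the weights stayed on the host.
```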
Caveat: the current data parallel API does not handle the case where the weights and the NEFF are not inlined. As a temporary workaround we use the class WeightSeparatedDataParallel; proper support will be included in Neuron SDK 2.20, and by that time this class will be removed from Optimum Neuron. Also note that, according to a few small, quick experiments, non-inlined models still show roughly 1.5x the latency of inlined models.
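For illustration, a hedged sketch of how the workaround class might be used; the import path and the assumption that it mirrors the torch_neuronx.DataParallel interface are mine, not guaranteed by this PR:

```python
# Sketch only: run a weights/NEFF-separated TorchScript module with the
# temporary WeightSeparatedDataParallel workaround. The import path below
# is an assumption; the class may live elsewhere in Optimum Neuron.
import torch
from optimum.neuron.utils import WeightSeparatedDataParallel

# A module traced with weights separated from the NEFF (hypothetical file name).
traced = torch.jit.load("model.neuron")

# Used as a drop-in replacement for torch_neuronx.DataParallel until
# Neuron SDK 2.20 handles non-inlined weights natively.
dp_model = WeightSeparatedDataParallel(traced)

dummy_input = torch.zeros((1, 128), dtype=torch.long)
outputs = dp_model(dummy_input)
```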
Before submitting
[ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
[ ] Did you make sure to update the documentation with your changes?
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.