huggingface / optimum-neuron

Easy, fast and very cheap training and inference on AWS Trainium and Inferentia chips.
Apache License 2.0

[Inference] Fix inference latency issue when weights/neff are separated #584

Closed. JingyaHuang closed this 1 month ago.

JingyaHuang commented 2 months ago

What does this PR do?

As reported in #576, inference latency is heavily impacted when the weights and the NEFF are not inlined. This is because the weights are not automatically loaded onto the Neuron devices, and without that we suffer a huge host-device communication overhead.

This PR patches the issue by loading the weights onto the Neuron devices when the model is traced with separated weights.
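For context, here is a minimal sketch of that pattern with the `torch_neuronx` API (the toy model and shapes are illustrative; this is not the exact code in the PR):

```python
import torch
import torch_neuronx

# A toy stand-in for any torch.nn.Module.
model = torch.nn.Sequential(torch.nn.Linear(128, 128), torch.nn.ReLU())
example_inputs = torch.rand(1, 128)

# Trace with the weights kept separate from the NEFF (not inlined).
traced = torch_neuronx.trace(model, example_inputs, inline_weights_to_neff=False)

# Without this step the weights stay on the host, and every forward pass
# pays the host-to-device transfer cost. Moving the trace onto a Neuron
# device keeps the weights resident on-device.
torch_neuronx.move_trace_to_device(traced, 0)

output = traced(example_inputs)
```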

Caveat: the current data parallel API doesn't handle the case where the weights and the NEFF are not inlined, so we add the class WeightSeparatedDataParallel here as a temporary workaround. Support for this case is expected in Neuron SDK 2.20, and by then this class will be removed from Optimum Neuron. Note also that, in a few small quick experiments, non-inlined models still show about 1.5x the latency of inlined models. A usage sketch of the workaround follows below.
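A hypothetical usage sketch, assuming WeightSeparatedDataParallel mirrors the `torch_neuronx.DataParallel` interface (the class name comes from this PR, but the import path and the toy model are assumptions):

```python
import torch
import torch_neuronx
from optimum.neuron.utils import WeightSeparatedDataParallel  # import path is an assumption

model = torch.nn.Sequential(torch.nn.Linear(128, 128), torch.nn.ReLU())
example_inputs = torch.rand(1, 128)

# Trace with separated (non-inlined) weights, as above.
traced = torch_neuronx.trace(model, example_inputs, inline_weights_to_neff=False)

# Drop-in analogue of torch_neuronx.DataParallel for the weight-separated
# case: it loads the separated weights onto each NeuronCore before
# replicating the model and splitting the input batch along dim 0.
model_parallel = WeightSeparatedDataParallel(traced)
outputs = model_parallel(torch.rand(8, 128))
```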

Before submitting

HuggingFaceDocBuilderDev commented 2 months ago

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.