huggingface / optimum-neuron

Easy, fast and very cheap training and inference on AWS Trainium and Inferentia chips.
Apache License 2.0

Assistive Generation for optimum.neuron on inf2 #347

Open Manoj-Data-Science opened 11 months ago

Manoj-Data-Science commented 11 months ago

I want to know how we can run speculative decoding (assisted generation) to increase tokens/sec for a Llama 2 based model running with optimum.neuron on inf2, similar to what transformers has done for GPUs by adding assistant_model support for any model in the library. How could assistant_model support be added to the optimum-neuron library?
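
For context, the GPU path in transformers that this refers to looks roughly like the sketch below; the model names are only examples, and the draft model must use the same tokenizer/vocabulary as the main model:

```python
# Rough sketch of assisted generation in transformers on GPU (model names are
# illustrative; any small model with the same vocabulary can serve as the draft).
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf").to("cuda")
# The small draft model proposes tokens; the main model verifies them in one forward pass.
assistant = AutoModelForCausalLM.from_pretrained("JackFram/llama-68m").to("cuda")

inputs = tokenizer("Speculative decoding works by", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, assistant_model=assistant, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```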

JingyaHuang commented 11 months ago

Any thoughts about the issue 👀 ? @dacorvo

dacorvo commented 11 months ago

Edited comment to remove misleading requirement to share cached key values.

@Manoj-Data-Science, I think that to be efficient, assisted generation requires both models to run simultaneously on neuron cores.

For now, optimum.neuron.NeuronModelForCausalLM requires the cached key values to be stored on neuron cores, and only supports loading one model on neuron cores per python process.

Implementing assisted generation would therefore require finding an efficient way to run a NeuronModelForCausalLM instance in parallel on a separate set of cores, and to communicate with it.

To my knowledge, this is not possible with the features available today in the AWS Neuron SDK (more specifically transformers-neuronx).

dacorvo commented 11 months ago

This could be implemented more easily though: https://github.com/apoorvumang/prompt-lookup-decoding.

cc @gante
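
For readers unfamiliar with the technique: prompt lookup decoding replaces the draft model with a simple string match against the prompt, so only one model needs to sit on neuron cores. A minimal sketch of the candidate-generation step (function and parameter names are illustrative, not part of any library):

```python
import torch

def find_candidate_tokens(input_ids: torch.Tensor, ngram_size: int = 3,
                          num_candidates: int = 10) -> torch.Tensor:
    """Return up to `num_candidates` tokens that followed the most recent earlier
    occurrence of the trailing `ngram_size`-gram, or an empty tensor if none."""
    ids = input_ids[0]  # assume batch size 1
    if len(ids) <= ngram_size:
        return ids.new_empty(0)
    ngram = ids[-ngram_size:]
    # Scan right to left for an earlier match, excluding the trailing n-gram itself.
    for start in range(len(ids) - ngram_size - 1, -1, -1):
        if torch.equal(ids[start:start + ngram_size], ngram):
            end = start + ngram_size
            return ids[end:end + num_candidates]
    return ids.new_empty(0)
```

The main model then verifies the candidates in a single forward pass, exactly as in assisted generation, which is why this sidesteps the two-models-on-neuron-cores problem discussed above.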

Manoj-Data-Science commented 11 months ago

Hi @dacorvo, thanks for your reply. For the last few days I have been working on the changes needed in the optimum.neuron library to support assisted generation, and I hope I made them correctly. But as you mentioned, the problem I am facing is that when I try to load the assistant model while the main model is already loaded, it tries to use the same neuron cores as the main model. What workarounds are possible for supporting speculative decoding on AWS inf? Regarding the second point you mentioned about putting the models on different cores: is that possible today?

dacorvo commented 11 months ago

To launch models on two different sets of cores, you currently need to "restrict" the visibility of the neuron runtime to a subset of the cores using environment variables, as explained here: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-runtime/nrt-configurable-parameters.html#nrt-configuration.

IMHO, this is very difficult to achieve without running the models in two different processes, although if the cores are only checked when the model is loaded, it might work to tweak the environment variables programmatically using os.environ.
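
A speculative sketch of that os.environ idea, assuming the runtime reads NEURON_RT_VISIBLE_CORES at model-load time (which may not hold in practice; the safe variant is two separate processes). The model directories are placeholders for pre-compiled neuron models:

```python
import os
from optimum.neuron import NeuronModelForCausalLM

# Pin the main model to cores 0-1 (assumes the runtime honors the variable
# at load time rather than at process start).
os.environ["NEURON_RT_VISIBLE_CORES"] = "0-1"
model = NeuronModelForCausalLM.from_pretrained("./llama2-7b-neuron")

# Then try to pin the draft model to cores 2-3 before loading it.
os.environ["NEURON_RT_VISIBLE_CORES"] = "2-3"
assistant = NeuronModelForCausalLM.from_pretrained("./llama2-68m-neuron")
```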

Forget my point about the cached key values: since the models are different, they cannot be shared anyway. I edited my original comment to avoid introducing more confusion.

Manoj-Data-Science commented 11 months ago

Makes sense. Let me see what else I can do by tweaking the library.

HuggingFaceDocBuilderDev commented 6 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Thank you!

github-actions[bot] commented 4 days ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.