Manoj-Data-Science opened this issue 11 months ago
Any thoughts about the issue 👀 ? @dacorvo
Edited comment to remove misleading requirement to share cached key values.
@Manoj-Data-Science, I think that to be efficient, assisted generation requires both models to run simultaneously on neuron cores.
For now, optimum-neuron's NeuronModelForCausalLM requires the cached key values to be stored on neuron cores, and only supports loading one model on neuron cores per python process.
Implementing assisted generation would require finding an efficient way to run a NeuronModelForCausalLM instance in parallel on a set of cores and have it communicate with either:
- another NeuronModelForCausalLM instance running on the same set of cores, or
- a NeuronModelForCausalLM instance running on a separate set of cores using isolation (each instance only 'sees' a limited number of cores).
To my knowledge, none of this is possible with the features available today in the AWS Neuron SDK (more specifically transformers-neuronx).
Prompt lookup decoding could be implemented more easily though: https://github.com/apoorvumang/prompt-lookup-decoding.
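The idea, as a minimal illustrative sketch (this is not the code from that repo, and the function and parameter names are made up): the draft model is replaced by an n-gram lookup over the tokens already in the context, so there is nothing extra to load on neuron cores.

```python
import torch

def find_candidate_tokens(input_ids: torch.Tensor, max_ngram: int = 3, num_candidates: int = 10) -> torch.Tensor:
    """Return draft tokens by matching the latest n-gram against earlier text."""
    seq = input_ids[0]          # assume batch size 1
    seq_len = seq.size(0)
    for ngram in range(max_ngram, 0, -1):
        if seq_len <= ngram:
            continue
        tail = seq[-ngram:]
        # Scan earlier positions (most recent first) for the same n-gram.
        for start in range(seq_len - ngram - 1, -1, -1):
            if torch.equal(seq[start:start + ngram], tail):
                end = start + ngram
                return seq[end:end + num_candidates]
    return seq.new_empty(0)     # no match: fall back to regular decoding
```

The candidate tokens returned by the lookup would then be verified by the main model in a single forward pass, exactly as in regular assisted generation.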
cc @gante
Hi @dacorvo, thanks for your reply. For the last few days I have been working on making the necessary changes to the optimum.neuron library to support assisted generation, and I hope I made those changes correctly. As you mentioned, the problem I am facing is that loading the assistant model fails when the main model is already loaded, because it tries to use the same neuron cores as the main model. What workarounds are possible for supporting speculative decoding on AWS inf? Regarding your second point about putting the models on different cores, is that possible today?
To launch models on two different sets of cores, you currently need to restrict the visibility of the neuron runtime to a subset of the cores using environment variables, as explained here: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-runtime/nrt-configurable-parameters.html#nrt-configuration.
IMHO, this is very difficult to achieve without running the models in two different processes, although if the cores are only checked when a model is loaded, it might work by tweaking the environment variables programmatically using os.environ.
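A rough sketch of that single-process workaround, assuming the runtime only reads NEURON_RT_VISIBLE_CORES at model load time (the checkpoint names and core ranges are placeholders, and this may well not work for the reasons above):

```python
import os
from optimum.neuron import NeuronModelForCausalLM

# Make only cores 0-1 visible before loading the main model.
os.environ["NEURON_RT_VISIBLE_CORES"] = "0-1"
main_model = NeuronModelForCausalLM.from_pretrained("llama2-7b-neuron")  # placeholder pre-compiled checkpoint

# Switch visibility to cores 2-3 before loading the (smaller) assistant model.
os.environ["NEURON_RT_VISIBLE_CORES"] = "2-3"
assistant_model = NeuronModelForCausalLM.from_pretrained("tinyllama-neuron")  # placeholder pre-compiled checkpoint
```

If the runtime caches the setting per process instead, the two models would have to live in separate processes (e.g. the assistant behind a small RPC or queue).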
Forget my point about the cached key values: since the models are different, they cannot be shared anyway. I edited my original comment to avoid introducing more confusion.
Makes sense. Let me see what else I can do by tweaking the library.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Thank you!
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
I want to know how we can use speculative decoding (assisted generation) to increase the tokens/sec of a llama2-based model running with optimum.neuron on inf2, similar to what transformers has done for GPUs by supporting an assistant_model for any model in the library. How can assistant_model support be added to the optimum-neuron library?
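For reference, this is roughly what the existing GPU path looks like in transformers (the model names are just examples of a main/draft pair sharing a tokenizer); the question is how to get the equivalent with NeuronModelForCausalLM on inf2:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", device_map="auto")
# Small draft model with the same vocabulary as the main model.
assistant = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0", device_map="auto")

inputs = tokenizer("The theory of relativity states that", return_tensors="pt").to(model.device)
# Passing assistant_model enables assisted generation (speculative decoding).
outputs = model.generate(**inputs, assistant_model=assistant, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```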