Manoj-Data-Science opened this issue 11 months ago
Any thoughts about the issue 👀 ? @dacorvo
Edited comment to remove misleading requirement to share cached key values.
@Manoj-Data-Science, I think that to be efficient, assisted generation requires both models to run simultaneously on neuron cores.
For now, optimum-neuron's NeuronModelForCausalLM requires the cached key values to be stored on neuron cores, and only supports loading one model on neuron cores per python process.
Implementing assisted generation would require finding an efficient way to run a NeuronModelForCausalLM instance in parallel on a set of cores and have it communicate with either:
- another NeuronModelForCausalLM instance running on the same set of cores, or
- a NeuronModelForCausalLM instance running on a separate set of cores using isolation (each instance only 'sees' a limited number of cores).
To my knowledge, none of this is possible with the features available today in the AWS Neuron SDK (more specifically transformers-neuronx).
Prompt lookup decoding could be implemented more easily though: https://github.com/apoorvumang/prompt-lookup-decoding.
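The idea, as a minimal illustrative sketch (this is not the code from that repo, and the function and parameter names are made up): the draft model is replaced by an n-gram lookup over the tokens already in the context, so there is nothing extra to load on neuron cores.

```python
import torch

def find_candidate_tokens(input_ids: torch.Tensor, max_ngram: int = 3, num_candidates: int = 10) -> torch.Tensor:
    """Return draft tokens by matching the latest n-gram against earlier text."""
    seq = input_ids[0]          # assume batch size 1
    seq_len = seq.size(0)
    for ngram in range(max_ngram, 0, -1):
        if seq_len <= ngram:
            continue
        tail = seq[-ngram:]
        # Scan earlier positions (most recent first) for the same n-gram.
        for start in range(seq_len - ngram - 1, -1, -1):
            if torch.equal(seq[start:start + ngram], tail):
                end = start + ngram
                return seq[end:end + num_candidates]
    return seq.new_empty(0)     # no match: fall back to regular decoding
```

The candidate tokens returned by the lookup would then be verified by the main model in a single forward pass, exactly as in regular assisted generation.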
cc @gante
Hi @dacorvo, thanks for your reply. For the last few days I have been working on making the necessary changes to the optimum.neuron library to support assisted generation, and I hope I made those changes correctly. As you mentioned, the problem I am facing is that loading the assistant model fails when the main model is already loaded, because it tries to use the same neuron cores as the main model. What workarounds are possible for supporting speculative decoding on AWS inf? Regarding your second point about putting the models on different cores, is that possible today?
To launch models on two different sets of cores, you currently need to restrict the visibility of the neuron runtime to a subset of the cores using environment variables, as explained here: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-runtime/nrt-configurable-parameters.html#nrt-configuration.
IMHO, this is very difficult to achieve without running the models in two different processes, although if the cores are only checked when a model is loaded, it might work by tweaking the environment variables programmatically using os.environ.
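A rough sketch of that single-process workaround, assuming the runtime only reads NEURON_RT_VISIBLE_CORES at model load time (the checkpoint names and core ranges are placeholders, and this may well not work for the reasons above):

```python
import os
from optimum.neuron import NeuronModelForCausalLM

# Make only cores 0-1 visible before loading the main model.
os.environ["NEURON_RT_VISIBLE_CORES"] = "0-1"
main_model = NeuronModelForCausalLM.from_pretrained("llama2-7b-neuron")  # placeholder pre-compiled checkpoint

# Switch visibility to cores 2-3 before loading the (smaller) assistant model.
os.environ["NEURON_RT_VISIBLE_CORES"] = "2-3"
assistant_model = NeuronModelForCausalLM.from_pretrained("tinyllama-neuron")  # placeholder pre-compiled checkpoint
```

If the runtime caches the setting per process instead, the two models would have to live in separate processes (e.g. the assistant behind a small RPC or queue).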
Forget my point about the cached key values: since the models are different, they cannot be shared anyway. I edited my original comment to avoid introducing more confusion.
Makes sense. Let me see what else I can do by tweaking the library.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Thank you!
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
I want to know how we can use speculative decoding (assisted generation) to increase the tokens/sec of a llama2-based model running with optimum.neuron on inf2, similar to what transformers has done for GPUs by supporting an assistant_model for any model in the library. How can assistant_model support be added to the optimum-neuron library?
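For reference, this is roughly what the existing GPU path looks like in transformers (the model names are just examples of a main/draft pair sharing a tokenizer); the question is how to get the equivalent with NeuronModelForCausalLM on inf2:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", device_map="auto")
# Small draft model with the same vocabulary as the main model.
assistant = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0", device_map="auto")

inputs = tokenizer("The theory of relativity states that", return_tensors="pt").to(model.device)
# Passing assistant_model enables assisted generation (speculative decoding).
outputs = model.generate(**inputs, assistant_model=assistant, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```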