Hello @WoosukKwon @zhuohan123 @simon-mo,
I understand that you may have a lot on your plate, but I would greatly appreciate any feedback or thoughts you might have on this proposal when you have a moment. Your input would be invaluable in helping to refine and move this project forward. :)
Gentle reminder about this RFC — any feedback would be greatly appreciated.
Motivation.
The RBLN SDK provides a solution for deep learning inference on Rebellions' NPUs, such as ATOM and REBEL, including support for large language models (LLMs). This project aims to develop an RBLN backend for vLLM, initially prioritizing the ATOM device, with plans to enable REBEL support later.
In alignment with Rebellions' Optimum Hugging Face extension documentation, the RBLN backend will support a wide range of models available in Rebellions' Model Zoo.
The project currently incorporates the continuous batching feature and will soon integrate additional techniques, such as PagedAttention, to further enhance performance.
Proposed Change.
Introduce the RBLN vLLM backend, which will:

- Add a `requirements.txt` for the RBLN environment.

Target Models
We will start by ensuring vLLM works with the Llama architecture and expand to other architectures. The full list of LLMs supported by RBLN can be viewed here.
Design
We will introduce several custom classes that align with the vLLM architecture for heterogeneous accelerators (such as Neuron, XPU, TPU...). See the diagram below for details.
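To make the class layout concrete, here is a minimal sketch of how the custom classes could mirror vLLM's executor/worker/model-runner pattern used by the other accelerator backends (Neuron, XPU, TPU). All class and method names below are illustrative assumptions for this RFC, not the final API, and the device calls are stubbed out rather than invoking the RBLN SDK.

```python
class RBLNModelRunner:
    """Loads a compiled RBLN model and runs per-step inference (sketch)."""

    def __init__(self, model_name: str):
        self.model_name = model_name
        self.model = None

    def load_model(self) -> None:
        # In the real backend this would compile/load the model through
        # the RBLN SDK; here we only record a placeholder string.
        self.model = f"compiled({self.model_name})"

    def execute_model(self, input_tokens: list[int]) -> list[int]:
        assert self.model is not None, "call load_model() first"
        # Placeholder for a forward pass on the ATOM NPU; echoes input.
        return input_tokens


class RBLNWorker:
    """Owns one RBLN device and delegates execution to the model runner."""

    def __init__(self, model_name: str):
        self.model_runner = RBLNModelRunner(model_name)

    def init_device(self) -> None:
        self.model_runner.load_model()

    def execute_model(self, input_tokens: list[int]) -> list[int]:
        return self.model_runner.execute_model(input_tokens)


class RBLNExecutor:
    """Entry point the vLLM engine would call into (sketch)."""

    def __init__(self, model_name: str):
        self.worker = RBLNWorker(model_name)
        self.worker.init_device()

    def execute_model(self, input_tokens: list[int]) -> list[int]:
        return self.worker.execute_model(input_tokens)
```

The three-layer split (executor → worker → model runner) follows the structure vLLM already uses for heterogeneous devices, which keeps the engine-facing interface unchanged while isolating all device-specific logic in the runner.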
Implementation Details
Initialize model
Model-specific (e.g., Llama-specific) forward functions
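One way to organize the model-specific forward functions mentioned above is a small registry keyed by architecture, so the backend can start with Llama and add other architectures incrementally. The registry, decorator, and function signatures here are assumptions sketched for illustration; the placeholder body stands in for the actual RBLN forward pass.

```python
from typing import Callable

# Maps an architecture name to its RBLN forward function (illustrative).
_FORWARD_REGISTRY: dict[str, Callable[[list[int]], list[int]]] = {}


def register_forward(arch: str):
    """Decorator that registers a forward function for one architecture."""
    def decorator(fn):
        _FORWARD_REGISTRY[arch] = fn
        return fn
    return decorator


@register_forward("llama")
def llama_forward(input_ids: list[int]) -> list[int]:
    # Placeholder for the Llama-specific forward pass on the RBLN device;
    # a real implementation would run the compiled graph on the NPU.
    return [t + 1 for t in input_ids]


def forward(arch: str, input_ids: list[int]) -> list[int]:
    """Dispatch to the registered forward function for `arch`."""
    try:
        fn = _FORWARD_REGISTRY[arch]
    except KeyError:
        raise NotImplementedError(f"no RBLN forward for {arch!r}")
    return fn(input_ids)
```

Starting with a single `"llama"` entry matches the plan to validate the Llama architecture first and then expand the registry as more Model Zoo architectures are brought up.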
References
Feedback Period.
1 week
CC List.
@WoosukKwon , @rebel-shshin, @rebel-hekim, @rebel-hongseok
Any Other Things.