vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[RFC]: Model architecture plugins #7124

Open NadavShmayo opened 3 months ago

NadavShmayo commented 3 months ago

Motivation.

As a continuation of #5367 - since that pull request was rejected and I now have to maintain my own fork to support this scenario, I suggest adding support in vLLM for model architecture plugins. This would allow new model architectures to be added easily without changing vLLM's core logic, and would support scenarios such as uneven GPU tensor parallelism.

We could build an ecosystem of model architecture plugins, which could significantly accelerate support for new models without risking existing functionality.

Proposed Change.

Supporting this in its basic form is simple, since we only need to add loaded plugins to the ModelRegistry. To support more complex model architectures (such as the #5367 case), we should decouple the config class that provides the number of attention heads from vLLM's core logic, and allow each model architecture to override these values.
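
As a rough sketch of the first part (the module and class names below are hypothetical), a plugin could simply expose a registration hook that adds its architectures to the existing ModelRegistry:

```python
# Hypothetical plugin package, e.g. my_vllm_plugin/__init__.py
from vllm import ModelRegistry

# Hypothetical model implementation shipped by the plugin
from my_vllm_plugin.modeling import MyUnevenTPForCausalLM


def register() -> None:
    """Called once before engine initialization, mapping the architecture
    name found in the HF config.json ("architectures": ["MyUnevenTPForCausalLM"])
    to the plugin's implementation."""
    ModelRegistry.register_model("MyUnevenTPForCausalLM", MyUnevenTPForCausalLM)
```

How vLLM discovers and calls such a register() hook is exactly what this RFC would need to define. The second part of the proposal (letting an architecture override values such as the number of attention heads) would additionally need a hook in the config/parallelism setup, which this sketch does not cover.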

Feedback Period.

No response

CC List.

@youkaichao

Any Other Things.

Just to make it clear, I'll be happy to implement this, but I want to hear some feedback before I go ahead.

DarkLight1337 commented 2 months ago

Potentially related: #7067 introduces an easy way to compose vLLM models, with the relevant code being abstracted by #7153.

sekh77 commented 2 months ago

@NadavShmayo - Is this plugin now available for use with the latest vLLM?

sekh77 commented 2 months ago

I definitely have a need for this feature, and I'm pretty sure many others will also need it to be available in vLLM.

I don't see any need to use more GPUs than necessary to load a given model. For example, if I can load a model on exactly 5 GPUs, why would I need to allocate 8 GPUs for it?

Here's my situation and requirements:

  1. I have 3 nodes in Azure with 12 A100 80GB GPUs (4 GPUs per node), connected through InfiniBand.
  2. In my conversational AI chat application, users can dynamically switch between models in the chat screen at runtime, so they can choose one model over another depending on how it performs on their complex queries.
  3. I want to pre-load my GPUs with LLaMA3.1 70B, Mixtral8x22B, and Databricks DBRX so that my users can choose any of these three models during chat.
  4. The application automatically calculates the model parameters based on information from the model's config.json, and then uses a formula to derive the exact number of GPUs the model requires to load and infer (a rough illustration of this kind of calculation is sketched after this list).
  5. Based on this formula, LLaMA3.1 70B requires 3 GPUs, Mixtral8x22B requires 5 GPUs, and Databricks DBRX requires 4 GPUs.
  6. Ideally, all three models should fit in 12 GPUs. However, with the way the current vLLM architecture calculates this, LLaMA3.1 70B will need 4 GPUs (64 attention heads are not divisible by 3 but are by 4), Mixtral8x22B will need 8 GPUs, and DBRX will need 4 GPUs (no change for DBRX because 4 already matches vLLM's expectation).
  7. This puts me in a situation where I can only load two of the models, taking up 8 GPUs and leaving the remaining 4 GPUs unused. That is not a good use of compute resources, especially given that these are expensive GPUs.
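
For reference, here is a minimal sketch of the kind of sizing calculation I mean in item 4 (my exact formula isn't shown here, so the bytes-per-parameter, overhead factor, and usable-memory figures below are illustrative assumptions):

```python
import math


def estimate_gpus(num_params_b: float, bytes_per_param: int = 2,
                  overhead: float = 1.2, gpu_mem_gb: float = 80 * 0.9) -> int:
    """Rough sizing: fp16/bf16 weights plus a flat overhead factor for
    KV cache and activations, divided by usable memory per GPU.
    All of these factors are illustrative assumptions, not vLLM's own math."""
    weights_gb = num_params_b * bytes_per_param
    return math.ceil(weights_gb * overhead / gpu_mem_gb)


def smallest_valid_tp(required_gpus: int, num_attention_heads: int) -> int:
    """vLLM's tensor parallelism requires the total number of attention heads
    to be divisible by tensor_parallel_size, so round up to the next size
    that satisfies that constraint."""
    tp = required_gpus
    while num_attention_heads % tp != 0:
        tp += 1
    return tp


# Example: a 70B-parameter model with 64 attention heads (as in item 6).
needed = estimate_gpus(70)                    # -> 3 with the assumptions above
print(needed, smallest_valid_tp(needed, 64))  # -> 3 4
```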

I use pipeline_parallel_size = 1 and set tensor_parallel_size to be the exact number of GPUs that a model would need to load based on what is mentioned in this vLLM documentation for distributed inference - https://docs.vllm.ai/en/latest/serving/distributed_serving.html

So, anything that can be done to move away from the current constraints of 2, 4, 8, 16 will be highly beneficial for a lot of enterprises. This is common feedback I hear from people using vLLM. Everything else about vLLM is absolutely great and awesome, no doubt whatsoever.

youkaichao commented 2 months ago

@sekh77 I don't get it. You can just use pipeline_parallel_size=3 without any problem.

sekh77 commented 2 months ago

@youkaichao - Here's my understanding of pipeline_parallel_size. In my case, if I use pipeline_parallel_size = 3 and force tensor_parallel_size to be the exact number of GPUs in a node (which is 4), the world size in vLLM becomes 4*3 = 12.

This means that when I attempt to load LLaMA3.1 70B, which actually requires only 3 GPUs, the above configuration will load it across all 12 GPUs. With gpu_memory_utilization=0.9, I then have no memory left on any of the 12 GPUs to load any other model, because vLLM reserves memory for weights, intermediate states, KV cache, etc. on all GPUs. I'm unable to reduce gpu_memory_utilization below 0.7, as that runs into OOM due to the size of the KV cache for these models.

I'm also helping it a little bit by specifying cpu_offload_gb=10.

If this understanding is incorrect, please do let me know. I'm absolutely OK to adjust my configurations based on appropriate guidelines that you advise. My objective is to meet my requirements as I described in my previous message.

youkaichao commented 2 months ago

if 3 GPUs are enough to hold the model, you can just use -pp 3 -tp 1

sekh77 commented 2 months ago

Ok. TP is calculated dynamically in my inference service pipeline. Assuming I find a way to dynamically override the TP calculation from 3 to 1 for LLaMA 3.1 70B, how would this approach solve for Mixtral8x22B and Databricks DBRX, which require exactly 5 and 4 GPUs to hold the models, respectively?

youkaichao commented 2 months ago

in your script, you just need to change -tp to -pp, and everything should work.

use the tensor parallel size as the new pipeline parallel size

sekh77 commented 2 months ago

Are you suggesting keeping tp = 1 always, and setting pp to the calculated number of GPUs for a model?

sekh77 commented 2 months ago

So:
  - for LLaMA3.1 70B: tp = 1, pp = 3
  - for Mixtral8x22B: tp = 1, pp = 5
  - for Databricks DBRX: tp = 1, pp = 4
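
Something along these lines, for example (a rough sketch using the offline LLM entrypoint purely for illustration; the model ID is just an example, and pipeline-parallel support through this entrypoint may depend on the vLLM version):

```python
from vllm import LLM, SamplingParams

# Illustrative only: one engine per model, tp pinned to 1 and pp set to the
# per-model GPU count above. My real service uses the OpenAI-compatible
# server rather than this offline entrypoint.
llm = LLM(
    model="mistralai/Mixtral-8x22B-Instruct-v0.1",
    tensor_parallel_size=1,
    pipeline_parallel_size=5,
)
print(llm.generate(["Hello!"], SamplingParams(max_tokens=16))[0].outputs[0].text)
```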

Is that what you are suggesting?

youkaichao commented 2 months ago

yes

sekh77 commented 2 months ago

Ok, got it. Let me try this. Will let you know. Thank you.

sekh77 commented 2 months ago

@youkaichao - It worked as expected. Thank you very much for suggesting that route. I have an additional question though: since TP is now 1, there is no tensor parallelism anymore in my case. Am I losing anything with respect to inference throughput? Right now with PP, the latency is in milliseconds with these models on InfiniBand connectivity, but I'm not sure what happens when I scale concurrency.

youkaichao commented 2 months ago

> Am I losing anything with respect to inference throughput?

with pipeline parallel, you pay some extra latency, but the throughput should be the same. be sure to submit enough requests to saturate the pipeline.
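
For example, something like this keeps all pipeline stages busy (a rough sketch against the OpenAI-compatible server; the endpoint, model name, and request count are placeholders):

```python
import asyncio

from openai import AsyncOpenAI

# Assumes a vLLM OpenAI-compatible server is already running with
# -tp 1 -pp 3 for the Llama 3.1 70B deployment described above.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")


async def one_request(i: int) -> str:
    resp = await client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-70B-Instruct",
        messages=[{"role": "user", "content": f"Question {i}: say hi."}],
        max_tokens=32,
    )
    return resp.choices[0].message.content


async def main() -> None:
    # Many in-flight requests keep every pipeline stage occupied, so the
    # extra per-request latency of pp does not reduce overall throughput.
    results = await asyncio.gather(*(one_request(i) for i in range(64)))
    print(len(results))


asyncio.run(main())
```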