@simon-mo Regarding prefill disaggregation: the Splitwise and DistServe papers both build their solutions on top of vLLM for evaluation. Are there any contributions coming from those teams? Is the vLLM community open to public contributions for this feature?
@Jeffwan yes! We are actively working with the authors of both papers to integrate the work properly. We are also working with Sarathi's authors on chunked prefill as well.
Any update for PEFT?
Please consider supporting Hugging Face PEFT, thank you. #1129
Hi @kanseaveg, we do support LoRA and are planning to add prefix tuning support, which should allow the Hugging Face PEFT model format. Which PEFT methods are you interested in?
@simon-mo Thank you very much for your reply. There are three common types of tuning methods that I am currently concerned about:
Maybe consider supporting the QuaRot quantization scheme?
> QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs
>
> We introduce QuaRot, a new Quantization scheme based on Rotations, which is able to quantize LLMs end-to-end, including all weights, activations, and KV cache, in 4 bits. QuaRot rotates LLMs in a way that removes outliers from the hidden state without changing the output, making quantization easier. This computational invariance is applied to the hidden state (residual) of the LLM, as well as to the activations of the feed-forward components, aspects of the attention mechanism, and to the KV cache. The result is a quantized model where all matrix multiplications are performed in 4 bits, without any channels identified for retention in higher precision. Our quantized LLaMa2-70B model has losses of at most 0.29 WikiText-2 perplexity and retains 99% of the zero-shot performance. Code is available at: this https URL.
I think this would be huge for larger models like Command-R+ (104B) being able to fit into a single 80G A100 with negligible performance losses.
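For anyone skimming, the key trick is computational invariance: if `Q` is orthogonal, rotating the hidden state by `Q` and the adjacent weights by `Q^T` leaves every matmul unchanged while smearing outlier channels across dimensions, which is what makes end-to-end 4-bit quantization viable. A toy sketch of that property (my own illustration with a random orthogonal matrix, not QuaRot's code, which uses Hadamard-based rotations):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "hidden state" with one outlier channel, and a toy weight matrix.
x = rng.normal(size=(4, 8))
x[:, 3] *= 50.0                      # outlier channel, hard to quantize
W = rng.normal(size=(8, 8))

# Random orthogonal rotation Q (QuaRot itself uses Hadamard-based rotations).
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))

# Computational invariance: (x Q)(Q^T W) == x W, since Q Q^T = I.
y_ref = x @ W
y_rot = (x @ Q) @ (Q.T @ W)
assert np.allclose(y_ref, y_rot)

# The rotation spreads the outlier energy across channels, shrinking the
# per-channel dynamic range that quantization has to cover.
print("max |x|  per channel (original):", np.abs(x).max(axis=0).round(1))
print("max |xQ| per channel (rotated): ", np.abs(x @ Q).max(axis=0).round(1))
```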
Very excited to see both Embedding models and CPU support on the roadmap!
These being implemented would make vLLM my default model serving engine.
Very excited to see that the tensorizer PR is in this roadmap! Sorry about all the pings, I'm just passionate about getting this to vLLM users :D More than happy to be of any assistance in getting that feature implemented :)
Will larger vocabulary size for multi-lora be supported in Q2 2024? Related: https://github.com/vllm-project/vllm/issues/3000
I'm very interested in implementing tree attention for speculative decoding. @simon-mo
> Will larger vocabulary size for multi-lora be supported in Q2 2024? Related: #3000

https://github.com/vllm-project/vllm/pull/4015 has done this.
This is strange: serving a LoRA finetune of Llama-3 (vocab size 12800) hits the same problem, `When using LoRA, vocab size must be 32000 >= vocab_size <= 33024`. However, the same code finetuned on Qwen1.5-7B-Chat, with vocab size 151643, has no such serving problem. Why?
The function `create_lora_weights` from `LogitsProcessorWithLoRA` throws this error. Models using the Llama architecture designate `lm_head` as a LoRA target module and therefore need to instantiate `LogitsProcessorWithLoRA`; refer to https://github.com/vllm-project/vllm/blob/main/vllm/lora/models.py#438. Models such as Qwen2 don't designate `lm_head` as a LoRA target module, so they don't instantiate `LogitsProcessorWithLoRA`.
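Roughly, the check that fires looks like the sketch below. This is a paraphrase reconstructed from the error message in this thread, not vLLM's actual source; as I understand it, the narrow range comes from the pre-compiled LoRA kernels.

```python
# Paraphrase of the constraint described above; the class and method names
# mirror the thread, but the body is illustrative, not vLLM's source.

class LogitsProcessorWithLoRASketch:
    def __init__(self, vocab_size: int):
        self.vocab_size = vocab_size

    def create_lora_weights(self) -> None:
        # Only vocab sizes in this narrow window are accepted when LoRA
        # wraps the logits processor.
        if not (32000 <= self.vocab_size <= 33024):
            raise ValueError(
                "When using LoRA, vocab size must be "
                "32000 >= vocab_size <= 33024"
            )

# Llama-3's 128256-token vocabulary falls outside the window, so constructing
# the layer fails. Qwen models never reach this code path, because lm_head is
# not a LoRA target module there.
try:
    LogitsProcessorWithLoRASketch(vocab_size=128256).create_lora_weights()
except ValueError as err:
    print(err)
```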
I see, but `lm_head` is not finetuned during LoRA, so there is no need to replace the `logits_processor`. In my adapter_config.json, `target_modules` does not contain `lm_head`:
"target_modules": [
"gate_proj",
"v_proj",
"q_proj",
"o_proj",
"up_proj",
"k_proj",
"down_proj"
],
vLLM supports multi-LoRA; whether to replace the `logits_processor` is determined by the model's supported LoRA modules, not by the adapter_config.json.
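To spell out the behavior being described: the set of modules the model class declares as LoRA-capable drives which wrappers get built, and the adapter's own `target_modules` list is not consulted for that decision. A toy sketch (the module lists and function names below are hypothetical stand-ins, not vLLM's actual identifiers):

```python
# Illustrative sketch of the behavior described above; not vLLM's source.

LLAMA_LORA_CAPABLE_MODULES = [
    "q_proj", "k_proj", "v_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj", "lm_head",
]
QWEN_LORA_CAPABLE_MODULES = [
    "q_proj", "k_proj", "v_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj",
]

# target_modules from the adapter_config.json shown earlier in this thread.
ADAPTER_TARGETS = [
    "gate_proj", "v_proj", "q_proj", "o_proj", "up_proj", "k_proj", "down_proj",
]

def lora_wrapped_modules(model_capable_modules, adapter_target_modules):
    # The adapter's list is deliberately ignored here: the model's own
    # capability list decides what gets wrapped, which is why lm_head (and
    # its vocab-size check) is pulled in for Llama even though this adapter
    # never trained it.
    return list(model_capable_modules)

print(lora_wrapped_modules(LLAMA_LORA_CAPABLE_MODULES, ADAPTER_TARGETS))
# -> includes "lm_head": LogitsProcessorWithLoRA is built, so the vocab check fires
print(lora_wrapped_modules(QWEN_LORA_CAPABLE_MODULES, ADAPTER_TARGETS))
# -> no "lm_head": no logits-processor replacement, so no vocab-size restriction
```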
Would like to help with #620.
> @Jeffwan yes! We are actively working with the authors of both papers to integrate the work properly. We are also working with Sarathi's authors on chunked prefill as well.
Looking forward to the release of vLLM's prefill/decode disaggregation support.
@simon-mo Hi, how about https://arxiv.org/abs/2404.18057? It seems to have a significant advantage on long sequences, and it does not conflict with PagedAttention.
@simon-mo Is there any update on #3117? It was raised in February, and it has been nearly three months. We sincerely look forward to an update on this; thank you.
Still in progress. @robertgshaw2-neuralmagic can help comment more.
Do you have plans to incorporate RISC-V or ARM CPU backends into the vLLM project? Thank you.
We should consider long-context optimizations for Q3.
Hi - with smaller models being popular these days, I'm wondering if there are any plans for Q3 to support data parallelism (loading copies of the same model onto multiple GPUs). If not, I can help with this.
Do you have plans to support NVIDIA Jetson devices (aarch64)?
> Hi - with smaller models being popular these days, I'm wondering if there are any plans for Q3 to support data parallelism (loading copies of the same model onto multiple GPUs). If not, I can help with this.
Are you thinking this would be something handled internally by `LLMEngine`, or a new front end that stands in front?

If handled internally, this will require significant changes to the core logic.

Also, if this is targeted at offline batch mode, perhaps we will see some gains, though I suspect not too much, since we can saturate the GPU via batching even with TP.

If this is targeted at online serving, I do not think we should be implementing a load balancer in vLLM. This should be handled by higher-level orchestrators like Kubernetes or Ray.
My particular use case is automatic large offline batches, for which I have a hotfix: I spin up multiple OpenAI servers and distribute the prompts among them. Curiously, I see large speedups when I do this, as opposed to TP.

> Also, if this is targeted at offline batch mode, perhaps we will see some gains, though I suspect not too much, since we can saturate the GPU via batching even with TP.

I'm not sure if this is a bug or something else, because I did indeed see large speedups with this when I completely removed Ray worker communication (some digging suggested the overhead is not worth it). If this is not expected, I can try some experiments and post them here. (This may be an artifact of my having a PCIe GPU cluster, without NVLink.)
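For reference, here is a stripped-down sketch of that hotfix, assuming two vLLM OpenAI-compatible servers are already running on ports 8000 and 8001; the endpoints, model name, and prompts are placeholders.

```python
# Minimal sketch of the workaround described above: fan prompts out to
# several independently launched vLLM OpenAI-compatible servers.
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

SERVERS = ["http://localhost:8000/v1", "http://localhost:8001/v1"]
clients = [OpenAI(base_url=url, api_key="EMPTY") for url in SERVERS]

prompts = [f"Summarize item {i} in one sentence." for i in range(64)]

def complete(idx_prompt):
    idx, prompt = idx_prompt
    client = clients[idx % len(clients)]        # round-robin across servers
    resp = client.completions.create(
        model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model name
        prompt=prompt,
        max_tokens=64,
    )
    return resp.choices[0].text

with ThreadPoolExecutor(max_workers=len(SERVERS) * 8) as pool:
    outputs = list(pool.map(complete, enumerate(prompts)))

print(len(outputs), "completions collected")
```

Each server is launched independently (no TP, no Ray workers), so there is no inter-GPU communication at all; the client just load-balances the prompts.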
Okay, great. We would welcome a contribution focused on the offline batch-processing case.

Could you make an RFC issue to discuss a potential design? I think we should try hard not to modify `LLMEngine` and see if we can handle things in the `LLM` class.
Very excited to see that function calling support in the OpenAI-compatible server is in this roadmap! This is quite helpful when using LangChain.
> @Jeffwan yes! We are actively working with the authors of both papers to integrate the work properly. We are also working with Sarathi's authors on chunked prefill as well.
Hi @simon-mo. Is there any update on Splitwise? It seems that development of https://github.com/vllm-project/vllm/pull/2809 has stopped.
Would love to see updates to the docs on how to use supported vision models, embedding models, and the new support for tools with forced tool choice (auto tool choice is still WIP, as I understand).
Hi @simon-mo, is there any plan to support Huawei NPU hardware?
> Hi @simon-mo, is there any plan to support Huawei NPU hardware?

@simon-mo Some companies have no moral bottom line; don't have anything to do with them.
The Q3 roadmap is published here: #5805
Is function calling available yet?
Soon, for Hermes and Mistral models, in #5649.

If there are other specific models you're interested in, let me know and I can add them in my follow-up PR along with Llama 3.1.
So Llama is not included initially? Thanks.
@K-Mistele Is there a PR or issue I can follow for function calling support with Llama 3.1 (70B specifically)?
There is a branch on my vLLM fork, but not a PR yet since #5649 needs to be merged before I open another PR based on it.
This document includes the features in vLLM's roadmap for Q2 2024. Please feel free to discuss and contribute to the specific features at related RFC/Issues/PRs and add anything else you'd like to talk about in this issue.
You can see our historical roadmaps at #2681 and #244. This roadmap contains work committed by the vLLM team from UC Berkeley, as well as the broader vLLM contributor groups, including but not limited to Anyscale, IBM, NeuralMagic, Roblox, and Oracle Cloud. You can also find help wanted items in this roadmap! Additionally, this roadmap is shaped by you, our user community!
Themes.

We categorized our roadmap into 6 broad themes:

- Broad Model Support
  - Help Wanted: `transformers` text generation model
- Excellent Hardware Coverage
- Performance Optimization
  - Help Wanted:
- Production Level Engine
  - Help Wanted:
- Strong OSS Product
  - Help Wanted: `lm-eval-harness` (logprobs, get tokenizers)
- Extensible Architecture
  - `torch.compile` investigations