[Roadmap] vLLM Roadmap Q4 2024

simon-mo commented 1 month ago

This page is accessible via roadmap.vllm.ai

Themes.

As before, we categorized our roadmap into 6 broad themes: broad model support, wide hardware coverage, state-of-the-art performance optimization, production-level engine, strong OSS community, and extensible architectures. As we are seeing more

Broad Model Support

[ ] Enhance LLM Support
- [ ] Hybrid/Interleaved Attention (#9464)
[ ] Enhance Multi-Modality in vLLM (#4194)
[ ] Enhance Support for State Space Models (Mamba)
[ ] Reward Model API (#8967)
[ ] Arbitrary HF model (a collaboration with Hugging Face!)
[ ] Whisper

Help wanted:

[ ] Expand coverage for encoder-decoder models (Bert, XLMRoberta, BGE, T5) (#5447)
[ ] API for streaming input (in particular for audio)

Hardware Support

[ ] A feature matrix for all the hardware that vLLM supports, and their maturity level
[ ] Expand feature support across hardware backends
- [ ] Fast PagedAttention and Chunked Prefill on Inferentia
- [ ] Upstream of Intel Gaudi
- [ ] Enhancements in TPU Support
- [ ] Upstream enhancements in AMD MI300x
- [ ] Performance enhancement and measurement for NVIDIA H200
- [ ] New accelerator support: IBM Spyre

Help wanted:

[ ] Design for pluggable, out-of-tree hardware backend similar to PyTorch’s PrivateUse API
[ ] Prototype JAX support

Performance Optimizations

[ ] Turn on chunked prefill, prefix caching, speculative decoding by default
[ ] Optimizations for structured outputs
[ ] Fused GEMM/all-reduce leveraging Flux and AsyncTP
[ ] Enhancement and overhead-removal in offline LLM use cases.
[ ] Better kernels (FA3, FlashInfer, FlexAttention, Triton)
[ ] Native integration with torch.compile

Help wanted:

[ ] A fast ngrams speculator
[ ] Sparse KV cache framework (#5751)
[ ] Long context optimizations: context parallelism, etc.

Production Features

[ ] KV cache offload to CPU and disk
[ ] Disaggregated Prefill
[ ] More control in prefix caching, and scheduler policies
[ ] Automated speculative decoding policy, see Dynamic Speculative Decoding

Help wanted:

[ ] Support multiple models in the same server

OSS Community

[ ] Enhancements in performance benchmark: more realistic workload, more hardware backends (H200s)
[ ] Better developer documentation for getting started with contribution and research

Help wanted:

[ ] Documentation enhancements in general (styling, UI, explainers, tutorials, examples, etc.)

Extensible Architecture

[ ] Full support for torch.compile
[ ] vLLM Engine V2: Asynchronous Scheduling and Prefix Caching Centric Design (#8779)
[ ] A generic memory manager supporting multi-modality, sparsity, and others

If an item you wanted is not on the roadmap, your suggestions and contributions are still welcome! Please feel free to comment in this thread, open a feature request, or create an RFC.

Historical Roadmap: #5805, #3861, #2681, #244

IsaacRe commented 1 month ago

Support for KV cache compression

[ ] upstream https://github.com/IsaacRe/vllm-kvcompress/tree/main - related issues (3532, 5751)

ksjadeja commented 1 month ago

Do we have plans to support https://github.com/vllm-project/vllm/issues/5540? We have a production-level use case and would really appreciate it if someone could look into it for Q4 onwards.

sylviayangyy commented 1 month ago

Hi, do we have any follow-up issue or Slack channel for the "KV cache offload to CPU and disk" task? Our team has previously explored some KV cache offload work based on vLLM, and we'd be happy to join any relevant discussion or contribute to the development if there's a chance.

Personally, I'm also looking forward to knowing more about the "More control in prefix caching, and scheduler policies" part 😊.

zeroorhero commented 1 month ago

@simon-mo Hi, regarding the topic “KV cache offload to CPU and disk”, I previously implemented a version that stores the KV cache in a local file (https://github.com/vllm-project/vllm/pull/8018). I also added the relevant abstractions, so other storage media can be added. Is there a Slack channel for this? We could discuss the specific scheme there. I am also quite interested in this feature.
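To make the idea of a storage abstraction with swappable media concrete, here is a minimal sketch. It is not the code from the linked PR: the names `KVBlockStore`, `MemoryStore`, and `LocalFileStore` are hypothetical, and KV blocks are represented as plain nested lists rather than tensors.

```python
import os
import pickle
import tempfile
from abc import ABC, abstractmethod


class KVBlockStore(ABC):
    """Hypothetical interface for a medium that holds offloaded KV cache blocks."""

    @abstractmethod
    def put(self, block_hash, block):
        """Store one block under its hash."""

    @abstractmethod
    def get(self, block_hash):
        """Return the stored block, or None on a miss."""


class MemoryStore(KVBlockStore):
    """Keeps offloaded blocks in a CPU-side dictionary."""

    def __init__(self):
        self._blocks = {}

    def put(self, block_hash, block):
        self._blocks[block_hash] = block

    def get(self, block_hash):
        return self._blocks.get(block_hash)


class LocalFileStore(KVBlockStore):
    """Keeps each offloaded block in its own file under one directory."""

    def __init__(self, directory):
        self._dir = directory
        os.makedirs(directory, exist_ok=True)

    def _path(self, block_hash):
        return os.path.join(self._dir, block_hash + ".kv")

    def put(self, block_hash, block):
        with open(self._path(block_hash), "wb") as f:
            pickle.dump(block, f)

    def get(self, block_hash):
        try:
            with open(self._path(block_hash), "rb") as f:
                return pickle.load(f)
        except FileNotFoundError:
            return None


def demo():
    """Offload one toy block to disk and read it back."""
    with tempfile.TemporaryDirectory() as directory:
        store = LocalFileStore(directory)
        store.put("block-0", [[0.1, 0.2], [0.3, 0.4]])
        return store.get("block-0")
```

Because callers only see `put` and `get`, another medium (for example a remote store) could be added as one more subclass without touching the code that decides when to offload.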

simon-mo commented 1 month ago

@sylviayangyy @zeroorhero thank you for your interest! Yes, @KuntaiDu has created a #feat-kvcache-offloading channel to discuss that.

jeejeelee commented 1 month ago

> Do we have plans to support #5540? We have a production-level use case and would really appreciate it if someone could look into it for Q4 onwards.

It looks like LoRA is now supported. Are you encountering any issues?

iiLaurens commented 1 month ago

Any plans for improving guided decoding? There's a long-standing RFC for it (#5423) and previous attempts have been made (e.g. #6273). Unfortunately, it seems to have been forgotten since.

In particular, I'd love to see it become async (logit masks or biases can be calculated while the GPU is computing logits) and support fast-forwarding tokens when the next few tokens are deterministic.
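As a rough illustration of the two ideas mentioned here, the toy sketch below masks logits so only constraint-respecting tokens can be chosen, and skips the model call entirely when exactly one token is allowed. Everything in it is an assumption made for illustration: the constraint is a small set of valid output strings rather than a real grammar, the function names are hypothetical, and the overlap of mask computation with GPU work is not shown.

```python
def allowed_tokens(prefix, vocab, valid_outputs):
    """Ids of tokens that keep prefix + token a prefix of some valid output."""
    return [
        i for i, tok in enumerate(vocab)
        if any(v.startswith(prefix + tok) for v in valid_outputs)
    ]


def apply_mask(logits, allowed):
    """Set the logit of every disallowed token to -inf."""
    allowed_set = set(allowed)
    return [x if i in allowed_set else float("-inf") for i, x in enumerate(logits)]


def guided_decode(model_logits, vocab, valid_outputs):
    """Greedy decoding under the constraint; returns (text, number of model calls).

    Assumes no valid output is a prefix of another valid output.
    """
    text, model_calls = "", 0
    while text not in valid_outputs:
        allowed = allowed_tokens(text, vocab, valid_outputs)
        if not allowed:
            raise ValueError("no token can extend the current prefix")
        if len(allowed) == 1:
            # Fast-forward: the next token is forced, so no model call is needed.
            text += vocab[allowed[0]]
            continue
        model_calls += 1
        masked = apply_mask(model_logits(text), allowed)
        best = max(range(len(vocab)), key=lambda i: masked[i])
        text += vocab[best]
    return text, model_calls


def demo():
    vocab = ['{"answer": ', '"yes"', '"no"', '}', 'hello']
    valid_outputs = {'{"answer": "yes"}', '{"answer": "no"}'}
    # Stand-in model: fixed logits that prefer the invalid token 'hello'.
    return guided_decode(lambda prefix: [0.0, 1.0, 2.0, 0.5, 9.0], vocab, valid_outputs)
```

In the demo the opening `{"answer": ` and the closing `}` are forced by the constraint, so the stand-in model is only consulted once, to choose between `"yes"` and `"no"`.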

HuYunhai-Alex commented 1 month ago

Is there an opportunity to participate in changes related to speculative decoding? I'm working on some practices that could help.

devdev999 commented 1 month ago

> Any plans for improving guided decoding? There's a long-standing RFC for it (#5423) and previous attempts have been made (e.g. #6273). Unfortunately, it seems to have been forgotten since.
>
> In particular, I'd love to see it become async (logit masks or biases can be calculated while the GPU is computing logits) and support fast-forwarding tokens when the next few tokens are deterministic.

I second this. We are using vLLM to host our production inference servers, and all of our downstream applications rely on guided JSON decoding to ensure that output is parsable. There is a significant performance difference between guided and non-guided decoding, and any performance improvements would help increase throughput.

Harsha-Nori commented 1 month ago

> > Any plans for improving guided decoding? There's a long-standing RFC for it (#5423) and previous attempts have been made (e.g. #6273). Unfortunately, it seems to have been forgotten since. In particular, I'd love to see it become async (logit masks or biases can be calculated while the GPU is computing logits) and support fast-forwarding tokens when the next few tokens are deterministic.

> I second this. We are using vLLM to host our production inference servers, and all of our downstream applications rely on guided JSON decoding to ensure that output is parsable. There is a significant performance difference between guided and non-guided decoding, and any performance improvements would help increase throughput.

Hey, I maintain the guidance project and we worked on the first proposal in #6273. Looks like vLLM has changed significantly since then, but if there is appetite from the maintainers for upgraded/more performant guided decoding work, we're happy to take another look and investigate a new PR. In particular, guidance (and our high-performance Rust implementation in llguidance) already does async computations on CPU, calculates fast-forward tokens, etc., and is typically accelerative for JSON schema.

@JC1DA @mmoskal

ksjadeja commented 3 weeks ago

> > Do we have plans to support #5540? We have a production-level use case and would really appreciate it if someone could look into it for Q4 onwards.

> It looks like LoRA is now supported. Are you encountering any issues?

Yes. The class in mixtral_quant.py does not have SupportsLoRA, which means LoRA is not supported for quantized Mixtral, whereas MixtralForCausalLM in mixtral.py does include SupportsLoRA. I have a trained LoRA adapter that I want to use on top of a mixtral-awq model without merging, directly as a hot swap. Let me know if you know a better way to tackle this situation.
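The reasoning above (a model class either carries a LoRA marker or it does not, and enabling LoRA on a class without it fails) can be sketched as follows. This is a simplified stand-in, not vLLM's actual loader code: the marker is modeled as an empty mixin named `SupportsLoRA` here, the two model classes are empty placeholders with hypothetical names, and `check_lora_enabled` is a made-up helper.

```python
class SupportsLoRA:
    """Marker mixin: a model class that inherits from it declares LoRA support."""


class MixtralWithLoRA(SupportsLoRA):
    """Placeholder for a model class that declares LoRA support."""


class QuantizedMixtral:
    """Placeholder for a model class that lacks the marker."""


def check_lora_enabled(model_cls, enable_lora):
    """Return model_cls, or raise if LoRA is requested but the marker is missing."""
    if enable_lora and not issubclass(model_cls, SupportsLoRA):
        raise ValueError(
            f"Model {model_cls.__name__} does not support LoRA, but LoRA is enabled."
        )
    return model_cls
```

Under this sketch, which class ends up being instantiated decides whether `enable_lora=True` is accepted, so two classes for the same architecture can behave differently.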

jeejeelee commented 3 weeks ago

> > > Do we have plans to support #5540? We have a production-level use case and would really appreciate it if someone could look into it for Q4 onwards.

> > It looks like LoRA is now supported. Are you encountering any issues?

> Yes. The class in mixtral_quant.py does not have SupportsLoRA, which means LoRA is not supported for quantized Mixtral, whereas MixtralForCausalLM in mixtral.py does include SupportsLoRA. I have a trained LoRA adapter that I want to use on top of a mixtral-awq model without merging, directly as a hot swap. Let me know if you know a better way to tackle this situation.

I'm guessing you explicitly set the quantization, right? If so, you can try removing that argument and test it out, like the following script:

from vllm import LLM

# Note: no explicit quantization argument is passed here.
llm = LLM(
    model="Mixtral-8x7B-Instruct-v0.1-GPTQ",
    trust_remote_code=True,
    gpu_memory_utilization=0.6,
    enable_lora=True,
)

dbuades commented 3 weeks ago

> > > Any plans for improving guided decoding? There's a long-standing RFC for it (#5423) and previous attempts have been made (e.g. #6273). Unfortunately, it seems to have been forgotten since. In particular, I'd love to see it become async (logit masks or biases can be calculated while the GPU is computing logits) and support fast-forwarding tokens when the next few tokens are deterministic.

> > I second this. We are using vLLM to host our production inference servers, and all of our downstream applications rely on guided JSON decoding to ensure that output is parsable. There is a significant performance difference between guided and non-guided decoding, and any performance improvements would help increase throughput.

> Hey, I maintain the guidance project and we worked on the first proposal in #6273. Looks like vLLM has changed significantly since then, but if there is appetite from the maintainers for upgraded/more performant guided decoding work, we're happy to take another look and investigate a new PR. In particular, guidance (and our high-performance Rust implementation in llguidance) already does async computations on CPU, calculates fast-forward tokens, etc., and is typically accelerative for JSON schema.
>
> @JC1DA @mmoskal

Improvements in guided generation performance would be very welcome. There is a helpful comment by @stas00 from last month with a nice summary of where things currently stand.

ksjadeja commented 3 weeks ago

> > > > Do we have plans to support #5540? We have a production-level use case and would really appreciate it if someone could look into it for Q4 onwards.

> > > It looks like LoRA is now supported. Are you encountering any issues?

> > Yes. The class in mixtral_quant.py does not have SupportsLoRA, which means LoRA is not supported for quantized Mixtral, whereas MixtralForCausalLM in mixtral.py does include SupportsLoRA. I have a trained LoRA adapter that I want to use on top of a mixtral-awq model without merging, directly as a hot swap. Let me know if you know a better way to tackle this situation.

> I'm guessing you explicitly set the quantization, right? If so, you can try removing that argument and test it out, like the following script:
>
> from vllm import LLM
> llm = LLM(
>     model="Mixtral-8x7B-Instruct-v0.1-GPTQ",
>     trust_remote_code=True,
>     gpu_memory_utilization=0.6,
>     enable_lora=True,
> )

Tried this, but it does not work; I get the same error. Just mentioning that I use an AWQ-quantized model.

[rank0]: ValueError: Model MixtralForCausalLM does not support LoRA, but LoRA is enabled. Support for this model may be added in the future. If this is important to you, please open an issue on github.

jeejeelee commented 3 weeks ago

> > > > > Do we have plans to support #5540? We have a production-level use case and would really appreciate it if someone could look into it for Q4 onwards.

> > > > It looks like LoRA is now supported. Are you encountering any issues?

> > > Yes. The class in mixtral_quant.py does not have SupportsLoRA, which means LoRA is not supported for quantized Mixtral, whereas MixtralForCausalLM in mixtral.py does include SupportsLoRA. I have a trained LoRA adapter that I want to use on top of a mixtral-awq model without merging, directly as a hot swap. Let me know if you know a better way to tackle this situation.

> > I'm guessing you explicitly set the quantization, right? If so, you can try removing that argument and test it out, like the following script:
> >
> > from vllm import LLM
> > llm = LLM(
> >     model="Mixtral-8x7B-Instruct-v0.1-GPTQ",
> >     trust_remote_code=True,
> >     gpu_memory_utilization=0.6,
> >     enable_lora=True,
> > )

> Tried this, but it does not work; I get the same error. Just mentioning that I use an AWQ-quantized model.
>
> [rank0]: ValueError: Model MixtralForCausalLM does not support LoRA, but LoRA is enabled. Support for this model may be added in the future. If this is important to you, please open an issue on github.

Which vLLM version are you using?

According to the code in https://github.com/vllm-project/vllm/blob/v0.6.3.post1/vllm/model_executor/model_loader/utils.py#L30, both GPTQ and AWQ quantization methods should be compatible when using version 0.6.3.post1.

Edenzzzz commented 1 week ago

Any interest in vAttention? https://github.com/vllm-project/vllm/issues/4675

niuzheng168 commented 1 week ago

More and more speech models use an LLM to predict non-text tokens. ChatTTS and FishTTS, for example, both use a Llama model to predict speech tokens. But unlike Llama for text, the speech Llama uses multiple lm_heads to predict more than one token in parallel, and therefore sums the n token embeddings when building the LLM input embedding. I am currently trying to get ChatTTS running with vLLM (see here), but a lot of code needs to be updated and the changes seem to break some fundamental design assumptions. So maybe you could consider supporting it officially. It would definitely have a bigger impact on speech solutions.
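A toy version of the pattern described above may help show why it does not fit a one-token-per-step engine. The sketch below assumes one embedding table and one lm_head per parallel token stream, uses greedy selection, and replaces the transformer layers with an identity stand-in; none of it is taken from ChatTTS, FishTTS, or vLLM, and all names and sizes are hypothetical.

```python
import random

HIDDEN, VOCAB, NUM_HEADS = 4, 6, 3


def make_matrix(rows, cols, rng):
    """Random rows x cols matrix as nested lists."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def build_toy_model(seed=0):
    """One embedding table and one lm_head (both VOCAB x HIDDEN) per stream."""
    rng = random.Random(seed)
    embed_tables = [make_matrix(VOCAB, HIDDEN, rng) for _ in range(NUM_HEADS)]
    lm_heads = [make_matrix(VOCAB, HIDDEN, rng) for _ in range(NUM_HEADS)]
    return embed_tables, lm_heads


def embed_step(tokens, embed_tables):
    """Sum the per-stream embeddings of one step's tokens into one input vector."""
    summed = [0.0] * HIDDEN
    for tok, table in zip(tokens, embed_tables):
        for d in range(HIDDEN):
            summed[d] += table[tok][d]
    return summed


def predict_step(hidden, lm_heads):
    """Apply every lm_head to the same hidden state; pick one token per head."""
    tokens = []
    for head in lm_heads:
        logits = [sum(w * h for w, h in zip(row, hidden)) for row in head]
        tokens.append(max(range(VOCAB), key=lambda i: logits[i]))
    return tokens


def generate(start_tokens, steps, embed_tables, lm_heads):
    """Each step consumes NUM_HEADS tokens and produces NUM_HEADS new ones."""
    tokens, outputs = list(start_tokens), []
    for _ in range(steps):
        hidden = embed_step(tokens, embed_tables)  # stand-in for the transformer
        tokens = predict_step(hidden, lm_heads)
        outputs.append(tokens)
    return outputs
```

The point of the sketch is the shape of the data: every decoding step carries a tuple of tokens in and out, whereas a text LLM carries a single token id per step.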

kentoym commented 1 week ago

> > > > Any plans for improving guided decoding? There's a long-standing RFC for it (#5423) and previous attempts have been made (e.g. #6273). Unfortunately, it seems to have been forgotten since. In particular, I'd love to see it become async (logit masks or biases can be calculated while the GPU is computing logits) and support fast-forwarding tokens when the next few tokens are deterministic.

> > > I second this. We are using vLLM to host our production inference servers, and all of our downstream applications rely on guided JSON decoding to ensure that output is parsable. There is a significant performance difference between guided and non-guided decoding, and any performance improvements would help increase throughput.

> > Hey, I maintain the guidance project and we worked on the first proposal in #6273. Looks like vLLM has changed significantly since then, but if there is appetite from the maintainers for upgraded/more performant guided decoding work, we're happy to take another look and investigate a new PR. In particular, guidance (and our high-performance Rust implementation in llguidance) already does async computations on CPU, calculates fast-forward tokens, etc., and is typically accelerative for JSON schema. @JC1DA @mmoskal

> Improvements in guided generation performance would be very welcome. There is a helpful comment by @stas00 from last month with a nice summary of where things currently stand.

Do we have plans to improve concurrency performance for guided decoding? Enabling guided_json for concurrent requests results in significant throughput and latency degradation. (#3567)

Enhancements in concurrency performance for guided decoding would greatly benefit high-volume, real-time applications.

Harsha-Nori commented 1 week ago

Quick update -- we've made an initial PR to support guidance as a backend, which does improve performance over current implementations (https://github.com/vllm-project/vllm/pull/10217). Of course, better support for concurrency in general would also help guidance get significantly faster. Happy to support there and help if we can too!

@JC1DA
