vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Roadmap] vLLM Roadmap Q3 2024 #5805

Open simon-mo opened 6 days ago

simon-mo commented 6 days ago

Anything you want to discuss about vllm.

This document includes the features in vLLM's roadmap for Q3 2024. Please feel free to discuss and contribute, as this roadmap is shaped by the vLLM community.

Themes.

As before, we categorized our roadmap into 6 broad themes:

Broad Model Support

Help wanted:

Hardware Support

Performance Optimizations

Production Features

Help wanted

OSS Community

Help wanted

Extensible Architecture


If any item you want is not on the roadmap, your suggestions and contributions are still welcome! Please feel free to comment in this thread, open a feature request, or create an RFC.

Jeffwan commented 6 days ago

Support multiple models in the same server

Does vLLM need multi-model support similar to what FastChat does, or something else?
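
For context, a common workaround today is to run one single-model vLLM OpenAI-compatible server per model (e.g. `python -m vllm.entrypoints.openai.api_server --model <name> --port <port>`) and put a thin router in front that dispatches on the request's `model` field, which is roughly what FastChat's controller does. A minimal sketch; the ports and model names below are illustrative, not part of vLLM:

```python
# Hypothetical router that forwards OpenAI-style requests to one of several
# single-model vLLM servers, keyed on the "model" field of the request body.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests

# Each model name maps to the base URL of a separate vLLM server process.
BACKENDS = {
    "meta-llama/Llama-3-8B-Instruct": "http://localhost:8001",
    "mistralai/Mistral-7B-Instruct-v0.3": "http://localhost:8002",
}


class Router(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        payload = json.loads(body)
        base = BACKENDS.get(payload.get("model"))
        if base is None:
            self.send_response(404)
            self.end_headers()
            self.wfile.write(b'{"error": "unknown model"}')
            return
        # Forward the request unchanged to the backend serving that model.
        resp = requests.post(base + self.path, json=payload, timeout=600)
        self.send_response(resp.status_code)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(resp.content)


if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8000), Router).serve_forever()
```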

CSEEduanyu commented 5 days ago

Hello, how about https://github.com/vllm-project/vllm/pull/2809?

jeejeelee commented 4 days ago

Hi, the issues mentioned in https://github.com/vllm-project/vllm/pull/5036 should also be taken into account.

MeJerry215 commented 3 days ago

Will vLLM rely more on Triton to optimize operator performance in the future, or will it lean more on the torch.compile mechanism?

Are there any plans for this?
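
For readers unfamiliar with the trade-off, the two approaches look roughly like this for a toy fused add + ReLU; this is purely illustrative and not taken from vLLM's kernels:

```python
# Toy comparison: a hand-written Triton kernel vs. letting torch.compile
# (TorchInductor) fuse the same elementwise op into one kernel.
import torch
import triton
import triton.language as tl


@triton.jit
def add_relu_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, tl.maximum(x + y, 0.0), mask=mask)


def add_relu_triton(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    add_relu_kernel[grid](x, y, out, n, BLOCK=1024)
    return out


@torch.compile  # Inductor fuses the add and the ReLU automatically.
def add_relu_compiled(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    return torch.relu(x + y)


if __name__ == "__main__":
    a = torch.randn(1 << 20, device="cuda")
    b = torch.randn(1 << 20, device="cuda")
    assert torch.allclose(add_relu_triton(a, b), add_relu_compiled(a, b))
```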

ashim-mahara commented 3 days ago

Hi! Is there, or will there be, support for the OpenAI Batch API?
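
For reference, the OpenAI Batch API consumes a JSONL file in which each line is a self-contained request with a `custom_id`, HTTP method, target URL, and request body; a minimal sketch of producing such an input file (the model name is just a placeholder):

```python
# Build an input file in the OpenAI Batch API's JSONL format:
# one request object per line.
import json

prompts = ["What is vLLM?", "Explain PagedAttention in one sentence."]

with open("batch_input.jsonl", "w") as f:
    for i, prompt in enumerate(prompts):
        request = {
            "custom_id": f"request-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "meta-llama/Llama-3-8B-Instruct",  # placeholder
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": 128,
            },
        }
        f.write(json.dumps(request) + "\n")
```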

huseinzol05 commented 2 days ago

I am working on Whisper in my fork at https://github.com/mesolitica/vllm-whisper. The frontend should later be compatible with the OpenAI API and able to stream output tokens. There are a few hiccups I am still trying to figure out, based on the T5 branch (https://github.com/vllm-project/vllm/blob/9f20ccf56b63b0b47e09069615e023287f1681f8/vllm/model_executor/layers/enc_dec_attention.py#L83):

  1. Still trying to figure out KV caching for the encoder hidden states; otherwise every decoding step recomputes them.
  2. There is no non-causal attention for the encoder or for cross-attention in the decoder; all attention implementations in vLLM seem to be causal-only.
  3. Reuse the cross-attention KV cache from the first decoding step for all subsequent steps (see the sketch after this list).
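
To illustrate points 1 and 3 (this is not code from the fork, just a plain-PyTorch sketch with stand-in `encoder`, `q_proj`, `k_proj`, and `v_proj` modules): the encoder output and the cross-attention key/value projections depend only on the audio input, so they can be computed once per request and reused at every decoding step.

```python
import torch


@torch.no_grad()
def precompute_cross_kv(encoder, k_proj, v_proj, audio_features):
    # Runs once per request; nothing here depends on decoded tokens.
    enc_out = encoder(audio_features)      # [batch, src_len, hidden]
    cross_k = k_proj(enc_out)              # cached K for cross-attention
    cross_v = v_proj(enc_out)              # cached V for cross-attention
    return cross_k, cross_v


@torch.no_grad()
def cross_attention_step(q_proj, decoder_hidden, cross_k, cross_v):
    # Per decoding step: only the query changes; K/V come from the cache,
    # and no causal mask is applied because the decoder may attend to all
    # encoder positions (point 2 above).
    q = q_proj(decoder_hidden)             # [batch, 1, hidden]
    scores = q @ cross_k.transpose(-1, -2) / q.shape[-1] ** 0.5
    attn = torch.softmax(scores, dim=-1)
    return attn @ cross_v
```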

huseinzol05 commented 2 days ago

Able to load and run inference with https://github.com/mesolitica/vllm-whisper/blob/main/examples/whisper_example.py, but the output is still garbage; it might be a bug related to the weights or the attention. Still debugging.