Closed by zhuohan123 7 months ago
Is it possible to support MLX for running inference on Mac devices? That would simplify local development as well as running in the cloud.
As mentioned in #2643, it would be awesome to have the vLLM /completions & /chat/completions endpoints both support logprobs, so that lm-eval-harness can be run against them.
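For illustration, a minimal sketch of what such a request could look like against an OpenAI-compatible server; the host/port, model name, and exact field support are assumptions here, not the final vLLM behavior:

```python
# Hypothetical sketch: asking an OpenAI-style /v1/completions endpoint for
# per-token logprobs, which is what lm-eval-harness consumes.
# Assumes a local server is already running; host/port and model name are placeholders.
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "my-model",               # placeholder model name
        "prompt": "The capital of France is",
        "max_tokens": 1,
        "logprobs": 5,                     # return top-5 candidate logprobs per position
        "echo": True,                      # also return logprobs for the prompt tokens
    },
)
print(resp.json()["choices"][0]["logprobs"])
```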
Please pay attention to "Evaluation of Accelerated and Non-Accelerated Large Model Output"; it is very important to make sure the accelerated and non-accelerated outputs are always the same.
Agree 100%, the ability to use lm-eval-harness is very much needed.
https://github.com/vllm-project/vllm/issues/2573 ("Optimize the performance of the API server") talks about optimizing the API server.
Please support the ARM aarch64 architecture.
https://github.com/vllm-project/vllm/issues/1253
Please consider supporting StreamingLLM.
Any update on PEFT?
Please consider supporting Hugging Face PEFT, thank you. https://github.com/vllm-project/vllm/issues/1129
Would you consider adding support for earlier ROCm versions, e.g. 5.6.1? Thank you!
If possible, EXL2 support would be appreciated, thank you <3
Also, the ability to use Guidance/Outlines via logit_bias! And +1 to EXL2 support. A sketch of that usage follows below.
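As a rough sketch of what this would enable, here is an OpenAI-style request using logit_bias to steer generation, which is the mechanism tools like Guidance/Outlines can build on; the server address, model name, and token ids are placeholders:

```python
# Hypothetical sketch: constraining output via logit_bias (OpenAI-style schema).
# Token ids depend on the served model's tokenizer; the ones below are placeholders.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "my-model",                                       # placeholder
        "messages": [{"role": "user", "content": "Answer yes or no: is the sky blue?"}],
        "max_tokens": 1,
        # token-id -> bias in [-100, 100]; +100 effectively forces a token, -100 bans it.
        "logit_bias": {"9891": 100, "2201": 100},                  # ids for " yes"/" no" (placeholders)
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```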
Please support W8A8 quantization.
Let's migrate our discussion to #3861
This document includes the features in vLLM's roadmap for Q1 2024. Please feel free to discuss and contribute to the specific features at related RFC/Issues/PRs and add anything else you'd like to talk about in this issue.
In the future, we will publish our roadmap quarterly and deprecate our old roadmap (#244).
torch.compile support
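For reference only, this is what torch.compile looks like in plain PyTorch 2.x; how (or whether) vLLM wires this into its model runner is exactly what this roadmap item is about:

```python
# Illustrative sketch of torch.compile, not vLLM's integration.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 1024), nn.GELU(), nn.Linear(1024, 1024))
compiled = torch.compile(model)        # graph capture + codegen happens lazily on first call
x = torch.randn(8, 1024)
with torch.no_grad():
    y = compiled(x)                    # subsequent calls reuse the compiled graph
print(y.shape)
```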