vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

v0.5.2, v0.5.3, v0.6.0 Release Tracker #6434

Closed · simon-mo closed this 2 weeks ago

simon-mo commented 1 month ago

Anything you want to discuss about vllm.

We will make a triplet of releases in the following 3 weeks.

Blockers

~The reason for this pace is that we want to remove beam search (#6226), which unlocks a suite of scheduler refactorings to enhance performance (for example, async scheduling to overlap scheduling with the forward pass). We want to release v0.5.2 ASAP to issue deprecation warnings and uncover new signals. Then we will decide on the removal in v0.6.0. Normally we would deprecate slowly, stretching it over a month or two. However, (1) the RFC has been open for a while, and (2) it is unfortunately on the critical path of refactoring and performance enhancements.~
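
For context, the "issue warnings" step could look roughly like the sketch below. This is a minimal illustration, not vLLM's actual code; the `use_beam_search` attribute and the helper name are assumptions about how such a check might be wired in.

```python
import warnings


def warn_if_beam_search(sampling_params) -> None:
    """Hypothetical helper: emit a deprecation warning when beam search is
    requested, giving users an early signal before removal in v0.6.0."""
    if getattr(sampling_params, "use_beam_search", False):
        warnings.warn(
            "Beam search is deprecated and planned for removal in v0.6.0 "
            "(see RFC #6226). Please switch to sampling-based decoding.",
            DeprecationWarning,
            stacklevel=2,
        )
```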

Please also feel free to add release blockers. But do keep in mind that I will not slow down the v0.5.* release series unless there is a critical bug.

WoosukKwon commented 1 month ago

July 23rd is Tuesday. Do you mean July 24th?

simon-mo commented 1 month ago

v0.5.2 has been released: https://github.com/vllm-project/vllm/releases/tag/v0.5.2

sasha0552 commented 1 month ago

Hello. Can #4409 be included in one of the next releases? Or, at the very least, could I get an explanation of why it can't be included (maybe I can help in some way)?

Once the wheel size limit increase is approved and #6394 is merged, wheel size should no longer be an issue.

I am currently waiting for PyPI staff to approve the wheel size limit increase request to publish the patched triton to PyPI (pypi/support#4295).
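
As a side note, a quick way to sanity-check built wheel sizes locally before uploading is something like the sketch below. The 400 MiB figure is an assumption for illustration; the actual approved limit may differ.

```python
# Minimal sketch (not part of vLLM): report built wheel sizes against an
# assumed PyPI size limit before attempting an upload.
from pathlib import Path

LIMIT_MIB = 400  # assumed limit for illustration only

for wheel in sorted(Path("dist").glob("*.whl")):
    size_mib = wheel.stat().st_size / (1024 * 1024)
    status = "OK" if size_mib <= LIMIT_MIB else "TOO LARGE"
    print(f"{wheel.name}: {size_mib:.1f} MiB [{status}]")
```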

It would be nice to see support for Pascal GPUs in vLLM. Many people use them because they are cheap.
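
For readers wondering why Pascal is special-cased: Pascal cards report CUDA compute capability 6.x, below the 7.0 floor that vLLM documents as its minimum. A minimal sketch of such a capability gate follows; the names and structure are assumptions, not vLLM's actual startup check.

```python
# Minimal sketch, assuming a torch-based capability check similar in spirit
# to what an inference engine might do at startup; not vLLM's actual code.
import torch

MIN_CAPABILITY = (7, 0)  # documented minimum; Pascal cards report (6, x)


def device_is_supported(device_index: int = 0) -> bool:
    major, minor = torch.cuda.get_device_capability(device_index)
    return (major, minor) >= MIN_CAPABILITY


if __name__ == "__main__":
    if torch.cuda.is_available():
        print("supported" if device_is_supported() else "unsupported (pre-Volta)")
```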

simon-mo commented 1 month ago

Hi @sasha0552,

Thank you for bringing this up. For now, would you mind maintaining this in your fork? There are a few reasons why we are hesitant to include support for Pascal GPUs:

AlphaINF commented 1 month ago

Can this PR be added in v0.5.3? https://github.com/vllm-project/vllm/pull/5036

simon-mo commented 1 month ago

@AlphaINF Unlikely, given the current state of the PR (still being reviewed). But I'm very much looking forward to this PR as well!

AlphaINF commented 1 month ago

@simon-mo thanks!

bohr commented 1 month ago

@simon-mo do we have a plan for the "async scheduling to overlap scheduling with the forward pass" work?
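
For readers unfamiliar with the idea: the goal is to prepare step N+1's schedule on the CPU while step N's forward pass is still running, hiding scheduling latency. A toy, purely illustrative sketch (structure and names are assumptions, not vLLM's design):

```python
import asyncio


async def schedule_next_step() -> list[int]:
    # Stand-in for scheduler work (CPU-bound in practice).
    await asyncio.sleep(0.001)
    return [0, 1, 2]  # pretend these are the sequence ids for the next step


async def forward_pass(batch: list[int]) -> None:
    # Stand-in for the model forward pass (GPU-bound in practice).
    await asyncio.sleep(0.010)


async def engine_step(batch: list[int]) -> list[int]:
    # Overlap: the schedule for step N+1 is ready by the time step N's
    # forward pass completes.
    next_batch, _ = await asyncio.gather(schedule_next_step(), forward_pass(batch))
    return next_batch


if __name__ == "__main__":
    print(asyncio.run(engine_step([0, 1])))
```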

AlphaINF commented 2 weeks ago

Hello, when will v0.6.0 be released? I'm looking forward to https://github.com/vllm-project/vllm/pull/5036 and MiniCPM-Llama3-V-2_5 support.

vrdn-23 commented 2 weeks ago

Would it be possible to get #6594 merged in before the next release is due? @joerunde @Yard1