vllm-project/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0
30.64k stars · 4.65k forks
Issues (newest first)
#10572 [Model] Fix Baichuan BNB online quantization · CNTRYROA · opened 4 hours ago · 1 comment
#10570 [V1] Refactor model executable interface for multimodal models · ywang96 · opened 4 hours ago · 1 comment
#10569 [Bug]: llama-3.2-11B-vision run in vllm==0.6.3 OOM error (L20) · Jamrainbow · opened 4 hours ago · 2 comments
#10568 [Bugfix] Fix Baichuan BNB online quantization · CNTRYROA · closed 4 hours ago · 1 comment
#10567 [Bugfix] 500 Internal Server Error when tool_choice is incorrect · shenoyvvarun · opened 4 hours ago · 1 comment
#10566 [Doc]: Docker+vllm+fastchat deploys multimodal large model Qwen2-vl-7b-instruct · Aanlifire · opened 5 hours ago · 0 comments
#10565 [Hardware][Intel-Gaudi] Enable LoRA support for Intel Gaudi (HPU) · SanjuCSudhakaran · opened 5 hours ago · 2 comments
#10564 [V1] EngineCore supports profiling · Abatom · opened 6 hours ago · 1 comment
#10563 [Minor] Fix line-too-long · WoosukKwon · closed 8 hours ago · 1 comment
#10562 [Feature]: How to run speculative models with tensor parallelism? · cxxuser · opened 8 hours ago · 1 comment
#10561 [Model] Added GLM-4 series model support vllm==0.6.4 · sixsixcoder · opened 9 hours ago · 1 comment
#10560 add cleaned multi memory support · ClarkChin08 · closed 10 hours ago · 1 comment
#10559 [Bug]: Speculative Decoding without enabling eager mode returns gibberish output after some tokens · andoorve · opened 11 hours ago · 0 comments
#10558 [torch.compile] support all attention backends · youkaichao · opened 11 hours ago · 1 comment
#10557 [Benchmark] Benchmark structured output with datasets · xuechendi · opened 11 hours ago · 1 comment
#10556 [Usage]: Docker w/ CPU fails when defining VLLM_CPU_OMP_THREADS_BIND · ccruttjr · opened 12 hours ago · 5 comments
#10555 [platforms] absorb worker cls difference into platforms folder · youkaichao · closed 7 hours ago · 7 comments
#10554 [Docs] Add dedicated tool calling page to docs · mgoin · opened 13 hours ago · 2 comments
#10553 [misc] improve error message · youkaichao · closed 14 hours ago · 1 comment
#10552 [9/N] torch.compile LLM usage · youkaichao · closed 8 hours ago · 1 comment
#10551 Remove token-adding chat embedding params · noamgat · closed 4 hours ago · 3 comments
#10550 Add small example to metrics.rst · mgoin · closed 12 hours ago · 1 comment
#10549 support bitsandbytes quantization with qwen model · zixuanzhang226 · opened 16 hours ago · 1 comment
#10548 [Usage]: Can we extend the context length of gemma2 model or other models? · hahmad2008 · opened 16 hours ago · 2 comments
#10547 [Benchmark] Add new H100 machine · simon-mo · closed 9 hours ago · 1 comment
#10546 [Misc] Enable vLLM to Dynamically Load LoRA from a Remote Server · angkywilliam · opened 16 hours ago · 6 comments
#10545 [Minor] Revert change in offline inference example · WoosukKwon · closed 14 hours ago · 1 comment
#10544 Update default max_num_batch_tokens for chunked prefill to 2048 · mgoin · opened 19 hours ago · 2 comments
#10543 [do-not-merge] Ibm 20241121 · fialhocoelho · closed 20 hours ago · 2 comments
#10542 [Distributed] Tensor Parallel RMSNorm · tlrmchlsmth · opened 20 hours ago · 2 comments
#10541 [Bugfix][Hardware][CPU] Fix `multi_modal_kwargs` broadcast for CPU tensor parallel · Isotr0py · opened 20 hours ago · 2 comments
#10540 [Installation]: can't get the cu118 version of vllm 0.6.3 by https://github.com/vllm-project/vllm/releases/download/v0.6.3/vllm-0.6.3+cu118-cp310-cp310-manylinux1_x86_64.whl · mayfool · opened 21 hours ago · 0 comments
#10539 [Feature]: Support for Registering Model-Specific Default Sampling Parameters · yansh97 · opened 21 hours ago · 1 comment
#10538 For ppc64le, disabled tests for now and addressed space issues · npanpaliya · opened 22 hours ago · 2 comments
#10537 [Usage]: How to use ROPE scaling for llama3.1 and gemma2? · hahmad2008 · opened 22 hours ago · 2 comments
#10536 [Bugfix] Allow token ID-only inputs in Qwen2-Audio · DarkLight1337 · closed 17 hours ago · 1 comment
#10535 [CI][Installation] Avoid uploading CUDA 11.8 wheel · cermeng · closed 15 hours ago · 6 comments
#10534 [Usage]: Fail to load params.json · dequeueing · opened 23 hours ago · 3 comments
#10533 [Bug]: vllm failed to run two instance with one gpu · pandada8 · closed 23 hours ago · 3 comments
#10532 Add Sageattention backend · flozi00 · opened 1 day ago · 3 comments
#10531 [Bug]: Authorization ignored when root_path is set · OskarLiew · opened 1 day ago · 3 comments
#10530 [Misc] Suppress duplicated logging regarding multimodal input pipeline · ywang96 · closed 18 hours ago · 5 comments
#10529 [8/N] enable cli flag without a space · youkaichao · closed 15 hours ago · 2 comments
#10528 [V1] Fix Compilation config & Enable CUDA graph by default · WoosukKwon · closed 15 hours ago · 1 comment
#10527 [Usage]: Optimizing TTFT for Qwen2.5-72B Model Deployment on A800 GPUs for RAG Application · zhanghx0905 · opened 1 day ago · 2 comments
#10526 [Feature]: Additional possible value for `tool_choice`: `required` · fahadh4ilyas · opened 1 day ago · 1 comment
#10525 [Bug]: Gemma2 becomes a fool · Foreist · opened 1 day ago · 5 comments
#10524 fix the issue that len(tokenizer(prompt)["input_ids"]) > prompt_len · sywangyi · closed 1 day ago · 7 comments
#10523 [Bug]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.1 · QualityGN · opened 1 day ago · 0 comments
#10522 [Kernel] Register punica ops directly · jeejeelee · closed 18 hours ago · 2 comments