vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: Manually Increasing inference time #9274

Open Playerrrrr opened 2 weeks ago

Playerrrrr commented 2 weeks ago

Your current environment

The output of `python collect_env.py`

How would you like to use vllm

I am currently running Qwen2.5-72B-Instruct on a DGX PCIe server with vLLM as the inference engine. Inspired by Noam Brown's ideas on how o1 scales inference-time compute, I have been wondering whether it is possible to manually increase the inference time of the Qwen2.5 model running on my server. Thanks in advance to all the kind community members of vLLM. :D To infinite(y) and beyond compute!

Before submitting a new issue...

noooop commented 2 weeks ago

PTAL #8633

(I guess you don't really want to physically slow down inference.)
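
If the goal is to spend more compute at inference time rather than literally slow the engine down, one common pattern that works with vLLM as-is is best-of-n sampling: generate several candidates per prompt and pick the best with an external scorer. A minimal sketch, assuming the Qwen/Qwen2.5-72B-Instruct checkpoint and an 8-GPU tensor-parallel setup (adjust to your DGX configuration); the scorer itself is not shown:

```python
# Sketch: scaling test-time compute via best-of-n sampling with vLLM's offline API.
from vllm import LLM, SamplingParams

# Assumes 8 GPUs; set tensor_parallel_size to match your server.
llm = LLM(model="Qwen/Qwen2.5-72B-Instruct", tensor_parallel_size=8)

# Sample several candidate completions per prompt; raising n spends more inference-time compute.
params = SamplingParams(n=8, temperature=0.8, top_p=0.95, max_tokens=1024)

outputs = llm.generate(["Solve: what is 17 * 24?"], params)
for request_output in outputs:
    for candidate in request_output.outputs:
        print(candidate.text)

# A separate verifier or reward model (not shown) would then select the best candidate.
```
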