vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: Manually Increasing inference time #9274

Open Playerrrrr opened 1 month ago

Playerrrrr commented 1 month ago

Your current environment

The output of `python collect_env.py`

How would you like to use vllm

I am currently running Qwen2.5-72B-Instruct on a DGX PCIe server with vLLM as the inference engine. Inspired by Noam Brown's ideas on how scaling inference-time compute led to o1, I've been wondering whether it's possible to manually increase the inference time of the Qwen2.5 model running on my server. Thanks in advance to all the kind community members of vLLM. :D To infinite(y) and beyond compute!
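
To make the question concrete, here is a minimal sketch of the kind of thing I have in mind, using vLLM's offline `LLM`/`SamplingParams` API to spend more compute per prompt by sampling several candidates and allowing longer generations. The model name, `tensor_parallel_size`, and the final selection step are just placeholders for illustration, not a recommendation:

```python
from vllm import LLM, SamplingParams

# Load the model; tensor_parallel_size=4 is only an example for a multi-GPU box.
llm = LLM(model="Qwen/Qwen2.5-72B-Instruct", tensor_parallel_size=4)

# Spend more inference-time compute per prompt: sample several candidates
# and allow long generations for step-by-step reasoning.
params = SamplingParams(n=8, temperature=0.8, top_p=0.95, max_tokens=2048)

prompt = "Solve step by step: what is 17 * 24?"
outputs = llm.generate([prompt], params)

# Placeholder selection step: pick the longest completion.
# In practice the candidates would be ranked by a verifier or reward model.
candidates = [o.text for o in outputs[0].outputs]
best = max(candidates, key=len)
print(best)
```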

Before submitting a new issue...

noooop commented 1 month ago

PTAL #8633

(I guess you don't really want to slow down inference physically.)