vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs

[Doc]: Is Qwen2.5's long context YARN handled? #8793

Open pseudotensor opened 1 month ago

pseudotensor commented 1 month ago

📚 The doc issue

https://huggingface.co/Qwen/Qwen2.5-72B-Instruct#processing-long-texts

The Qwen2.5 model card above describes how to enable YaRN for long contexts, but when starting vLLM like this:

docker run -d --restart=always \
    --runtime=nvidia \
    --gpus '"device=4,5,6,7"' \
    --shm-size=10.24gb \
    -p 5001:5001 \
    -e NCCL_IGNORE_DISABLED_P2P=1 \
    -v /etc/passwd:/etc/passwd:ro \
    -v /etc/group:/etc/group:ro \
    -u `id -u`:`id -g` \
    -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
    -v "${HOME}"/.cache:$HOME/.cache/ \
    -v "${HOME}"/.config:$HOME/.config/ \
    -v "${HOME}"/.triton:$HOME/.triton/ \
    --network host \
    --name qwen25_72b \
    vllm/vllm-openai:latest \
        --port=5001 \
        --host=0.0.0.0 \
        --model=Qwen/Qwen2.5-72B-Instruct \
        --tensor-parallel-size=4 \
        --seed 1234 \
        --trust-remote-code \
        --max-model-len=131072 \
        --max-num-batched-tokens 131072 \
        --max-log-len=100 \
        --download-dir=$HOME/.cache/huggingface/hub &>> logs.vllm_server.qwen25_72b.txt

I get this failure:

ValueError: User-specified max_model_len (131072) is greater than the derived max_model_len (max_position_embeddings=32768 or model_max_length=None in model's config.json). This may lead to incorrect model outputs or CUDA errors. To allow overriding this maximum, set the env var VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
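
For completeness, the override mentioned in the error is an environment variable, so with the docker setup above it would be passed as one more -e flag before the image name. A minimal sketch (this only lifts the length check; it does not by itself configure YaRN):

# added alongside the other -e flags in the docker run command above
-e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \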

Suggest a potential alternative/fix

Unsure whether this is supposed to be supported or not.


pseudotensor commented 1 month ago

This suggests that it's supported, but it doesn't seem to be. It also seems vLLM only supports static YaRN, which isn't good in general:

> For deployment, we recommend using vLLM. Please refer to our [Documentation](https://qwen.readthedocs.io/en/latest/deployment/vllm.html) for usage if you are not familiar with vLLM. Presently, vLLM only supports static YARN, which means the scaling factor remains constant regardless of input length, potentially impacting performance on shorter texts. We advise adding the rope_scaling configuration only when processing long contexts is required.
pseudotensor commented 1 month ago

I guess the answer is no: dynamic YaRN is not supported and the default rope_scaling is ignored. So please convert this issue into a feature request.

jeejeelee commented 1 month ago

see: https://huggingface.co/Qwen/Qwen2.5-72B-Instruct#processing-long-texts

pseudotensor commented 1 month ago

@jeejeelee You shared the same link I already shared. The issue can be converted to a feature request.

jeejeelee commented 1 month ago

Have you already tried adding the YaRN config in the following way?

For supported frameworks, you could add the following to config.json to enable YaRN:

{
  ...,
  "rope_scaling": {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
  }
}
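
A rough shell sketch of applying that edit to the locally cached copy of the model's config.json, assuming the model has already been downloaded to the --download-dir used above (the snapshot path is illustrative and will differ on your machine):

# Hypothetical path: adjust the snapshot hash to match your local cache.
CFG=$HOME/.cache/huggingface/hub/models--Qwen--Qwen2.5-72B-Instruct/snapshots/<snapshot>/config.json
# Add the YaRN rope_scaling block from the Qwen model card.
jq '.rope_scaling = {"factor": 4.0, "original_max_position_embeddings": 32768, "type": "yarn"}' \
  "$CFG" > /tmp/config.json && mv /tmp/config.json "$CFG"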
pseudotensor commented 1 month ago

No, because they say that vLLM only supports static YaRN, which is bad for short-context cases. I'm asking whether this issue can be converted into a feature request to support dynamic YaRN so that this model is supported properly/fully.

jeejeelee commented 1 month ago

We also care about Qwen's long-context ability and performance, so we ran tests, and the impact appears to be very minimal. [benchmark results image] If any issues exist with these tests, please let me know.

ericg108 commented 1 month ago

@jeejeelee long text means input len is longer than 32K :) can you check that? thanks

jeejeelee commented 1 month ago

> @jeejeelee long text means input len is longer than 32K :) can you check that? thanks

I just wanted to verify the claim that vLLM only supporting static YaRN is bad for short-context cases.

tszdanger commented 2 weeks ago

Supporting YaRN means allowing inputs longer than 32K, and we observe a huge difference there.

K-Mistele commented 1 week ago

Is there a way to override RoPE / YaRN config parameters from config.json at startup time? It seems like this would be a desirable feature to have, both for Qwen 2.5 and for other model families like Llama 3.0, which scale very well with RoPE but currently require digging config.json out of your Hugging Face cache directory.

If not, I'll be happy to open a feature request and/or implement this; I just want to make sure I'm not duplicating work that's already been done.
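
For reference, recent vLLM versions expose a --rope-scaling engine argument that accepts a JSON string, which may already cover part of this; whether it fully handles the Qwen2.5 YaRN case, and which key names the running version expects, is not confirmed in this thread. A hedged sketch, mirroring the rope_scaling block from the Qwen model card and added among the server arguments in the docker command above:

# Assumes the --rope-scaling flag is available in the vllm/vllm-openai image in use.
--rope-scaling '{"factor": 4.0, "original_max_position_embeddings": 32768, "type": "yarn"}' \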