vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Doc]: Is Qwen2.5's long context YARN handled? #8793

Open pseudotensor opened 2 months ago

pseudotensor commented 2 months ago

📚 The doc issue

https://huggingface.co/Qwen/Qwen2.5-72B-Instruct#processing-long-texts

But when starting vLLM like this:

docker run -d --restart=always \
    --runtime=nvidia \
    --gpus '"device=4,5,6,7"' \
    --shm-size=10.24gb \
    -p 5001:5001 \
        -e NCCL_IGNORE_DISABLED_P2P=1 \
    -v /etc/passwd:/etc/passwd:ro \
    -v /etc/group:/etc/group:ro \
    -u `id -u`:`id -g` \
    -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
    -v "${HOME}"/.cache:$HOME/.cache/ -v "${HOME}"/.config:$HOME/.config/   -v "${HOME}"/.triton:$HOME/.triton/  \
    --network host \
    --name qwen25_72b \
      vllm/vllm-openai:latest \
        --port=5001 \
        --host=0.0.0.0 \
        --model=Qwen/Qwen2.5-72B-Instruct \
        --tensor-parallel-size=4 \
        --seed 1234 \
        --trust-remote-code \
        --max-model-len=131072 \
        --max-num-batched-tokens 131072 \
        --max-log-len=100 \
        --download-dir=$HOME/.cache/huggingface/hub &>> logs.vllm_server.qwen25_72b.txt

I get failure:

ValueError: User-specified max_model_len (131072) is greater than the derived max_model_len (max_position_embeddings=32768 or model_max_length=None in model's config.json). This may lead to incorrect model outputs or CUDA errors. To allow overriding this maximum, set the env var VLLM_ALLOW_LONG_MAX_MODEL_LEN=1

Suggest a potential alternative/fix

Unsure if supposed to be supported or not.
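
One workaround named by the error message itself is to set VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 in the container environment. A minimal sketch (most of the mount/network flags from the command above are omitted for brevity); note that this only disables the length check and does not add any YaRN scaling by itself:

# Same launch as above, plus the override named in the error message.
# VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 skips the derived-max_model_len check;
# output quality past 32k still depends on the rope_scaling config.
docker run -d --restart=always \
    --runtime=nvidia \
    --gpus '"device=4,5,6,7"' \
    -e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
    -p 5001:5001 \
    --name qwen25_72b \
      vllm/vllm-openai:latest \
        --port=5001 \
        --host=0.0.0.0 \
        --model=Qwen/Qwen2.5-72B-Instruct \
        --tensor-parallel-size=4 \
        --max-model-len=131072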


pseudotensor commented 2 months ago

This suggests that it's supported, but it doesn't seem to be. It also seems vLLM only supports static YaRN, which isn't good in general.

> For deployment, we recommend using vLLM. Please refer to our [Documentation](https://qwen.readthedocs.io/en/latest/deployment/vllm.html) for usage if you are not familiar with vLLM. Presently, vLLM only supports static YARN, which means the scaling factor remains constant regardless of input length, potentially impacting performance on shorter texts. We advise adding the rope_scaling configuration only when processing long contexts is required.

pseudotensor commented 2 months ago

I guess the answer is no: dynamic YaRN is not supported and the default rope_scaling is ignored. So please convert this issue into a feature request.

jeejeelee commented 2 months ago

see: https://huggingface.co/Qwen/Qwen2.5-72B-Instruct#processing-long-texts

pseudotensor commented 2 months ago

@jeejeelee You shared the same link I already shared. The issue can be converted to a feature request.

jeejeelee commented 2 months ago

Have you already tried adding the YaRN config in the following way?

For supported frameworks, you could add the following to config.json to enable YaRN:

{
  ...,
  "rope_scaling": {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
  }
}
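
For reference, the factor follows from desired context / native context: 131072 / 32768 = 4.0, which matches the --max-model-len requested above. A rough sketch of patching the cached config.json before starting the server; the HF hub cache path layout is an assumption, so adjust it if your cache lives elsewhere:

# Locate the cached config.json (assumes the standard HF hub cache layout under ~/.cache).
CFG=$(ls ~/.cache/huggingface/hub/models--Qwen--Qwen2.5-72B-Instruct/snapshots/*/config.json | head -n 1)

# Add the YaRN rope_scaling block from the model card (factor 4.0 = 131072 / 32768).
jq '.rope_scaling = {"factor": 4.0, "original_max_position_embeddings": 32768, "type": "yarn"}' \
    "$CFG" > "$CFG.tmp" && mv "$CFG.tmp" "$CFG"
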
pseudotensor commented 2 months ago

No, because they say that vLLM only supports static YaRN, which is bad for short-context cases. I'm asking whether this issue can be converted into a feature request to support dynamic YaRN so this model is supported properly/fully.

jeejeelee commented 2 months ago

We also care about Qwen's long-context ability and performance, so we conducted tests, and the impact seems to be very minimal (see the attached benchmark image). If any issues exist with these tests, please let me know.

ericg108 commented 2 months ago

@jeejeelee long text means input len is longer than 32K :) can you check that? thanks

jeejeelee commented 2 months ago

> @jeejeelee long text means input len is longer than 32K :) can you check that? thanks

I just wanted to verify the claim that vLLM's static YaRN is bad for short-context cases.

tszdanger commented 1 month ago

Supporting YaRN means allowing input longer than 32K, and we observe there is a huge difference between the two.

K-Mistele commented 1 month ago

Is there a way to override RoPE / YaRN config parameters from config.json at startup time? This seems like a desirable feature to have, both for Qwen 2.5 and for other model families like Llama 3.0, which scales very well with RoPE but currently requires digging config.json out of your Hugging Face cache directory.

If not, I will be happy to open a feature request and/or implement this, I just want to make sure that I'm not duplicating work that's already been done.
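
As a data point: recent vLLM releases appear to expose a --rope-scaling engine argument that takes a JSON string, which would cover exactly this; whether your installed version has it is worth confirming via --help before relying on it. A hypothetical invocation, assuming the flag is available:

# Hypothetical: pass the model card's YaRN settings at startup instead of editing config.json.
# Confirm that --rope-scaling exists in your vLLM version (e.g. run the image with --help).
docker run --rm --runtime=nvidia --gpus all \
      vllm/vllm-openai:latest \
        --model=Qwen/Qwen2.5-72B-Instruct \
        --tensor-parallel-size=4 \
        --max-model-len=131072 \
        --rope-scaling='{"factor": 4.0, "original_max_position_embeddings": 32768, "type": "yarn"}'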

robbiemu commented 2 weeks ago

> We also care about Qwen's long-context ability and performance, so we conducted tests, and the impact seems to be very minimal (see the attached benchmark image). If any issues exist with these tests, please let me know.

The doubt I have is that the test only goes up to 8k (if I am reading that right), instead of the 32k native context size. And this only shows efficiency; you didn't generate perplexity-difference numbers for quality (like in quantization, where I would report $\ln(\mathrm{PPL}(\text{YaRN}) / \mathrm{PPL}(\text{no YaRN}))$)?
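
To make that metric concrete, a tiny sketch with placeholder perplexity values (the real numbers would have to come from evaluating the same data with and without the YaRN rope_scaling):

# Placeholder perplexities; substitute measured values for the two configurations.
PPL_YARN=6.41
PPL_NO_YARN=6.25
awk -v a="$PPL_YARN" -v b="$PPL_NO_YARN" 'BEGIN { print log(a / b) }'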

edit: actually, am I reading that right? the batch size is 8k but the μbatch size is still 1k, so you're never doing a feed forward on more than 1k at a time, right?