pseudotensor opened this issue 2 months ago
This suggests that it's supported, but it doesn't seem to be. It also seems that vLLM only supports static YaRN, which isn't good in general.
> For deployment, we recommend using vLLM. Please refer to our [Documentation](https://qwen.readthedocs.io/en/latest/deployment/vllm.html) for usage if you are not familiar with vLLM. Presently, vLLM only supports static YaRN, which means the scaling factor remains constant regardless of input length, potentially impacting performance on shorter texts. We advise adding the rope_scaling configuration only when processing long contexts is required.
I guess the answer is no: dynamic YaRN is not supported and the default rope_scaling is ignored. So please convert this issue into a feature request.
@jeejeelee You shared the same link I already shared. The issue can be converted to a feature request.
Have you already tried adding the YaRN config in the following way?
For supported frameworks, you could add the following to config.json to enable YaRN:
```json
{
  ...,
  "rope_scaling": {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
  }
}
```
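In case it helps, here is a minimal sketch of patching a local copy of config.json with that block before serving; the path is hypothetical and the values simply mirror the snippet above:

```python
import json
from pathlib import Path

# Hypothetical local path to the model snapshot; adjust to wherever config.json lives.
config_path = Path("Qwen2.5-72B-Instruct/config.json")

config = json.loads(config_path.read_text())

# Static YaRN: a factor of 4.0 over the native 32768 positions extends the
# usable context to roughly 4.0 * 32768 = 131072 tokens.
config["rope_scaling"] = {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn",
}

config_path.write_text(json.dumps(config, indent=2))
```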
No, because they say that vLLM only supports static YaRN, which is bad for short-context cases. I'm asking if this issue can be converted to a feature request for dynamic YaRN so that this model is supported properly/fully.
We also care about Qwen's long-context ability and performance, so we conducted tests, and it seems the impact is very minimal. If any issues exist with these tests, please let me know.
@jeejeelee Long text means the input length is longer than 32K :) Can you check that? Thanks.
I just want to verify that vLLM only supports static YaRN, which is bad for short-context cases.
Supporting YaRN means allowing inputs longer than 32K, and for those lengths we observe a huge difference between running with and without it.
Is there a way to override the RoPE / YaRN config parameters from config.json at startup time? It seems like this would be a desirable feature to have, both for Qwen 2.5 and for other model families such as Llama 3.0, which RoPE-scales very well but currently requires digging config.json out of your Hugging Face cache directory.
If not, I will be happy to open a feature request and/or implement this; I just want to make sure that I'm not duplicating work that's already been done.
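For what it's worth, here is a sketch of what such an override could look like on the vLLM side, assuming the `hf_overrides` engine argument (present in recent vLLM versions) accepts a `rope_scaling` entry; I haven't confirmed it covers every field, so treat this as a starting point rather than a confirmed API:

```python
from vllm import LLM

# Sketch only: hf_overrides is merged into the Hugging Face config before the
# model is built, so rope_scaling would not need to be edited in the cached
# config.json. Availability and exact behavior depend on the vLLM version.
llm = LLM(
    model="Qwen/Qwen2.5-72B-Instruct",
    max_model_len=131072,
    hf_overrides={
        "rope_scaling": {
            "factor": 4.0,
            "original_max_position_embeddings": 32768,
            "type": "yarn",
        }
    },
)
```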
> We also care about Qwen's long-context ability and performance, so we conducted tests, and it seems the impact is very minimal. If any issues exist with these tests, please let me know.
The doubt I have is that the test only goes up to 8k (if I am reading that right) instead of the 32k native context size. But that only shows efficiency; you didn't generate PPL-difference numbers for performance (the kind I would generate for quantization, which in this case would be $\ln\left(\mathrm{PPL}(\text{yarn}) / \mathrm{PPL}(\text{no yarn})\right)$)?
Edit: actually, am I reading that right? The batch size is 8k but the μbatch size is still 1k, so you're never doing a feed-forward on more than 1k tokens at a time, right?
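To make the metric concrete, here is a small sketch of how I'd compute that log-perplexity ratio from per-token negative log-likelihoods; the numbers are made up for illustration, not measurements:

```python
import math

def perplexity(nlls: list[float]) -> float:
    """Perplexity from per-token negative log-likelihoods (in nats)."""
    return math.exp(sum(nlls) / len(nlls))

# Illustrative values only, not results from any run in this thread.
nll_with_yarn = [2.10, 1.95, 2.30, 2.05]
nll_without_yarn = [2.05, 1.90, 2.25, 2.00]

ppl_with = perplexity(nll_with_yarn)
ppl_without = perplexity(nll_without_yarn)

# Near zero means YaRN barely affects quality; positive means it hurts.
log_ratio = math.log(ppl_with / ppl_without)
print(f"ln(PPL(yarn)/PPL(no yarn)) = {log_ratio:.4f}")
```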
📚 The doc issue
https://huggingface.co/Qwen/Qwen2.5-72B-Instruct#processing-long-texts
But when starting like this:
I get failure:
Suggest a potential alternative/fix
Unsure if this is supposed to be supported or not.