Open Downtown-Case opened 1 week ago
cc @ArthurZucker @gante
I am not sure if it's a bug; the doc is probably wrong, as for transformers specifically you need max_position_embeddings. If vLLM reads original_max_position_embeddings (and transformers does as well) I think we should update it!
@gante as you wrote this!
I believe the transformers implementation reads max_position_embeddings, and doesn't use original_max_position_embeddings for yarn.
We read original_max_position_embeddings for other scaling types, which is why I wonder why we don't for YaRN, if that is the standard!
YaRN supposedly doesn't need it! It just needs the scale! And the scale seems to be a multiple of the base context (for instance, with Qwen 2 it's 4.0, with a base ctx of 32K and a max YaRN ctx of 128K).
...And this is what piqued my interest. The scaling factor is seemingly static for any max_position_embeddings you set. For Qwen 2.5, for instance, it's ostensibly always 4.0 even if one only needs 64K context. But is changing the scaling factor with the max desired context more optimal?
In other words, maybe transformers should only read original_max_position_embeddings and then compute the scale for YaRN? I am testing whether that's optimal over the next few days, but I'm not deeply familiar with how YaRN is supposed to work.
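To make the arithmetic concrete, here is a tiny sketch of the computation I'm suggesting. The `yarn_factor` helper is hypothetical (not a transformers function); it assumes factor = target ctx / original ctx, which is how Qwen 2.5's published numbers line up (32K base, 128K max, factor 4.0):

```python
def yarn_factor(original_max_position_embeddings: int,
                target_max_position_embeddings: int) -> float:
    """Hypothetical helper: derive the YaRN scale from the base (pre-YaRN)
    context length and the desired extended context length, assuming
    factor = target / original."""
    if target_max_position_embeddings <= original_max_position_embeddings:
        return 1.0  # no scaling needed within the base window
    return target_max_position_embeddings / original_max_position_embeddings

print(yarn_factor(32768, 131072))  # Qwen 2.5's published setting: 4.0
print(yarn_factor(32768, 65536))   # a 64K target would give 2.0
```

Under that assumption, a 64K target would imply a factor of 2.0 rather than the static 4.0 shipped in the config, which is exactly the open question.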
Sorry for the lack of template, but this is not so much a feature request or help request as a request for "clarification", and a possible config/implementation bug with Qwen 2.5.
The Qwen 2.5 series of LLMs are trained with YaRN support, and they mention this on the model page: https://huggingface.co/Qwen/Qwen2.5-32B-Instruct#processing-long-texts
But it appears that original_max_position_embeddings is a parameter that only vLLM reads. Hugging Face transformers does not appear to use this variable; instead it reads max_position_embeddings and uses that in the YaRN config:
https://github.com/huggingface/transformers/blob/2e24ee4dfa39cc0bc264b89edbccc373c8337086/src/transformers/modeling_rope_utils.py#L192
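To make the asymmetry concrete, here is a toy sketch (plain dicts, not the actual library code) of which config field each library appears to consult, in the scenario where a user raises max_position_embeddings to 64K but leaves Qwen 2.5's rope_scaling block untouched:

```python
# Hypothetical config.json contents after a user edit; the rope_scaling
# values are the ones Qwen 2.5 ships with.
config = {
    "max_position_embeddings": 65536,  # user bumped this from 32768
    "rope_scaling": {
        "type": "yarn",
        "factor": 4.0,
        "original_max_position_embeddings": 32768,
    },
}

# Per the code linked above, transformers' YaRN setup keys off this field:
transformers_reads = config["max_position_embeddings"]

# vLLM instead keys off this one:
vllm_reads = config["rope_scaling"]["original_max_position_embeddings"]

print(transformers_reads, vllm_reads)  # the two libraries now disagree
```

If this reading is right, the same edited config would mean different things to the two libraries, which is the crux of the questions below.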
So let's say I want to use Qwen 2.5 at 64K context length optimally... what exactly do I change in the config.json?
Do I leave max_position_embeddings at 32K and "override" it later? Or do I change it in the config and let the YaRN implementation read the new max_position_embeddings?
Do I change the scaling factor? For instance, should it be 2.0 instead of 4.0 if the original ctx is 32K and the max desired context is 64K?
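For reference, this is the rope_scaling block the Qwen 2.5 model card tells you to add to config.json for long-context use (to the best of my reading of that page); the 2.0 variant in the comment is my hypothetical adjustment for a 64K target, not something the card recommends:

```json
{
  "rope_scaling": {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
  }
}
```

The question is whether `"factor": 2.0` (with the same original_max_position_embeddings of 32768) would be the better setting when only 64K of context is needed.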
And is this factor not updated dynamically? The Qwen 2.5 page implies that transformers (unlike vLLM) supports a "non-static" YaRN factor, but the transformers function specifically notes that seq_len is "Unused for this type of RoPE." Does this mean it should take the max desired context, and that's its form of dynamism?