huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

[Feature Ambiguity] How exactly do I activate YaRN for Qwen 2.5 and similar models? Is the implementation misconfigured? #33783

Open Downtown-Case opened 1 week ago

Downtown-Case commented 1 week ago

Sorry for the lack of a template, but this is not so much a feature request or a help request as a request for clarification, plus a possible config/implementation bug with Qwen 2.5.

The Qwen 2.5 series of LLMs is trained with YaRN support, and the model page mentions this: https://huggingface.co/Qwen/Qwen2.5-32B-Instruct#processing-long-texts

Processing Long Texts

The current config.json is set for context length up to 32,768 tokens. To handle extensive inputs exceeding 32,768 tokens, we utilize YaRN, a technique for enhancing model length extrapolation, ensuring optimal performance on lengthy texts. For supported frameworks, you could add the following to config.json to enable YaRN:

{
  ...,
  "rope_scaling": {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
  }
}

But it appears that "original_max_position_embeddings" is a parameter that only vLLM reads. Hugging Face Transformers does not appear to use it; instead it reads "max_position_embeddings" and uses that in the YaRN parameter computation:

https://github.com/huggingface/transformers/blob/2e24ee4dfa39cc0bc264b89edbccc373c8337086/src/transformers/modeling_rope_utils.py#L192
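To illustrate what I mean, something like this should show it. This is a rough sketch, and I'm assuming the ROPE_INIT_FUNCTIONS registry and its (config, device) call signature from modeling_rope_utils can be poked at directly:

from transformers import AutoConfig
from transformers.modeling_rope_utils import ROPE_INIT_FUNCTIONS

# Stock Qwen 2.5 config (rope_scaling unset), with the model card's YaRN dict added.
config = AutoConfig.from_pretrained("Qwen/Qwen2.5-32B-Instruct")
config.rope_scaling = {
    "rope_type": "yarn",  # the model card uses "type"; I believe "rope_type" is the newer spelling
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
}

# The YaRN init reads config.max_position_embeddings as the "original" context,
# so changing it changes the computed frequencies, while the
# original_max_position_embeddings key above appears to be ignored.
inv_freq_32k, attn_scaling = ROPE_INIT_FUNCTIONS["yarn"](config, device="cpu")
config.max_position_embeddings = 131072
inv_freq_128k, _ = ROPE_INIT_FUNCTIONS["yarn"](config, device="cpu")
print((inv_freq_32k != inv_freq_128k).any())  # True, i.e. max_position_embeddings is what drives it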

So let's say I want to use Qwen 2.5 at 64K context length optimally... what exactly do I change in the config.json?

Do I leave max_position_embeddings at 32K and "override" it later? Or do I change it in the config and let the YaRN implementation read the new max_position_embeddings?

Do I change the scaling factor? For instance, should it be 2.0 instead of 4.0 if the original ctx is 32K and the max desired context is 64K?
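Concretely, these are the two config.json variants I'm weighing (sketches based on my reading above; I don't know which one, if either, the YaRN code actually expects):

Variant A, exactly what the model card says, leaving max_position_embeddings untouched:

{
  ...,
  "max_position_embeddings": 32768,
  "rope_scaling": {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
  }
}

Variant B, scaling the factor to the 64K context I actually want:

{
  ...,
  "max_position_embeddings": 65536,
  "rope_scaling": {
    "factor": 2.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
  }
}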

And is this factor not updated dynamically? The Qwen 2.5 page implies that transformers (unlike vLLM) supports a "non-static" YaRN factor, but the transformers function specifically notes that seq_len is "Unused for this type of RoPE." Does this mean it should take the maximum desired context, and that's its form of dynamism?
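For completeness, the other route I've considered for activating YaRN, instead of editing config.json, is overriding at load time; as far as I know, kwargs passed to from_pretrained that match config attributes are used to override them:

from transformers import AutoModelForCausalLM

# Same dict as the model card; whether max_position_embeddings should also be
# raised (and to what) is exactly what I'm asking above.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-32B-Instruct",
    rope_scaling={
        "type": "yarn",
        "factor": 4.0,
        "original_max_position_embeddings": 32768,
    },
)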

LysandreJik commented 1 week ago

cc @ArthurZucker @gante

ArthurZucker commented 5 days ago

I am not sure it's a bug; the doc is probably wrong, as for transformers specifically you need max_position_embeddings. If vLLM reads original_max_position_embeddings (and transformers does as well) I think we should update it!

@gante as you wrote this!

Downtown-Case commented 5 days ago

I believe the transformers implementation reads max_position_embeddings, and doesn't use original_max_position_embeddings for yarn.

ArthurZucker commented 4 days ago

We read original_max_position_embeddings for other scaling types, which is why I wonder why we don't for YaRN, if that is the standard!

Downtown-Case commented 4 days ago

YaRN supposedly doesn't need it! It just needs the scale! And the scale seems to be a multiple of the base context (for instance, with Qwen 2 it's 4.0, with a base ctx of 32K and a max YaRN ctx of 128K).

...And this is what piqued my interest. The scaling factor is seemingly static for any max_position_embeddings you set. For Qwen 2.5, for instance, it's ostensibly always 4.0 even if one only needs 64K context. But is changing the scaling factor with the max desired context more optimal?

In other words, maybe transformers should only read original_max_position_embeddings and then compute the scale for YaRN? I am testing whether that's optimal over the next few days, but I'm not deeply familiar with how YaRN is supposed to work.
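Concretely, by "compute the scale" I just mean this (my assumption about how the factor relates to the two lengths, not how transformers currently behaves):

# Hypothetical: derive the YaRN factor from the original context and the context actually wanted.
original_max_position_embeddings = 32768  # base context the model was trained at
target_max_position_embeddings = 65536    # context I actually want to run at
factor = target_max_position_embeddings / original_max_position_embeddings  # 2.0 rather than the static 4.0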