turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Add YaRN scaling for Qwen 2.5 #642

Closed · Downtown-Case closed 1 month ago

Downtown-Case commented 1 month ago

As described here:

https://huggingface.co/Qwen/Qwen2.5-32B-Instruct#processing-long-texts

I adapted transformers' YaRN implementation. Note that you must manually add this block to the model config for the YaRN scaling to kick in:

{
  ...,
  "rope_scaling": {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
  }
}
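For reference, the logic I adapted boils down to the YaRN inverse-frequency blend from transformers' modeling_rope_utils. Here is a rough, self-contained sketch of that computation; the function name and defaults are mine for illustration, not the actual code in this PR:

import math
import torch

def yarn_inv_freq(head_dim, rope_theta, factor, max_position_embeddings,
                  beta_fast=32.0, beta_slow=1.0):
    # Dimension index at which RoPE completes `num_rotations` full turns over the
    # original context window (YaRN's "correction dim")
    def find_correction_dim(num_rotations):
        return (head_dim * math.log(max_position_embeddings / (num_rotations * 2 * math.pi))) \
               / (2 * math.log(rope_theta))

    low = max(math.floor(find_correction_dim(beta_fast)), 0)
    high = min(math.ceil(find_correction_dim(beta_slow)), head_dim - 1)

    # Linear ramp over the half-dimension: 0 below `low` (keep original freqs),
    # 1 above `high` (fully interpolate by `factor`)
    ramp = torch.clamp(
        (torch.arange(head_dim // 2, dtype=torch.float32) - low) / max(high - low, 1e-3),
        0.0, 1.0,
    )

    pos_freqs = rope_theta ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim)
    inv_freq_extrapolation = 1.0 / pos_freqs             # unscaled frequencies
    inv_freq_interpolation = 1.0 / (factor * pos_freqs)  # frequencies stretched by `factor`

    # Blend: high-frequency dims stay as-is, low-frequency dims get interpolated
    inv_freq = inv_freq_extrapolation * (1.0 - ramp) + inv_freq_interpolation * ramp

    # Attention temperature ("mscale") suggested by the YaRN paper
    attention_factor = 0.1 * math.log(factor) + 1.0
    return inv_freq, attention_factor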

Relevant issue: https://github.com/turboderp/exllamav2/issues/641

Seems to work in exui with Qwen 2.5 32B Instruct; I'm trying it with this quant: https://huggingface.co/DrNicefellow/Qwen2.5-32B-Instruct-4.25bpw-exl2

It repeats like crazy at 80K context without this PR, and seems quite coherent with it, but consider this a WIP testing PR! I'm just an end user.

Downtown-Case commented 1 month ago

Testing in exui, I noticed that it only calculates the inv_scaling factor once, for the first generation, and never touches this code again.

Does a trigger need to be placed somewhere to recalculate the scaling factor for every generation, or is this working as intended? It doesn't actually use the current context length anywhere in this (or the transformers) code.

Downtown-Case commented 1 month ago

I'm still trying to wrap my head around the transformers/vLLM implementations and what Qwen put in the config... original_max_position_embeddings is not actually used in transformers; it's only read by vLLM to derive a max context length.

I guess the assumption is that users won't change max_position_embeddings, but will simply add that YaRN field and have vLLM silently set a 128K context... but if they do change it in the config, it's there just in case?

Is max_pos_embeddings in the code supposed to be 32K, since that's how it's configured by default in Qwen instruct when transformers loads it? https://github.com/huggingface/transformers/blob/2e24ee4dfa39cc0bc264b89edbccc373c8337086/src/transformers/modeling_rope_utils.py#L192
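To make that vLLM reading concrete, the max context it would derive from the rope_scaling block above is just the product of the two fields (my arithmetic, not vLLM's actual code):

original_max_position_embeddings = 32768   # from the rope_scaling block
factor = 4.0                               # from the rope_scaling block
derived_max_context = int(original_max_position_embeddings * factor)
print(derived_max_context)                 # 131072, i.e. the advertised 128K window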

Downtown-Case commented 1 month ago

I'm overthinking this. It appears transformers simply grabs max_position_embeddings and calculates static scaling factors from that, which must be the "dynamism" the Qwen team refers to, instead of just multiplying the base ctx by the scaling factor like vLLM does.

And that static factor transformers uses is apparently good enough without having to recalculate it for different sequence lengths? As the paper says:

"Furthermore, it has zero overhead during both inference and training, as RoPE embeddings are generated in advance and are reused for all forward passes."