turboderp / exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
MIT License

YaRN Support #272

Open grimulkan opened 10 months ago

grimulkan commented 10 months ago

Any thoughts/plans about YaRN support for the positional embeddings? https://github.com/jquesnelle/yarn

I don't actually see it beat regular linear scaling with fine-tuning in the paper, but presumably it extends beyond the fine-tuned context length without breaking and performs better than plain PI at shorter contexts.

I see GPTQ quants from TheBloke for some models trained with YaRN already. Not sure how those are supposed to be working without changing the way the positions are calculated.

I don't think what the authors call NTK-by-parts is supported by exllama either (YaRN is just a slight modification), so maybe there's something about this that makes it tricky to integrate?

This is all still static RoPE (set scale, base, alpha whatever at load time).
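For reference, a minimal sketch of what "static" means here, assuming the usual RoPE conventions (the function name and arguments are illustrative, not exllama's actual API):

```python
import torch

def static_inv_freq(head_dim, base=10000.0, linear_scale=1.0, ntk_alpha=1.0):
    # Computed once at load time; the resulting sin/cos tables never change.
    # NTK-alpha stretches the base non-linearly across dimensions,
    # linear (PI) scaling just divides all frequencies by a constant.
    base = base * ntk_alpha ** (head_dim / (head_dim - 2))
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    return inv_freq / linear_scale
```

With both factors at 1.0 this reduces to vanilla RoPE; the point is that nothing here depends on the actual sequence length at inference time.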

niceblue88 commented 10 months ago

I second this a lot. Being able to scale to large contexts while keeping VRAM use low is the most valuable part; almost every use case gets more useful with more context. However, in exllama I think the positional embeddings are tightly bound to the great optimisations for speed and minimal VRAM, which makes it harder to make them dynamic without losing those optimisations. Perhaps they could be broken out to allow for this?

turboderp commented 10 months ago

I'm still not sure what "dynamic" positional encodings actually means, and how you would use them with cached keys.

grimulkan commented 10 months ago

I am not sure we need them to be dynamic; YaRN works either way, I think. The static version I described above still computes the positional table once at the start, just like exllama does today, as far as I understand.

grimulkan commented 10 months ago

By ‘dynamic’, the paper means something that changes the RoPE scaling depending on the actual context size (it only compresses positions once the context exceeds the original pre-trained length). This is optional. They have this to say about caching under the dynamic implementation:

Some care has to be taken when using Dynamic Scaling with kv-caching [6], as in some implementations, the RoPE embeddings are cached. The correct implementation should cache the kv-embeddings before applying RoPE, as the RoPE embedding of every token changes when s changes.
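To make that concrete, a rough sketch of the dynamic idea (the threshold logic and names here are assumptions for illustration, not from any particular implementation):

```python
def dynamic_scale(seq_len, trained_len=4096):
    # Only start compressing positions once the context actually exceeds
    # the pre-trained length; shorter contexts use unmodified RoPE.
    if seq_len <= trained_len:
        return 1.0
    return seq_len / trained_len
```

Because this factor (the paper's s) changes as generation proceeds, the sin/cos applied to a given position is not constant over time, which is what the caching caveat is about.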

My crude understanding (we can use each with or without the dynamic aspect above):

- kaiokendev linear: always compress positions by some factor. Works well when fine-tuned.
- NTK-alpha or CodeLlama-style base change: change the scale by a specific nonlinear equation (modifying either the exponent or the base). Not as good when fine-tuned (though maybe using a giant base like Meta did kinda works, but alpha doesn't fine-tune as well as linear).
- NTK-by-parts (not supported by transformers/exllama): uses a different formula depending on which hidden-state dimension the position is computed for. Presumably fine-tunes better than linear and extrapolates?
- YaRN: the above, plus scaling attention by a constant factor depending on the compression ratio (see the sketch below).
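Here is the sketch mentioned above of how I understand the by-parts/YaRN frequency table, assuming the paper's formulation (the parameter names and the ramp-on-rotations shortcut are mine, not exllama or the reference code):

```python
import math
import torch

def yarn_inv_freq(head_dim, scale, orig_ctx=4096, base=10000.0,
                  beta_fast=32.0, beta_slow=1.0):
    # Vanilla RoPE frequencies and their fully interpolated (PI) counterparts.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    inv_freq_interp = inv_freq / scale

    # Number of full rotations each dimension completes over the original context.
    rotations = orig_ctx * inv_freq / (2 * math.pi)

    # Piecewise blend: high-frequency dims (many rotations) keep their original
    # frequency, low-frequency dims are fully interpolated, the rest are mixed.
    ramp = ((rotations - beta_slow) / (beta_fast - beta_slow)).clamp(0.0, 1.0)
    inv_freq_mixed = inv_freq_interp * (1.0 - ramp) + inv_freq * ramp

    # YaRN additionally scales attention by a constant depending on s;
    # the paper gives sqrt(1/t) = 0.1 * ln(s) + 1.
    attn_factor = 0.1 * math.log(scale) + 1.0
    return inv_freq_mixed, attn_factor
```

None of this changes at runtime, so a static implementation like exllama's current one could in principle build this table once at load.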

turboderp commented 10 months ago

The correct implementation should cache the kv-embeddings before applying RoPE, as the RoPE embedding of every token changes when s changes.

This is the part that doesn't make sense to me. I may just be failing to wrap my head around it, but as far as I can tell this would only work for the first layer. The state that exits the first layer is computed as a function of (among other things) those positional embeddings, and the keys and values in turn are produced from that state. So while you can save the keys/values from layer 2 without the embeddings from layer 2, they will still depend on the embeddings from layer 1. And so on throughout the model.

grimulkan commented 10 months ago

I see. Does that also mess with methods that change the position embeddings by hidden dimension like YaRN?

turboderp commented 10 months ago

I'm not sure what those do exactly, especially since the default RoPE implementation already adapts to the hidden dimension of the model. But the hidden dimension of the model is constant regardless of context size or position, so there shouldn't be an issue with the K/V cache.

grimulkan commented 10 months ago

From my limited understanding, the authors claim that NTK-alpha scaling effectively extrapolates some dimensions, unlike linear scaling, which never does. This, they say, is why it is hard to fine-tune (probably at best those dims become a no-op). The modifications basically make sure no dimension is extrapolated, using a piecewise calculation, which the simple exponential angle scaling equation of NTK doesn't do by default.

Both vary with the hidden dimension; NTK-by-parts just uses a different equation.

That said, I haven't checked, but I think Meta's method of using a giant fixed base also effectively avoids the extrapolation, and the authors don't cover that (edit: ok, they apparently do, and claim better scaling by experiment). Like YaRN, CodeLlama also extrapolates well after fine-tuning, and is supported by exllama already.
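To make the "extrapolation" point concrete, a small sanity-check sketch of the criterion as I read it from the paper (entirely illustrative; nothing here is exllama code, and the criterion is my own reading):

```python
import math
import torch

def extrapolated_dims(head_dim, orig_ctx, new_ctx, base=10000.0, ntk_alpha=1.0):
    exponents = torch.arange(0, head_dim, 2).float() / head_dim
    inv_freq = 1.0 / (base ** exponents)                    # pre-training frequencies
    scaled_base = base * ntk_alpha ** (head_dim / (head_dim - 2))
    inv_freq_new = 1.0 / (scaled_base ** exponents)         # NTK-alpha frequencies

    trained_max_angle = orig_ctx * inv_freq                 # largest angle seen in training
    new_max_angle = new_ctx * inv_freq_new                  # largest angle needed now

    # Dims that completed a full rotation during training have seen every phase
    # and can't be pushed out of distribution; for the rest, check whether the
    # extended context asks for angles that training never covered.
    saw_full_rotation = trained_max_angle >= 2.0 * math.pi
    return ~saw_full_rotation & (new_max_angle > trained_max_angle)
```

As I understand it, the by-parts/YaRN blend is designed so that a check like this comes out all-False, which is the "no dimension is extrapolated" property mentioned above.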