Open mmmans opened 1 year ago
Plain NTK-aware scaling performs poorly when fine-tuning, so this isn't an apples-to-apples comparison. From the paper:

> However, one major disadvantage of this method is that given it is not just an interpolation scheme, some dimensions are slightly extrapolated to "out-of-bound" values, thus fine-tuning with "NTK-aware" interpolation [4] yields inferior results to PI [7].
I believe LlamaDynamicNTKScalingRotaryEmbedding in transformers is one of the better-performing methods without fine-tuning. I've been using a variant of it with llama.cpp, but it's inefficient due to the way llama.cpp implements KV caching. It may also be worth comparing LlamaPartNTKScaledRotaryEmbedding and LlamaDynamicPartNTKScaledRotaryEmbedding.
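For readers unfamiliar with the dynamic variant: instead of using a fixed scale, it rescales the RoPE base on the fly once the sequence grows past the original context window. A minimal sketch of that base-rescaling step, following the formula used in transformers' `LlamaDynamicNTKScalingRotaryEmbedding` (the `dim` and `scaling_factor` values here are illustrative, not Baichuan's):

```python
import math

def dynamic_ntk_base(seq_len, base=10000.0, dim=128,
                     max_position_embeddings=2048, scaling_factor=2.0):
    """Rescale the RoPE base once seq_len exceeds the original window.

    Mirrors the base update in transformers'
    LlamaDynamicNTKScalingRotaryEmbedding; dim and scaling_factor are
    illustrative example values.
    """
    if seq_len <= max_position_embeddings:
        return base  # within the original window: plain RoPE, no scaling
    factor = (scaling_factor * seq_len / max_position_embeddings) - (scaling_factor - 1)
    return base * factor ** (dim / (dim - 2))

# At or below the original window the base is untouched; beyond it the base
# grows, stretching the low-frequency dimensions instead of extrapolating them.
print(dynamic_ntk_base(2048))  # 10000.0
print(dynamic_ntk_base(4096))  # > 10000
```

Because the base (and hence the frequencies) changes with sequence length, cached keys computed at one length are inconsistent with queries at another, which is why this interacts badly with llama.cpp's KV cache.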
YaRN has two hyperparameters that should be tweaked depending on the model's architecture. The `self.beta_slow=32` constant and the `self.mscale` function might not be optimal for Baichuan models. If those variables are set to a non-optimal value, it is possible for YaRN to be inferior to "NTK-aware" scaling.

You can try tuning the `attn_factor=1` variable as a first-order scaling of the `mscale` value (try setting it to 0.9 or 1.1 and check whether PPL improves).
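To make the `attn_factor` knob concrete, here is a minimal sketch of how YaRN's attention scaling is typically computed: the `0.1 * ln(s) + 1.0` form from the YaRN paper, with `attn_factor` applied as a direct multiplier. Treat the exact constants as assumptions when porting to a new architecture:

```python
import math

def yarn_mscale(scale: float) -> float:
    """YaRN attention temperature term: 0.1 * ln(s) + 1.0 for scale s > 1."""
    if scale <= 1.0:
        return 1.0
    return 0.1 * math.log(scale) + 1.0

def effective_mscale(scale: float, attn_factor: float = 1.0) -> float:
    # attn_factor multiplies mscale directly, which is why sweeping it over
    # e.g. 0.9 / 1.0 / 1.1 is a cheap way to probe whether the default
    # constants suit a new model family.
    return yarn_mscale(scale) * attn_factor
```

Since perplexity varies smoothly with this multiplier, a small grid search over `attn_factor` is usually enough to check whether the defaults are a good fit.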
When we tested YaRN on LLaMA and Llama 2 models, it performed better than any of our previous methods in both finetuned and non-finetuned scenarios.
> When we tested YaRN on LLaMA and Llama 2 models, it performed better than any of our previous methods in both finetuned and non-finetuned scenarios.
Good to know!
When will YaRN be available in the Axolotl framework?
Thanks! I have finetuned the models and YaRN prevails over the NTK-variant method. But there is some loss on common benchmarks such as MMLU.
Finetuning longer can compensate for the loss on common benchmarks, but then performance on long-context tasks drops.
Is there any way to alleviate the degradation on common benchmarks while maintaining the ability to handle long-context tasks?
What dataset, scale $s$, training steps, and optimizer are you using? It would be hard to diagnose the drop in performance during finetuning without those.
Can I ask what is the dataset you used to train Baichuan2? Is it available on Huggingface or somewhere else?
`self.beta_slow`
I see the default parameters below. Is `self.beta_slow=32` correct? What should these four parameters be set to: `extrapolation_factor=1`, `attn_factor=1`, `beta_fast=32`, `beta_slow=1`?
```python
class LlamaDynamicYaRNScaledRotaryEmbedding(torch.nn.Module):
    def __init__(self, dim, max_position_embeddings=2048, base=10000,
                 original_max_position_embeddings=2048, extrapolation_factor=1,
                 attn_factor=1, beta_fast=32, beta_slow=1,
                 finetuned=False, device=None):
        ...
```
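For context on what those defaults mean: `beta_fast` and `beta_slow` are rotation counts that pick out the boundary dimensions of YaRN's interpolation ramp, and `beta_fast=32, beta_slow=1` matches the YaRN reference defaults. A sketch of the correction-range computation as in the YaRN reference code (the head dimension of 128 in the example is illustrative):

```python
import math

def find_correction_dim(num_rotations, dim, base=10000,
                        max_position_embeddings=2048):
    # Dimension index whose wavelength completes `num_rotations` full
    # rotations over the original context window.
    return (dim * math.log(max_position_embeddings /
                           (num_rotations * 2 * math.pi))) / (2 * math.log(base))

def find_correction_range(beta_fast, beta_slow, dim, base=10000,
                          max_position_embeddings=2048):
    # Dimensions below `low` rotate fast and are left un-interpolated;
    # dimensions above `high` rotate slowly and are fully interpolated;
    # dimensions in between get a linear ramp.
    low = math.floor(find_correction_dim(beta_fast, dim, base,
                                         max_position_embeddings))
    high = math.ceil(find_correction_dim(beta_slow, dim, base,
                                         max_position_embeddings))
    return max(low, 0), min(high, dim - 1)

# With the defaults above and a 128-dim head:
print(find_correction_range(32, 1, 128))  # (16, 41)
```

So tuning `beta_fast`/`beta_slow` moves the ramp's boundaries; setting `beta_slow` to 32 would collapse the ramp rather than widen it.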
model info
Was this phenomenon observed in your experiments? In short context windows: NTK > YaRN; in longer context windows: YaRN > NTK.