Open mmmans opened 1 year ago
Plain NTK-aware scaling performs poorly when fine-tuning, so this isn't an apples-to-apples comparison. From the paper:

> However, one major disadvantage of this method is that given it is not just an interpolation scheme, some dimensions are slightly extrapolated to "out-of-bound" values, thus fine-tuning with "NTK-aware" interpolation [4] yields inferior results to PI [7].
I believe LlamaDynamicNTKScalingRotaryEmbedding in transformers is one of the better-performing methods without fine-tuning. I've been using a variant of it with llama.cpp, but it's inefficient due to the way llama.cpp implements KV caching. It may also be worth comparing LlamaPartNTKScaledRotaryEmbedding and LlamaDynamicPartNTKScaledRotaryEmbedding.
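For readers unfamiliar with the dynamic variant: instead of using a fixed scale, it rescales the RoPE base on the fly once the sequence grows past the original context window. A minimal sketch of that base-rescaling step, following the formula used in transformers' `LlamaDynamicNTKScalingRotaryEmbedding` (the `dim` and `scaling_factor` values here are illustrative, not Baichuan's):

```python
import math

def dynamic_ntk_base(seq_len, base=10000.0, dim=128,
                     max_position_embeddings=2048, scaling_factor=2.0):
    """Rescale the RoPE base once seq_len exceeds the original window.

    Mirrors the base update in transformers'
    LlamaDynamicNTKScalingRotaryEmbedding; dim and scaling_factor are
    illustrative example values.
    """
    if seq_len <= max_position_embeddings:
        return base  # within the original window: plain RoPE, no scaling
    factor = (scaling_factor * seq_len / max_position_embeddings) - (scaling_factor - 1)
    return base * factor ** (dim / (dim - 2))

# At or below the original window the base is untouched; beyond it the base
# grows, stretching the low-frequency dimensions instead of extrapolating them.
print(dynamic_ntk_base(2048))  # 10000.0
print(dynamic_ntk_base(4096))  # > 10000
```

Because the base (and hence the frequencies) changes with sequence length, cached keys computed at one length are inconsistent with queries at another, which is why this interacts badly with llama.cpp's KV cache.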
YaRN has two hyperparameters that should be tweaked depending on the model's architecture. The `self.beta_slow=32` constant and the `self.mscale` function might not be optimal for Baichuan models. If those variables are set to a non-optimal value, it is possible for YaRN to be inferior to "NTK-aware" scaling.

You can try tuning the `attn_factor=1` variable as a first-order scaling of the `mscale` value (try setting it to 0.9 or 1.1 and check whether PPL improves).
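To make the `attn_factor` knob concrete, here is a minimal sketch of how YaRN's attention scaling is typically computed: the `0.1 * ln(s) + 1.0` form from the YaRN paper, with `attn_factor` applied as a direct multiplier. Treat the exact constants as assumptions when porting to a new architecture:

```python
import math

def yarn_mscale(scale: float) -> float:
    """YaRN attention temperature term: 0.1 * ln(s) + 1.0 for scale s > 1."""
    if scale <= 1.0:
        return 1.0
    return 0.1 * math.log(scale) + 1.0

def effective_mscale(scale: float, attn_factor: float = 1.0) -> float:
    # attn_factor multiplies mscale directly, which is why sweeping it over
    # e.g. 0.9 / 1.0 / 1.1 is a cheap way to probe whether the default
    # constants suit a new model family.
    return yarn_mscale(scale) * attn_factor
```

Since perplexity varies smoothly with this multiplier, a small grid search over `attn_factor` is usually enough to check whether the defaults are a good fit.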
When we tested YaRN on LLaMA and Llama 2 models, it performed better than any of our previous methods in both finetuned and non-finetuned scenarios.
> When we tested YaRN on LLaMA and Llama 2 models, it performed better than any of our previous methods in both finetuned and non-finetuned scenarios.
Good to know!
When will YaRN be available in the Axolotl framework?
Thanks! I have finetuned the models and YaRN prevails over the NTK-variant method. But there is some loss on common benchmarks such as MMLU.
Finetuning longer can compensate for the loss on common benchmarks, but then performance on long-context tasks drops.
Is there any way to alleviate the degradation on common benchmarks while maintaining the ability to handle long-context tasks?
What dataset, scale $s$, training steps, and optimizer are you using? It would be hard to diagnose the drop in performance during finetuning without those.
Can I ask what is the dataset you used to train Baichuan2? Is it available on Huggingface or somewhere else?
`self.beta_slow`
I see the default parameters below. Is `self.beta_slow=32` correct? What should these four parameters be set to: `extrapolation_factor=1`, `attn_factor=1`, `beta_fast=32`, `beta_slow=1`?
```python
class LlamaDynamicYaRNScaledRotaryEmbedding(torch.nn.Module):
    def __init__(self, dim, max_position_embeddings=2048, base=10000,
                 original_max_position_embeddings=2048, extrapolation_factor=1,
                 attn_factor=1, beta_fast=32, beta_slow=1,
                 finetuned=False, device=None):
        ...
```
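For context on what those defaults mean: `beta_fast` and `beta_slow` are rotation counts that pick out the boundary dimensions of YaRN's interpolation ramp, and `beta_fast=32, beta_slow=1` matches the YaRN reference defaults. A sketch of the correction-range computation as in the YaRN reference code (the head dimension of 128 in the example is illustrative):

```python
import math

def find_correction_dim(num_rotations, dim, base=10000,
                        max_position_embeddings=2048):
    # Dimension index whose wavelength completes `num_rotations` full
    # rotations over the original context window.
    return (dim * math.log(max_position_embeddings /
                           (num_rotations * 2 * math.pi))) / (2 * math.log(base))

def find_correction_range(beta_fast, beta_slow, dim, base=10000,
                          max_position_embeddings=2048):
    # Dimensions below `low` rotate fast and are left un-interpolated;
    # dimensions above `high` rotate slowly and are fully interpolated;
    # dimensions in between get a linear ramp.
    low = math.floor(find_correction_dim(beta_fast, dim, base,
                                         max_position_embeddings))
    high = math.ceil(find_correction_dim(beta_slow, dim, base,
                                         max_position_embeddings))
    return max(low, 0), min(high, dim - 1)

# With the defaults above and a 128-dim head:
print(find_correction_range(32, 1, 128))  # (16, 41)
```

So tuning `beta_fast`/`beta_slow` moves the ramp's boundaries; setting `beta_slow` to 32 would collapse the ramp rather than widen it.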
model info
Was this phenomenon observed in your experiments? In short context windows: NTK > YaRN; in longer context windows: YaRN > NTK.