jquesnelle / yarn

YaRN: Efficient Context Window Extension of Large Language Models
MIT License

Discussion: how to apply this experiment to the llama2 70B model? #11

Open ghost opened 10 months ago

ghost commented 10 months ago

I am curious what is required to apply this method to the 70B-parameter version of the Llama 2 model. On Reddit, I noticed you mentioned: "For training, these models barely fit in 128 80GB A100s using DeepSpeed and FA2." Would the computer at OSC be enough? https://www.osc.edu/resources/technical_support/supercomputers/ascend It has only 96 80GB A100 GPUs. Is that enough to contribute to the state of the art (SoTA)?

bloc97 commented 10 months ago

8x80GB GPUs would be enough for 7B models; however, I do not know if 70B would fit on the 4xA100 nodes... Pinging @jquesnelle and @conceptofmind

It all depends on how much effort we can put into writing the distributed training code (and how long we are willing to wait).

conceptofmind commented 10 months ago

> 8x80GB GPUs would be enough for 7B models; however, I do not know if 70B would fit on the 4xA100 nodes... Pinging @jquesnelle and @conceptofmind
>
> It all depends on how much effort we can put into writing the distributed training code (and how long we are willing to wait).

It can be done through proper parallelization. We were limited in what we could use at Stability AI by both potential intellectual property constraints and a lack of compute. If other sponsors can adequately address those, then we should be able to build a 70B model at longer context lengths (8k-128k) without any issues.

I am currently communicating with LAION and Together. We should seek every possible grant available.

ghost commented 10 months ago

Any plans to implement YaRN in llama.cpp? I need to show a proof of concept (PoC) on smaller models to a potential PI.

cebtenzzre commented 10 months ago

> Any plans to implement YaRN in llama.cpp?

It could be built off of https://github.com/ggerganov/llama.cpp/pull/2268 which was based on the code in this repo, but it was written before the paper came out and I haven't had a chance to read it.

bloc97 commented 10 months ago

> > Any plans to implement YaRN in llama.cpp?
>
> It could be built off of ggerganov/llama.cpp#2268 which was based on the code in this repo, but it was written before the paper came out and I haven't had a chance to read it.

YaRN is just like the NTK-by-parts method in your implementation, but without the "gamma" factors (so no more base change), plus an additional self.mscale factor by which you multiply the RoPE embeddings:

self.register_buffer("cos_cached", (emb.cos() * self.mscale)[None, None, :, :].to(dtype), persistent=False)
self.register_buffer("sin_cached", (emb.sin() * self.mscale)[None, None, :, :].to(dtype), persistent=False)

https://github.com/jquesnelle/yarn/blob/master/scaled_rope/LlamaYaRNScaledRotaryEmbedding.py

We've intentionally made YaRN as simple as possible to implement by ablating everything that had a negligible effect after the finetune.
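
For readers who want to see the two pieces described above side by side, here is a minimal pure-Python sketch, not the repo's code. It blends interpolated and original RoPE frequencies per dimension (the NTK-by-parts part) and computes the extra mscale factor. The function names are hypothetical, and the defaults (beta_fast=32, beta_slow=1, an original context of 4096 for Llama 2, and mscale = 0.1 * ln(scale) + 1) are assumptions that I believe match the linked file, so check them against it before relying on them.

```python
import math

def yarn_mscale(scale):
    """Assumed mscale formula: 0.1 * ln(scale) + 1, and 1.0 when no extension."""
    if scale <= 1:
        return 1.0
    return 0.1 * math.log(scale) + 1.0

def correction_dim(num_rotations, dim, base, orig_ctx):
    """Dimension index whose RoPE wavelength makes `num_rotations` full turns
    over the original context length."""
    return dim * math.log(orig_ctx / (num_rotations * 2 * math.pi)) / (2 * math.log(base))

def yarn_inv_freq(dim, scale, base=10000.0, orig_ctx=4096, beta_fast=32, beta_slow=1):
    """Per-dimension inverse frequencies: high-frequency dimensions keep their
    original value, low-frequency ones are divided by `scale`, with a linear
    ramp in between."""
    low = max(math.floor(correction_dim(beta_fast, dim, base, orig_ctx)), 0)
    high = min(math.ceil(correction_dim(beta_slow, dim, base, orig_ctx)), dim - 1)
    if low == high:
        high += 0.001  # avoid division by zero in the ramp
    inv_freq = []
    for i in range(dim // 2):
        extrapolated = 1.0 / base ** (2 * i / dim)   # original RoPE frequency
        interpolated = extrapolated / scale          # position-interpolated frequency
        ramp = min(max((i - low) / (high - low), 0.0), 1.0)
        keep = 1.0 - ramp  # 1 -> keep original, 0 -> fully interpolate
        inv_freq.append(interpolated * (1 - keep) + extrapolated * keep)
    return inv_freq

# Hypothetical example: head dimension 128, 16x context extension.
freqs = yarn_inv_freq(dim=128, scale=16)
mscale = yarn_mscale(16)
```

The cos and sin tables built from these frequencies would then be multiplied by mscale, as in the two register_buffer lines above.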

ghost commented 9 months ago

It's not easy finding a PI. Any good ideas for putting together a PowerPoint?