Open ghost opened 1 year ago
8x80GB GPUs would be enough for 7B models, but I do not know whether 70B would fit on the 4xA100 nodes... Pinging @jquesnelle and @conceptofmind
It all depends on how much effort we can put into writing the distributed training code (and how long we are willing to wait)
It can be done through proper parallelization. We were limited in what we could use at Stability AI due to both potential intellectual property constraints and a lack of compute. If other sponsors can adequately address those, then we should be able to build a 70B model at longer context lengths (8k-128k) without any issues.
I am currently communicating with LAION and Together. We should seek every possible grant available.
Any plans to implement YaRN in llama.cpp? I need to show a PoC to a potential PI for smaller models.
any plans to implement yarn into llama.cpp
It could be built off of https://github.com/ggerganov/llama.cpp/pull/2268 which was based on the code in this repo, but it was written before the paper came out and I haven't had a chance to read it.
YaRN is just like NTK-by-parts as in your implementation, but without the "gamma" factors (thus no more base change), plus an additional self.mscale factor that you multiply the RoPE embeddings with:
self.register_buffer("cos_cached", (emb.cos() * self.mscale)[None, None, :, :].to(dtype), persistent=False)
self.register_buffer("sin_cached", (emb.sin() * self.mscale)[None, None, :, :].to(dtype), persistent=False)
https://github.com/jquesnelle/yarn/blob/master/scaled_rope/LlamaYaRNScaledRotaryEmbedding.py
We've intentionally made YaRN as simple as possible to implement (by ablating everything that had a negligible effect after the finetune).
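For anyone porting this, here is a minimal sketch of just the mscale part described above, in plain Python without torch. The formula 0.1 * ln(scale) + 1 is the one I believe the linked file uses. The helper names (yarn_mscale, scaled_cos_sin) and the toy dimensions are made up for illustration, and the frequencies below are plain RoPE frequencies standing in for the NTK-by-parts ones, which this sketch does not implement.

```python
import math


def yarn_mscale(scale: float) -> float:
    """Attention scaling factor; 1.0 when the context is not extended."""
    if scale <= 1.0:
        return 1.0
    return 0.1 * math.log(scale) + 1.0


def scaled_cos_sin(position: int, inv_freq: list, mscale: float):
    """Return one row of the cos/sin caches, multiplied by mscale.

    Mirrors (emb.cos() * mscale) and (emb.sin() * mscale) from the thread,
    where emb is the per-frequency angle vector repeated twice.
    """
    angles = [position * f for f in inv_freq]
    emb = angles + angles  # same layout as concatenating freqs with itself
    cos_row = [math.cos(a) * mscale for a in emb]
    sin_row = [math.sin(a) * mscale for a in emb]
    return cos_row, sin_row


# Toy setup (assumption): head dim 8, base 10000, 16x context extension.
# In real YaRN, inv_freq would come from the NTK-by-parts interpolation.
dim, base, scale = 8, 10000.0, 16.0
inv_freq = [1.0 / base ** (2 * i / dim) for i in range(dim // 2)]

mscale = yarn_mscale(scale)
cos_row, sin_row = scaled_cos_sin(position=0, inv_freq=inv_freq, mscale=mscale)
print(f"mscale for scale={scale}: {mscale:.4f}")
```

At position 0 every cos entry equals mscale and every sin entry is 0, which is a quick sanity check when comparing against another implementation.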
Not easy finding a PI, any good ideas for putting together a PowerPoint?
I am curious what is required to apply this method to the 70B parameter version of the Llama 2 model. On Reddit, I noticed you mentioned: "For training, these models barely fit in 128 80GB A100s using DeepSpeed and FA2". Would the cluster at OSC be enough? https://www.osc.edu/resources/technical_support/supercomputers/ascend It has only 96 80GB A100 GPUs. Is that enough to contribute to the SoTA (state of the art)?
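As a rough back-of-the-envelope check on the question above, the sketch below compares model-state memory against aggregate GPU memory. It assumes the common rule of thumb of about 16 bytes per parameter for mixed-precision Adam training (fp16 weights and gradients plus fp32 master weights, momentum, and variance). It ignores activations, which grow with context length and are likely the dominant cost at 8k-128k, so treat the result as a lower bound and not as an answer to whether 96 GPUs suffice. The function names are hypothetical.

```python
GB = 1e9  # decimal gigabytes, matching how GPU memory is usually quoted


def model_state_gb(n_params: int, bytes_per_param: int = 16) -> float:
    """Weights + gradients + Adam states, assuming ~16 bytes per parameter."""
    return n_params * bytes_per_param / GB


def cluster_gb(n_gpus: int, gb_per_gpu: int = 80) -> int:
    """Aggregate memory across all GPUs, before any framework overhead."""
    return n_gpus * gb_per_gpu


for name, n_params in [("7B", 7_000_000_000), ("70B", 70_000_000_000)]:
    need = model_state_gb(n_params)
    for n_gpus in (8, 96, 128):
        have = cluster_gb(n_gpus)
        print(f"{name}: model states ~{need:.0f} GB, "
              f"{n_gpus}x80GB = {have} GB, "
              f"left for activations ~{have - need:.0f} GB")
```

This only works out if the model states are fully sharded across GPUs (for example with DeepSpeed ZeRO stage 3); without sharding, each GPU would need the full model state on its own.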