Closed SpaceCowboy850 closed 6 months ago
@jzhang38 Are you going to fix softmax for your next run? ( https://www.evanmiller.org/attention-is-off-by-one.html )
[...] traced the existence of these outlier values to the attention mechanism’s softmax function, [...] finding the off-by-one error [...] .
and corresponding https://github.com/softmax1/Flash-Attention-Softmax-N
edit: more on the topic https://datasciencecastnet.home.blog/2023/08/04/exploring-softmax1-or-community-research-for-the-win/ https://wandb.ai/capecape/llamac/reports/Training-Tiny-Llamas-for-fun-and-science--Vmlldzo1MDM2MDg0
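For anyone skimming the links: the proposed change is to add 1 to the softmax denominator so attention heads can output (near-)zero total weight instead of being forced to distribute probability mass somewhere. A minimal sketch of that "softmax1" (pure Python for clarity; function name and max-shift trick are mine, the formula is from the blog post):

```python
import math

def softmax_one(logits):
    """'Quiet' softmax: exp(x_i) / (1 + sum_j exp(x_j)).

    The extra +1 acts like an implicit logit fixed at 0, so when every
    real logit is very negative, all outputs can shrink toward zero
    instead of summing to 1 (the claimed fix for outlier activations).
    """
    # include the implicit 0 logit in the max shift for numerical stability
    m = max(max(logits), 0.0)
    exps = [math.exp(x - m) for x in logits]
    denom = math.exp(-m) + sum(exps)  # exp(-m) is the shifted "+1" term
    return [e / denom for e in exps]
```

With all-zero logits this gives 1/(n+1) per entry rather than 1/n, and with strongly negative logits the whole output vector goes to ~0.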
@jzhang38 I heard those tricks make training faster
Make your vocab size a multiple of 64 (Andrej Karpathy says so!)
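The trick is just rounding the embedding/output dimension up to the next multiple (the padding rows are never produced by the tokenizer), since aligned matrix dimensions map better onto GPU tensor cores. A one-liner sketch (helper name is mine):

```python
def pad_vocab(vocab_size: int, multiple: int = 64) -> int:
    """Round vocab size up to the next multiple of `multiple`.

    The extra token rows are simply unused; the win is that the big
    lm_head / embedding matmuls get nicely aligned dimensions.
    """
    return ((vocab_size + multiple - 1) // multiple) * multiple
```

For example, GPT-2's 50257-token vocab pads to 50304, which is the value nanoGPT uses; Llama's 32000 is already a multiple of 64.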
Multipack Sampler
The Multipack sampler is designed for padding-free distributed training of large language models
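The core idea is bin packing: instead of padding every sequence to the batch max, group sequences so each batch fills a fixed token budget with almost no pad tokens. A toy first-fit-decreasing sketch of that idea (not the actual Multipack implementation, which also balances bins across distributed ranks):

```python
def pack_sequences(lengths, budget):
    """Greedy first-fit-decreasing packing.

    Groups sequence indices so each group's total length stays within
    `budget`, minimizing wasted pad tokens per batch.
    """
    bins = []  # each bin: [remaining_capacity, [sequence indices]]
    # place longest sequences first; improves packing density
    for idx in sorted(range(len(lengths)), key=lambda i: -lengths[i]):
        for b in bins:
            if lengths[idx] <= b[0]:
                b[0] -= lengths[idx]
                b[1].append(idx)
                break
        else:  # no existing bin fits: open a new one
            bins.append([budget - lengths[idx], [idx]])
    return [indices for _, indices in bins]
```

Each returned group then becomes one padding-free (or nearly so) training batch.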
@Green-Sky Yeah, thanks for the info, and I also read that blog! My concerns with this change are: 1. It is still not a widely adopted practice. 2. It would make TinyLlama incompatible with Llama and thus hard to use with many other frameworks.
@jzhang38
- It will make TinyLlama incompatible with Llama and thus hard to apply TinyLlama on many other frameworks.
This really depends on how it's done. You could alternatively prefix/postfix a frozen zero logit, which could even be written to the weights file (slightly wasteful, and it would need to be optimized in code).
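To make the compatibility argument concrete: appending one extra attention slot whose score is frozen at 0 (and discarding its output) reproduces softmax1's denominator exactly, while the checkpoint format stays a plain softmax model. A small numerical check of that equivalence (variable names mine):

```python
import math

def softmax(xs):
    """Standard numerically-stable softmax."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

scores = [1.5, -0.3, 0.7]  # arbitrary example attention scores

# ordinary softmax over scores plus a frozen 0 "sink" slot,
# then drop the sink's weight
with_sink = softmax(scores + [0.0])[:-1]

# softmax1 computed directly: exp(x_i) / (1 + sum_j exp(x_j))
denom = 1.0 + sum(math.exp(x) for x in scores)
softmax1 = [math.exp(x) / denom for x in scores]
# the two should agree to floating-point precision
```

So the "surgery" can live entirely in an extra (frozen) key/value row rather than a modified softmax kernel.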
- This is still not a widely adopted practice.
Since you seem to have the funding ... why don't you perform the "surgery" described in the ported Flash-Attention-Softmax-N repository? Also, you stated that you would publish more analysis in later papers; this would be ideal for an outlier section :)
Do you plan on doing a writeup of your findings?
It would appear that you reached saturation sometime between 2.5T and 3T tokens, but doesn't that blow far past the typical ~20 tokens-per-parameter ratio that Chinchilla found?
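Back-of-the-envelope arithmetic for that question, assuming TinyLlama's 1.1B parameter count and the Chinchilla ~20 tokens/parameter heuristic:

```python
params = 1.1e9            # TinyLlama parameter count
tokens = 3.0e12           # tokens at the apparent saturation point
chinchilla_ratio = 20.0   # ~20 tokens per parameter (Chinchilla heuristic)

actual_ratio = tokens / params                 # tokens seen per parameter
overshoot = actual_ratio / chinchilla_ratio    # multiple of compute-optimal
# roughly 2700 tokens/param, i.e. ~136x the Chinchilla-optimal budget
```

Chinchilla's ratio is about compute-optimal training, not a saturation ceiling, so overtraining a small model this far can still pay off at inference time; but seeing gains persist to ~136x optimal is the surprising part.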