jzhang38 / TinyLlama

The TinyLlama project is an open endeavor to pretrain a 1.1B Llama model on 3 trillion tokens.
Apache License 2.0

Results vs Chinchilla #123

Closed: SpaceCowboy850 closed this issue 6 months ago

SpaceCowboy850 commented 6 months ago

Do you plan on doing a writeup of your findings?

It would appear that you reached saturation sometime between 2.5T and 3T tokens, but doesn't that blow away the typical ~20x differential that Chinchilla found?
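(For scale, a rough back-of-the-envelope using the commonly cited ~20 tokens-per-parameter Chinchilla heuristic; the numbers below are illustrative, not taken from the training logs.)

```python
# Rough Chinchilla-style estimate (illustrative only; assumes the ~20 tokens/param heuristic).
params = 1.1e9                     # TinyLlama parameter count
chinchilla_tokens = 20 * params    # ~2.2e10, i.e. ~22B tokens
actual_tokens = 3e12               # tokens targeted by this run

print(f"ratio vs Chinchilla-optimal: {actual_tokens / chinchilla_tokens:.0f}x")  # ~136x
```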

jzhang38 commented 6 months ago

For this run, two bugs (1, 2) were identified in the course of training, so we cannot draw a conclusion. We are launching a new training run and will discuss this once that run is finished.

Green-Sky commented 6 months ago

@jzhang38 Are you going to fix the softmax for your next run? (https://www.evanmiller.org/attention-is-off-by-one.html)

[...] traced the existence of these outlier values to the attention mechanism’s softmax function, [...] finding the off-by-one error [...] .

and the corresponding repo: https://github.com/softmax1/Flash-Attention-Softmax-N

edit: more on the topic:
- https://datasciencecastnet.home.blog/2023/08/04/exploring-softmax1-or-community-research-for-the-win/
- https://wandb.ai/capecape/llamac/reports/Training-Tiny-Llamas-for-fun-and-science--Vmlldzo1MDM2MDg0
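For anyone following along: the proposal is to add 1 to the softmax denominator so an attention head can put (near) zero total weight on the sequence instead of being forced to normalize to 1. A minimal sketch of the math (not the Flash-Attention-Softmax-N implementation itself):

```python
import torch

def softmax_1(x: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """"Softmax off-by-one": exp(x_i) / (1 + sum_j exp(x_j)).

    Equivalent to a standard softmax over x with one extra logit frozen at 0,
    which lets an attention head assign (near) zero total weight.
    """
    # Shift by max(x, 0) for numerical stability; the 0 accounts for the implicit extra logit.
    m = torch.clamp(x.max(dim=dim, keepdim=True).values, min=0.0)
    e = torch.exp(x - m)
    return e / (torch.exp(-m) + e.sum(dim=dim, keepdim=True))
```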

abodacs commented 6 months ago

@jzhang38 I heard tricks like this can make training faster.

The Multipack sampler is designed for padding-free distributed training of large language models:

https://github.com/imoneoi/multipack_sampler
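(The linked repo does the packing on the fly and balances it across data-parallel ranks; the sketch below is just a hypothetical first-fit-decreasing packer to illustrate the idea of filling each context window with whole sequences instead of padding.)

```python
def pack_sequences(lengths, max_tokens):
    """Greedy first-fit-decreasing packing: group sequence indices into bins
    so each bin's total length stays <= max_tokens, minimizing wasted padding.

    Simplified illustration only; the real multipack_sampler also balances
    bins across distributed ranks.
    """
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bins, bin_loads = [], []
    for i in order:
        for b, load in enumerate(bin_loads):
            if load + lengths[i] <= max_tokens:
                bins[b].append(i)
                bin_loads[b] += lengths[i]
                break
        else:
            bins.append([i])
            bin_loads.append(lengths[i])
    return bins

# e.g. pack_sequences([1800, 700, 512, 300, 2000], max_tokens=2048)
# -> [[4], [0], [1, 2, 3]]  (each bin fits in one 2048-token context)
```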

jzhang38 commented 6 months ago

@Green-Sky Yeah, thanks for the info; I also read that blog! My concerns with this change are: 1. This is still not a widely adopted practice. 2. It would make TinyLlama incompatible with Llama and thus hard to use TinyLlama with many other frameworks.

Green-Sky commented 6 months ago

@jzhang38

  2. It would make TinyLlama incompatible with Llama and thus hard to use TinyLlama with many other frameworks.

This really depends on how it's done. You could alternatively prefix/postfix a frozen 0, which could even be written to file (slightly wasteful, and it would need to be optimized in code).
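To illustrate one reading of the "frozen 0" idea: appending a single logit fixed at 0 and applying an ordinary softmax gives exactly the +1 denominator, so the softmax op itself never changes. A hypothetical sketch:

```python
import torch
import torch.nn.functional as F

def attn_weights_with_frozen_zero(scores: torch.Tensor) -> torch.Tensor:
    """scores: attention logits with shape (..., seq_len).

    Concatenating a frozen 0 logit and applying a plain softmax yields
    exp(s_i) / (1 + sum_j exp(s_j)), i.e. the softmax off-by-one. The extra
    column is dropped afterwards, so the returned weights may sum to < 1
    while shapes stay Llama-compatible.
    """
    zero = scores.new_zeros(*scores.shape[:-1], 1)   # the frozen 0 logit
    probs = F.softmax(torch.cat([scores, zero], dim=-1), dim=-1)
    return probs[..., :-1]
```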

  1. This is still not a widely adopted practice.

Since you seem to have the funding... why don't you perform a "surgery" as described in the ported Flash-Attention-Softmax-N repository? Also, you stated that you would include more analysis in later papers; this would be ideal for an outlier section :)