jzhang38 / TinyLlama

The TinyLlama project is an open endeavor to pretrain a 1.1B Llama model on 3 trillion tokens.
Apache License 2.0

Saturation / epoch-accuracy plot #132

Closed: rasbt closed this issue 6 months ago

rasbt commented 6 months ago

Thanks for sharing this awesome work (and the paper write-up)! I was wondering if you by chance have a plot similar to the one from the Pythia paper but for all 3 epochs. If so, that would be super interesting and intriguing.

[Image: Pythia_saturation]

jzhang38 commented 6 months ago

We have one in Figure 2 of https://arxiv.org/pdf/2401.02385.pdf. Note that two bugs were found during the run: https://whimsical-aphid-86d.notion.site/Release-of-TinyLlama-1-5T-Checkpoints-Postponed-01b266998c1c47f78f5ae1520196d194?pvs=4 and https://whimsical-aphid-86d.notion.site/Latest-Updates-from-TinyLlama-Team-7d30c01fff794da28ccc952f327c8d4f?pvs=4. So we may not be able to draw conclusive results.

rasbt commented 6 months ago

Thanks, I was a bit confused by this and thought this was something different. Figure 1 shows 3456 GPU hours for TinyLlama, which I assume is for 1 epoch? The 10^4 mark in Figure 2 would then correspond to ~3 epochs?
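
As a quick sanity check of that arithmetic, here is a minimal sketch. It takes the assumption in the question at face value, namely that 3456 GPU hours corresponds to one epoch; nothing in this thread confirms that, and the variable names are only illustrative.

```python
# Assumption (from the question above, NOT confirmed in the thread):
# the 3456 GPU hours shown in Figure 1 cover exactly one epoch.
GPU_HOURS_PER_EPOCH = 3456
EPOCHS = 3

# GPU hours for three epochs under that assumption.
total_gpu_hours = GPU_HOURS_PER_EPOCH * EPOCHS
print(total_gpu_hours)  # 10368

# Relative distance of that total from the 10^4 tick.
relative_gap = (total_gpu_hours - 10**4) / 10**4
print(f"{relative_gap:.1%}")  # 3.7%
```

Under that assumption, three epochs would land within a few percent of the 10^4 mark, which is why the two numbers look like they line up.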

jzhang38 commented 5 months ago

[Image]

Sorry about the confusion in the report. We forgot to cite the figure in this paragraph.

RonanKMcGovern commented 5 months ago

> Thanks for sharing this awesome work (and the paper write-up)! I was wondering if you by chance have a plot similar to the one from the Pythia paper but for all 3 epochs. If so, that would be super interesting and intriguing.

@rasbt what's your takeaway when you consider the Pythia work combined with this TinyLlama work?

rasbt commented 5 months ago

Thanks for clarifying, @jzhang38. If I understand it correctly now, is the following assumption correct?

[Image: Screenshot 2024-01-11 at 10 13 38 AM]

> @rasbt what's your takeaway when you consider the Pythia work combined with this TinyLlama work?

It looks like there's definitely an improvement due to architecture changes and perhaps the dataset :)

Green-Sky commented 5 months ago

@rasbt you did not respect the log scaling of the x-axis.
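
To illustrate why the log scaling matters when reading values off such a plot, here is a small sketch. The decade from 10^3 to 10^4 and the value 5500 are hypothetical examples, not numbers taken from the figure, and the two helper functions are illustrative names rather than part of any plotting library.

```python
import math

def log_axis_fraction(x, lo, hi):
    """Fraction of the distance from lo to hi at which x is drawn on a log10 axis."""
    return (math.log10(x) - math.log10(lo)) / (math.log10(hi) - math.log10(lo))

def value_at_fraction(frac, lo, hi):
    """Value drawn at a given fraction of the distance from lo to hi on a log10 axis."""
    return 10 ** (math.log10(lo) + frac * (math.log10(hi) - math.log10(lo)))

# Hypothetical decade on the x-axis (not read from the actual figure).
lo, hi = 1e3, 1e4

# The visual midpoint of the decade is the geometric mean (about 3162),
# not the arithmetic mean (5500).
print(value_at_fraction(0.5, lo, hi))

# The arithmetic mean 5500 is drawn roughly 74% of the way along the decade.
print(log_axis_fraction(5500, lo, hi))
```

Marking points by eye as if the axis were linear therefore shifts them noticeably: equal visual distances on a log axis correspond to equal ratios, not equal differences.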