@bachvudinh as co-author, please help me add the remaining exploded attempts and the missing MMLU scores.
We ran some tests training Llama 3.2 1B Instruct to check whether fp32 can perform better than bf16.
Precision | Learning Rate | Weight Decay | Global Batch Size | Trained Samples | Final Loss | MMLU |
---|---|---|---|---|---|---|
fp32 | 3e-4 | 0.01 | 96 | 0.2M | 1.24 | |
fp32 | 2.5e-4 | 0.01 | 96 | 0.2M | 1.22 | |
bf16 | 3e-4 | 0.01 | 96 | 0.2M | exploded | |
bf16 | 2e-4 | 0.01 | 96 | 0.2M | exploded | |
bf16 | 2.5e-4 | 0.01 | 96 | 0.2M | 1.26 | |
fp32 | 3e-4 | 0.2 | ? | 0.5M | 0.67 | 25.54 |
fp32 | 1e-4 | 0.05 | ? | 1.7M | 1.32 | 23.18 |
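The training script isn't included in this issue, so purely as a point of reference, here is a minimal sketch of how the fp32 vs bf16 toggle in the runs above might look in a Hugging Face Trainer-style setup. The output dirs and the per-device/accumulation split of the global batch size 96 are assumptions, not the actual configs:

```python
import torch
from transformers import AutoModelForCausalLM, TrainingArguments

# fp32 run: load weights in float32 and leave mixed precision off
model_fp32 = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct", torch_dtype=torch.float32
)
args_fp32 = TrainingArguments(
    output_dir="out-fp32",              # placeholder path
    learning_rate=3e-4,
    weight_decay=0.01,
    per_device_train_batch_size=12,
    gradient_accumulation_steps=8,      # 12 * 8 = 96 global batch (split is a guess)
    bf16=False,                         # default: pure fp32 training
)

# bf16 run: same recipe, but train with bf16 mixed precision
args_bf16 = TrainingArguments(
    output_dir="out-bf16",              # placeholder path
    learning_rate=2.5e-4,
    weight_decay=0.01,
    per_device_train_batch_size=12,
    gradient_accumulation_steps=8,
    bf16=True,
)
```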
For the fp32 config with lr = 1e-4 and weight decay = 0.05, there are some odd MMLU results at checkpoint steps 1000, 2000, and 3000:
step 1000:
step 2000:
step 3000:
- fp32 3e-4
- fp32 2.5e-4
- bf16 2.5e-4
- fp32 0.5M
- fp32 1.7M
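To double-check the weird checkpoint MMLU numbers above, one option is to re-score the intermediate checkpoints with lm-evaluation-harness. This is only an assumption about the eval setup (the exact command isn't stated in the issue), and the checkpoint path is a placeholder:

```python
from lm_eval import evaluator

# Re-score a saved intermediate checkpoint on MMLU (5-shot).
# "out-fp32/checkpoint-1000" is a placeholder for the actual checkpoint directory.
results = evaluator.simple_evaluate(
    model="hf",
    model_args="pretrained=out-fp32/checkpoint-1000,dtype=float32",
    tasks=["mmlu"],
    num_fewshot=5,
    batch_size=8,
)
print(results["results"])
```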
A few pending issues:
Next steps:
cc @0xSage if interested
Problem Statement
Hypothesis: Increasing numerical precision during training can improve the performance of small language models (≈1B parameters), potentially enabling them to achieve capabilities comparable to larger models (3B-7B parameters).
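As a quick illustration of the precision argument (a minimal sketch, not taken from the paper or the runs above): bf16 carries only about 8 bits of mantissa precision, so small weight updates can be rounded away entirely, while fp32 retains them.

```python
import torch

w = torch.tensor(1.0, dtype=torch.bfloat16)
print(w + 1e-3)          # tensor(1., dtype=torch.bfloat16) -- the small update is rounded away
print(w.float() + 1e-3)  # tensor(1.0010) -- fp32 keeps it
```

Whether this numerical effect actually translates into better downstream scores for a ~1B model is what the runs above are trying to measure.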
Implications
If validated, this hypothesis could:
Idea
Reference: https://arxiv.org/pdf/2411.04330