janhq / ichigo

Local realtime voice AI

idea: Precision scaling research #127

Open hahuyhoang411 opened 1 day ago

hahuyhoang411 commented 1 day ago

Problem Statement

Hypothesis: Increasing numerical precision during training can improve the performance of small language models (≈1B parameters), potentially enabling them to achieve capabilities comparable to larger models (3B-7B parameters).

Implications

If validated, this hypothesis could:

Idea

Reference: https://arxiv.org/pdf/2411.04330
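
For context, the rough shape of the scaling law in that paper (my paraphrase; the exact parameterization should be checked against the PDF) is that training at a lower weight precision behaves like shrinking the model's effective parameter count, which is the lever this hypothesis tries to push the other way:

```latex
% Hedged paraphrase of the precision-aware scaling-law form from arXiv:2411.04330.
% A, B, E, alpha, beta, gamma_w are fitted constants; N = parameter count,
% D = training tokens, P_w = training (weight) precision in bits.
L(N, D, P_w) \approx A\, N_{\mathrm{eff}}^{-\alpha} + B\, D^{-\beta} + E,
\qquad N_{\mathrm{eff}} = N \left(1 - e^{-P_w / \gamma_w}\right)
```

Under that form the gain from raising precision saturates as P_w grows, so whether the bf16 → fp32 step still moves N_eff enough to matter for a 1B model is exactly what the runs below are probing.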

hahuyhoang411 commented 1 day ago

@bachvudinh, as co-author, please help me add the exploded attempts and MMLU scores.

We have run some tests training Llama 3.2 1B Instruct to check whether fp32 can perform better than bf16 (a config sketch follows the results table below):

| Precision | Learning Rate | Weight Decay | Global Batch Size | Trained Samples | Final Loss | MMLU |
|-----------|---------------|--------------|-------------------|-----------------|------------|------|
| fp32 | 3e-4   | 0.01 | 96 | 0.2M | 1.24     |       |
| fp32 | 2.5e-4 | 0.01 | 96 | 0.2M | 1.22     |       |
| bf16 | 3e-4   | 0.01 | 96 | 0.2M | exploded |       |
| bf16 | 2e-4   | 0.01 | 96 | 0.2M | exploded |       |
| bf16 | 2.5e-4 | 0.01 | 96 | 0.2M | 1.26     |       |
| fp32 | 3e-4   | 0.2  | ?  | 0.5M | 0.67     | 25.54 |
| fp32 | 1e-4   | 0.05 | ?  | 1.7M | 1.32     | 23.18 |
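
For reference, below is a minimal sketch of how one of these runs could be wired up with the Hugging Face Trainer. This is not the actual Ichigo training script: the dataset is a dummy so the snippet runs standalone, the per-device batch size, accumulation steps, and step count are placeholders, and the bf16 rows are approximated with the Trainer's mixed-precision flag.

```python
# Minimal fp32-vs-bf16 fine-tuning sketch (assumptions: HF Trainer, a dummy text
# dataset, placeholder batch/step settings). Not the actual Ichigo training setup.
import torch
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

USE_BF16 = False  # False -> the fp32 rows above, True -> the bf16 rows
MODEL_ID = "meta-llama/Llama-3.2-1B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16 if USE_BF16 else torch.float32,
)

# Dummy causal-LM data so the sketch runs end to end; the real runs use the
# project's training mixture.
train_dataset = Dataset.from_dict({"text": ["hello world"] * 96}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=64),
    batched=True,
    remove_columns=["text"],
)

args = TrainingArguments(
    output_dir="ckpt-bf16" if USE_BF16 else "ckpt-fp32",
    learning_rate=3e-4,             # swept over {1e-4, 2e-4, 2.5e-4, 3e-4} in the table
    weight_decay=0.01,              # 0.01 / 0.05 / 0.2 depending on the row
    per_device_train_batch_size=8,  # placeholder split of the global batch size of 96
    gradient_accumulation_steps=12, # 8 x 12 = 96 on a single GPU, for example
    bf16=USE_BF16,                  # bf16 mixed precision when True, pure fp32 otherwise
    save_steps=1000,                # keep intermediate checkpoints for MMLU evals
    logging_steps=10,
    max_steps=100,                  # placeholder; real runs cover 0.2M-1.7M samples
    report_to="none",
)

Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```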

For the fp32 training config with lr = 1e-4 and weight decay = 0.05, there are some weird MMLU results at checkpoint steps 1000, 2000, and 3000:
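
One way to double-check those numbers is to re-score the step-1000/2000/3000 checkpoints with EleutherAI's lm-evaluation-harness; a hedged sketch follows (the checkpoint paths, the 5-shot setting, and the exact result-dict keys are assumptions and depend on the harness version used).

```python
# Hedged sketch: re-running MMLU on intermediate checkpoints with
# lm-evaluation-harness (0.4.x Python API). Paths and num_fewshot are assumptions.
from lm_eval import simple_evaluate

for step in (1000, 2000, 3000):
    ckpt = f"ckpt-fp32/checkpoint-{step}"  # placeholder path matching the sketch above
    results = simple_evaluate(
        model="hf",
        model_args=f"pretrained={ckpt},dtype=float32",
        tasks=["mmlu"],
        num_fewshot=5,  # assumption: 5-shot MMLU; the table above does not specify
        batch_size=8,
    )
    # The aggregate key layout varies a little across harness versions, so just
    # print whatever accuracy-style metrics come back for the mmlu group.
    mmlu_metrics = results["results"]["mmlu"]
    print(step, {k: v for k, v in mmlu_metrics.items() if k.startswith("acc")})
```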

hahuyhoang411 commented 1 day ago

fp32 3e-4

[screenshot attached]
hahuyhoang411 commented 1 day ago

fp32 2.5e-4

[screenshot attached]
hahuyhoang411 commented 1 day ago

bf16 2.5e-4

[screenshot attached]
hahuyhoang411 commented 1 day ago

fp32 0.5M

[screenshot attached]
hahuyhoang411 commented 1 day ago

fp32 1.7M

[screenshot attached]
tikikun commented 1 day ago

A few pending issues:

Next steps:

cc @0xSage in case you're interested