@bachvudinh as co-author, please help me add the remaining exploded attempts and the missing MMLU scores.
We ran some tests training Llama 3.2 1B Instruct to check whether fp32 can perform better than bf16.
Precision | Learning Rate | Weight Decay | Global Batch Size | Trained Samples | Final Loss | MMLU |
---|---|---|---|---|---|---|
fp32 | 3e-4 | 0.01 | 96 | 0.2M | 1.24 | |
fp32 | 2.5e-4 | 0.01 | 96 | 0.2M | 1.22 | |
bf16 | 3e-4 | 0.01 | 96 | 0.2M | exploded | |
bf16 | 2e-4 | 0.01 | 96 | 0.2M | exploded | |
bf16 | 2.5e-4 | 0.01 | 96 | 0.2M | 1.26 | |
fp32 | 3e-4 | 0.2 | ? | 0.5M | 0.67 | 25.54 |
fp32 | 1e-4 | 0.05 | ? | 1.7M | 1.32 | 23.18 |
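The training script isn't included in this issue, so purely as a point of reference, here is a minimal sketch of how the fp32 vs bf16 toggle in the runs above might look in a Hugging Face Trainer-style setup. The output dirs and the per-device/accumulation split of the global batch size 96 are assumptions, not the actual configs:

```python
import torch
from transformers import AutoModelForCausalLM, TrainingArguments

# fp32 run: load weights in float32 and leave mixed precision off
model_fp32 = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct", torch_dtype=torch.float32
)
args_fp32 = TrainingArguments(
    output_dir="out-fp32",              # placeholder path
    learning_rate=3e-4,
    weight_decay=0.01,
    per_device_train_batch_size=12,
    gradient_accumulation_steps=8,      # 12 * 8 = 96 global batch (split is a guess)
    bf16=False,                         # default: pure fp32 training
)

# bf16 run: same recipe, but train with bf16 mixed precision
args_bf16 = TrainingArguments(
    output_dir="out-bf16",              # placeholder path
    learning_rate=2.5e-4,
    weight_decay=0.01,
    per_device_train_batch_size=12,
    gradient_accumulation_steps=8,
    bf16=True,
)
```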
For the fp32 config with lr = 1e-4 and weight decay = 0.05, there are some odd MMLU results at checkpoint steps 1000, 2000, and 3000:
step 1000:
step 2000:
step 3000:
- fp32 3e-4
- fp32 2.5e-4
- bf16 2.5e-4
- fp32 0.5M
- fp32 1.7M
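To double-check the weird checkpoint MMLU numbers above, one option is to re-score the intermediate checkpoints with lm-evaluation-harness. This is only an assumption about the eval setup (the exact command isn't stated in the issue), and the checkpoint path is a placeholder:

```python
from lm_eval import evaluator

# Re-score a saved intermediate checkpoint on MMLU (5-shot).
# "out-fp32/checkpoint-1000" is a placeholder for the actual checkpoint directory.
results = evaluator.simple_evaluate(
    model="hf",
    model_args="pretrained=out-fp32/checkpoint-1000,dtype=float32",
    tasks=["mmlu"],
    num_fewshot=5,
    batch_size=8,
)
print(results["results"])
```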
A few pending issues:
Next steps:
cc @0xSage if interested
Problem Statement
Hypothesis: Increasing numerical precision during training can improve the performance of small language models (≈1B parameters), potentially enabling them to achieve capabilities comparable to larger models (3B-7B parameters).
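As a quick illustration of the precision argument (a minimal sketch, not taken from the paper or the runs above): bf16 carries only about 8 bits of mantissa precision, so small weight updates can be rounded away entirely, while fp32 retains them.

```python
import torch

w = torch.tensor(1.0, dtype=torch.bfloat16)
print(w + 1e-3)          # tensor(1., dtype=torch.bfloat16) -- the small update is rounded away
print(w.float() + 1e-3)  # tensor(1.0010) -- fp32 keeps it
```

Whether this numerical effect actually translates into better downstream scores for a ~1B model is what the runs above are trying to measure.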
Implications
If validated, this hypothesis could:
Idea
Reference: https://arxiv.org/pdf/2411.04330