ml-explore / mlx-examples

Examples in the MLX framework
MIT License

Loss nan for phi-3 #718

Closed: l0d0v1c closed this issue 4 months ago

l0d0v1c commented 4 months ago

When I try to fine-tune Phi-3 (Phi-3-mini-128k-instruct-8bit) I get the same issue I previously had with Mixtral: the loss is NaN.

Trainable parameters: 0.042% (1.573M/3750.282M)
Loading datasets
Training
Starting training..., iters: 100
Iter 1: Val loss nan, Val took 61.907s
Iter 10: Train loss nan, Learning Rate 1.000e-05, It/sec 0.264, Tokens/sec 530.838, Trained Tokens 20105, Peak mem 16.146 GB

alexC-nonsense4k commented 4 months ago

May I ask whether you fine-tune by writing your own code or use mlx_lm directly? I've recently been writing a model file for phi3, so perhaps I can help you solve this issue.

l0d0v1c commented 4 months ago

Thanks AlexC, I used python mlx_lm.lora -m ... The generate module works fine, but the lora one does not.
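
For reference, a hedged sketch of what the elided command likely expands to; the model path and data directory below are placeholders, and exact flag names can differ across mlx_lm versions:

python -m mlx_lm.lora \
    --model <path-to-Phi-3-mini-128k-instruct-8bit> \
    --train \
    --data <directory-with-train-and-valid-jsonl-files> \
    --iters 100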

alexC-nonsense4k commented 4 months ago

May I ask about your specific config file or config settings when fine-tuning with LoRA? I tried using Phi-3-mini-4k-instruct for LoRA fine-tuning today, and the loss was normal.
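
(For context, mlx_lm.lora can also read its settings from a YAML config file. The snippet below is only an illustrative sketch with placeholder values; the exact keys and defaults vary between mlx_lm versions.)

# Illustrative LoRA config sketch for mlx_lm.lora; paths are placeholders.
model: "<path-to-Phi-3-mini-128k-instruct-8bit>"
train: true
data: "<directory-with-train-and-valid-jsonl-files>"
lora_layers: 16
batch_size: 4
iters: 100
learning_rate: 1e-5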

awni commented 4 months ago

Hi, this should be fixed in the latest MLX: https://github.com/ml-explore/mlx/pull/1028. There was an issue with quantizing blocks of all zeros, which produced NaNs and was exposed by Phi-3.
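
For context, here is a minimal sketch (plain NumPy, not MLX's actual quantization kernel) of how an all-zero weight block can turn into NaNs under affine quantization: the block's scale works out to zero, and dividing by it gives 0/0 = NaN, which then propagates through training.

import numpy as np

def quantize_block(w, bits=8):
    # Affine quantization of one block of weights (illustrative only).
    w_min, w_max = w.min(), w.max()
    scale = (w_max - w_min) / (2**bits - 1)  # 0.0 when the block is all zeros
    q = np.round((w - w_min) / scale)        # 0/0 -> NaN (NumPy warns here)
    return q, scale, w_min

q, scale, zero_point = quantize_block(np.zeros(64, dtype=np.float32))
print(scale, q[:4])  # 0.0 [nan nan nan nan]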

Note that to get it to work, we will need to re-quantize Phi-3 from the original weights using the latest MLX (so either build from source or wait until we release).
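
A hedged sketch of what that re-quantization could look like with mlx_lm's convert entry point once an up-to-date MLX is installed; the output path is a placeholder and flag names may vary by version:

python -m mlx_lm.convert \
    --hf-path microsoft/Phi-3-mini-128k-instruct \
    --mlx-path <output-directory> \
    -q --q-bits 8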

kishoretvk commented 4 months ago

> May I ask about your specific config file or config settings when fine-tuning with LoRA? I tried using Phi-3-mini-4k-instruct for LoRA fine-tuning today, and the loss was normal.

Could you share your config and loss metrics? lora/lora.py errors out for me.

l0d0v1c commented 4 months ago

@kishoretvk the last update solved the issue.