princeton-nlp / SimPO

SimPO: Simple Preference Optimization with a Reference-Free Reward

'loss': 0.0, 'grad_norm': nan #35

Closed: wujia11 closed this issue 1 month ago

wujia11 commented 2 months ago

```
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 2.5784903139612557e-08, 'rewards/chosen': nan, 'rewards/rejected': nan, 'rewards/accuracies': 0.0, 'rewards/margins': nan, 'logps/rejected': nan, 'logps/chosen': nan, 'logits/rejected': nan, 'logits/chosen': nan, 'epoch': 0.01}
```

When I am training, the loss is always 0 and never changes, and `grad_norm` is nan. Has anyone else encountered a similar situation? Here is my training configuration:

```yaml
bf16: true
beta: 2.5
gamma: 1.4
do_eval: true
evaluation_strategy: steps
eval_steps: 400
gradient_accumulation_steps: 1
gradient_checkpointing: False
gradient_checkpointing_kwargs:
  use_reentrant: False
hub_model_id: simpo-exps
learning_rate: 2e-7
log_level: info
logging_steps: 2
lr_scheduler_type: cosine
max_length: 1024
max_prompt_length: 1024
num_train_epochs: 1
```

Can you explain this and suggest a solution? I would appreciate it.

lh0x00 commented 1 month ago

I think you should set `max_prompt_length < max_length` (e.g. `max_length = max_prompt_length + gap`, where the gap leaves room for the response tokens). With `max_prompt_length == max_length == 1024`, the completion can be truncated away entirely, which would explain the nan log-probabilities and the constant zero loss.
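For intuition, here is a minimal sketch (hypothetical tensors, not SimPO's actual trainer code) of how a length-normalized log-probability turns into nan once truncation leaves zero completion tokens:

```python
import torch

# Per-token log-probs for one sequence; loss_mask marks which
# positions belong to the completion (1) vs. the prompt (0).
per_token_logps = torch.tensor([-1.2, -0.8, -2.1, -0.5])

# With max_prompt_length == max_length, the prompt fills the whole
# window and no completion token survives truncation.
loss_mask = torch.zeros(4)

# Average log-prob of the completion: 0 / 0 = nan, which then
# propagates into the rewards/* and logps/* entries in the logs.
avg_logps = (per_token_logps * loss_mask).sum(-1) / loss_mask.sum(-1)
print(avg_logps)  # tensor(nan)
```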
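A possible fix, keeping the rest of the config unchanged (the exact numbers are placeholders; choose values that fit your prompt and response lengths):

```yaml
max_prompt_length: 512   # leave room for the response
max_length: 1024         # total prompt + response budget
```

Alternatively, keep `max_prompt_length: 1024` and raise `max_length` (e.g. to 2048) so completions are not truncated to zero tokens.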