huggingface / alignment-handbook

Robust recipes to align language models with human and AI preferences
https://huggingface.co/HuggingFaceH4
Apache License 2.0

Reproducing SFT results. #27

Open tcapelle opened 11 months ago

tcapelle commented 11 months ago

I was looking at the logs of your training (from this json file) and realized that the learning-rate scheduling is messed up.

It's caused by the ConstantLengthDataset not reporting its actual length. When I train this model, the progress bar and the total number of iterations are computed from the underlying H4 dataset (around 208k samples) instead of the packed version, which has around 139k packed sequences of 2048 tokens. This affects the scheduler, which ends up performing no warmup at all. I have an 8xA100 node, so I am running 2x gradient accumulation for an effective batch size of 512.
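A minimal sketch of the mismatch, in case it helps anyone reproduce it. The model id and the `text` field are illustrative (the handbook renders a chat template into the text first), and iterating the packed dataset tokenizes everything once, so the count is slow but exact:

```python
from datasets import load_dataset
from transformers import AutoTokenizer
from trl.trainer import ConstantLengthDataset

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
raw = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")

# ConstantLengthDataset tokenizes and concatenates examples, then slices
# the token stream into fixed-length 2048-token sequences.
packed = ConstantLengthDataset(
    tokenizer,
    raw,
    dataset_text_field="text",  # assumes the chat template was already applied
    seq_length=2048,
    infinite=False,
)

print(len(raw))                # ~208k raw samples -> what the Trainer sees
print(sum(1 for _ in packed))  # ~139k packed sequences -> the real basis for step counts
```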

~~It would be beneficial to have access to the training logs.~~ I found them on TensorBoard :(

You can follow my training here: https://wandb.ai/capecape/zephyr/runs/zhfrhnr5

PS: when using trl, I manually compute the total number of training steps beforehand so I can pass the correct warmup steps to the scheduler (see the sketch below). I know the ConstantLengthDataset is a generator that yields sequences without knowing beforehand how many it will produce.
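Here is a hedged sketch of that workaround, reusing the packed count from the snippet above; the per-device batch size and the 10% warmup ratio are illustrative, not the handbook's exact config:

```python
from transformers import TrainingArguments

num_packed = 139_000   # from the count above: sum(1 for _ in packed)
num_gpus = 8           # 8xA100 node
per_device_bs = 32     # illustrative: 32 * 2 * 8 = 512 effective batch size
grad_accum = 2
epochs = 1

effective_bs = per_device_bs * grad_accum * num_gpus   # 512
max_steps = num_packed * epochs // effective_bs        # true number of optimizer steps
warmup_steps = int(0.1 * max_steps)                    # e.g. 10% warmup

args = TrainingArguments(
    output_dir="zephyr-sft",
    per_device_train_batch_size=per_device_bs,
    gradient_accumulation_steps=grad_accum,
    lr_scheduler_type="cosine",
    max_steps=max_steps,        # overrides the length the Trainer would infer
    warmup_steps=warmup_steps,  # now computed against the packed horizon
)
```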

edbeeching commented 11 months ago

Thanks, perhaps we missed the warmup_steps when we copied the config over from our internal repo @lewtun?

Yes, there is a known bug with the ConstantLengthDataset (https://github.com/huggingface/trl/issues/943). I plan on fixing it next week.

tcapelle commented 11 months ago

I re-ran both SFT and DPO here in case you want to check. I manually passed the total training steps and added the missing warmup. Also, if you look at the logs from the original SFT run, it stopped early: the cosine schedule was cut off at 0.67 epochs, which is just the packed-to-raw ratio (139k / 208k ≈ 0.67), so the packed data ran out before the schedule finished.

https://wandb.ai/capecape/zephyr?workspace=user-capecape

jwkirchenbauer commented 9 months ago

FWIW, I tried running the SFT example the other day and saw this same issue. @lvwerra suggested that this should be fixed by https://github.com/huggingface/trl/pull/979. You need to bump up to trl==0.7.5 or newer to get the packed dataset length fix.
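A quick sanity check after upgrading, reusing `raw`, `tokenizer`, and `args` from the sketches above, and assuming the fixed SFTTrainer materializes the packed examples so the train dataset reports its true length:

```python
from trl import SFTTrainer  # trl >= 0.7.5 for the packing-length fix

trainer = SFTTrainer(
    model="mistralai/Mistral-7B-v0.1",  # illustrative model id
    args=args,
    train_dataset=raw,
    tokenizer=tokenizer,
    dataset_text_field="text",
    packing=True,
    max_seq_length=2048,
)

# With the fix, this should be the packed count (~139k), not the raw ~208k,
# so warmup and the cosine decay span the right number of steps.
print(len(trainer.train_dataset))
```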