huggingface / nanotron

Minimalistic large language model 3D-parallelism training
Apache License 2.0

Multi-node PP hang when gradient accumulation is enabled #209

Open yuuxiaooqingg opened 1 month ago

yuuxiaooqingg commented 1 month ago

I tested continued training of Llama 3 across multiple nodes with TP=4, PP=2, DP=2. If I enable gradient accumulation, training hangs. The experimental environment is 16x H800, torch 2.1.2+cu121.

    checkpoints:
      checkpoint_interval: 200
      checkpoints_path: exps/llama3_ct/ckpts
      checkpoints_path_is_shared_file_system: true
      resume_checkpoint_path: pretrained_models/Meta-Llama-3-8B
      save_initial_state: false
      no_load_optim: true

data_stages:
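For context, here is a rough sketch of what the parallelism and gradient-accumulation sections of a nanotron config look like for a TP=4 / PP=2 / DP=2 run; the accumulation, micro-batch, and sequence-length values are illustrative assumptions, not the exact ones from this report:

```yaml
parallelism:
  dp: 2            # data-parallel replicas
  pp: 2            # pipeline-parallel stages
  tp: 4            # tensor-parallel degree
  pp_engine: 1f1b

tokens:
  # Gradient accumulation is controlled here; the values below are assumptions.
  batch_accumulation_per_replica: 4
  micro_batch_size: 2
  sequence_length: 8192
```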

yuuxiaooqingg commented 1 month ago

@xrsrke Do you know what might be causing this?

xrsrke commented 1 month ago

@yuuxiaooqingg Hello. At which step did it hang?

Pclanglais commented 2 weeks ago

Same issue here. It hangs just before starting. Seems like gradient communication is not going well…

(tested on 48x4 H100s)

Pclanglais commented 1 week ago

Not sure if it applies in this case, but I've found a fix: disable ZeRO entirely. ZeRO stage 1 works somewhat with replicas in a small-number-of-nodes setting (not 2 nor 3), but it also stops working once the distributed training setup gets large.
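For anyone else hitting this, a minimal sketch of what "disable ZeRO" means in the nanotron optimizer config, assuming the standard zero_stage field; the other optimizer values are placeholders, not taken from my run:

```yaml
optimizer:
  zero_stage: 0               # 0 disables ZeRO sharding; stage 1 was the setting that hung at scale for us
  accumulate_grad_in_fp32: true
  learning_rate_scheduler:
    learning_rate: 3.0e-4     # placeholder value
```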