pytorch / PiPPy

Pipeline Parallelism for PyTorch
BSD 3-Clause "New" or "Revised" License

DDP + CUDA gives "Gradients not close" #165

Open kwen2501 opened 2 years ago

kwen2501 commented 2 years ago

Seen at commit 8d9770 (may occur earlier). The failure is intermittent.

Test:

python local_test_ddp.py

Log:

Traceback (most recent call last):
  File "/home/kw2501/PiPPy/test/local_test_ddp.py", line 245, in <module>
    mp.spawn(run_worker, args=(args.world_size, args,), nprocs=args.world_size, join=True)
  File "/home/kw2501/pytorch/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/kw2501/pytorch/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/home/kw2501/pytorch/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/kw2501/pytorch/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/kw2501/PiPPy/test/local_test_ddp.py", line 213, in run_worker
    run_master(args, pp_ranks_per_dp_group[rank])
  File "/home/kw2501/PiPPy/test/local_test_ddp.py", line 167, in run_master
    raise AssertionError(f'Gradients not close: {not_close_grads}')
AssertionError: Gradients not close: ['split_gm.submod_0.moved_module_mm_param', 'split_gm.submod_1.moved_module_lin_w']
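
For context, the assertion comes from a gradient-comparison step in the test: after the pipelined run and a reference (non-pipelined) run of the same model, gradients are compared parameter by parameter. A minimal sketch of that kind of check (hypothetical helper and name-keyed parameter dicts, not the actual local_test_ddp.py code):

import torch

def find_not_close_grads(pipe_params, ref_params, rtol=1e-5, atol=1e-8):
    # pipe_params / ref_params are assumed to be name -> Parameter dicts
    # with matching keys, e.g. dict(model.named_parameters()).
    not_close = []
    for name, pipe_param in pipe_params.items():
        ref_param = ref_params[name]
        if not torch.allclose(pipe_param.grad, ref_param.grad, rtol=rtol, atol=atol):
            not_close.append(name)
    return not_close

not_close_grads = find_not_close_grads(pipe_params, ref_params)
if len(not_close_grads) != 0:
    raise AssertionError(f'Gradients not close: {not_close_grads}')
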
jamesr66a commented 2 years ago

Repro with slurm:

local_test_wrapper.sh

#! /bin/bash
export MASTER_PORT=29500
export MASTER_ADDR=$(scontrol show hostname ${SLURM_NODELIST} | head -n 1)
export LOCAL_RANK=${SLURM_LOCALID}
export CUDA_VISIBLE_DEVICES=${SLURM_LOCALID}
export WORLD_SIZE=${SLURM_NTASKS}
export RANK=${SLURM_PROCID}

python -u test/local_test_ddp.py

srun -N2 -p train --gpus-per- --ntasks-per-node=6 --gpus-per-task=1 ./local_test_wrapper.sh
jamesr66a commented 2 years ago

This is probably a synchronization issue between DDP and _sync_replicated_params: https://github.com/pytorch/PiPPy/blob/eeed967acad5675523935b1e28f42113c6eb7540/pippy/PipelineDriver.py#L1009
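
If that is the case, the race would be the replicated-parameter sync reading .grad tensors before DDP's asynchronous allreduce has finished writing them. A minimal sketch of that kind of sync step (a hypothetical helper, not PiPPy's actual _sync_replicated_params), assuming the replicated parameters' gradients should be averaged across the ranks that hold a copy:

import torch
import torch.distributed as dist

def sync_replicated_param_grads(replicated_params, group=None):
    # DDP launches its gradient allreduce asynchronously; waiting for
    # outstanding CUDA work before touching .grad is one blunt way to
    # rule out the suspected ordering race.
    torch.cuda.synchronize()
    world_size = dist.get_world_size(group)
    for param in replicated_params:
        if param.grad is None:
            continue
        # Average each replicated gradient across the group so every
        # copy of the parameter sees the same update.
        dist.all_reduce(param.grad, op=dist.ReduceOp.SUM, group=group)
        param.grad.div_(world_size)

If adding an explicit synchronization like this makes the intermittent mismatch disappear, that would support the ordering hypothesis.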