Open kwen2501 opened 2 years ago
Repro with Slurm:
local_test_wrapper.sh
#! /bin/bash
export MASTER_PORT=29500
export MASTER_ADDR=$(scontrol show hostname ${SLURM_NODELIST} | head -n 1)
export LOCAL_RANK=${SLURM_LOCALID}
export CUDA_VISIBLE_DEVICES=${SLURM_LOCALID}
export WORLD_SIZE=${SLURM_NTASKS}
export RANK=${SLURM_PROCID}
python -u test/local_test_ddp.py
srun -N2 -p train --gpus-per- --ntasks-per-node=6 --gpus-per-task=1 ./local_test_wrapper.sh
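For readers less familiar with Slurm, the variable mapping done by the wrapper can be sketched in Python. This is only an illustration, not part of the repro: the function name `slurm_to_torch_env` is hypothetical, and the `scontrol` hostname lookup is replaced by a `master_addr` argument so the snippet runs without a cluster.

```python
def slurm_to_torch_env(slurm_env, master_addr, master_port=29500):
    """Mirror the exports in local_test_wrapper.sh (hypothetical helper).

    slurm_env:   dict holding the per-task Slurm variables
    master_addr: first hostname of the allocation (the wrapper gets it from scontrol)
    """
    local_id = slurm_env["SLURM_LOCALID"]
    return {
        "MASTER_PORT": str(master_port),
        "MASTER_ADDR": master_addr,
        # Each task uses its node-local ID both as LOCAL_RANK and as its GPU index.
        "LOCAL_RANK": local_id,
        "CUDA_VISIBLE_DEVICES": local_id,
        "WORLD_SIZE": slurm_env["SLURM_NTASKS"],
        "RANK": slurm_env["SLURM_PROCID"],
    }


# Made-up values for one task of a 2-node, 6-tasks-per-node job (12 tasks total).
example = slurm_to_torch_env(
    {"SLURM_LOCALID": "1", "SLURM_NTASKS": "12", "SLURM_PROCID": "7"},
    master_addr="node-0",  # placeholder hostname
)
```

With these inputs, the task ends up with `RANK=7`, `LOCAL_RANK=1` and `WORLD_SIZE=12`, the same values the wrapper would export for that task.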
This is probably a synchronization issue between DDP and _sync_replicated_params: https://github.com/pytorch/PiPPy/blob/eeed967acad5675523935b1e28f42113c6eb7540/pippy/PipelineDriver.py#L1009
Seen at 8d9770 (may occur earlier). The failure is intermittent.
Test:
Log: