Closed by jpgard 7 months ago
A couple of additional comments: I tried setting NCCL_DEBUG=INFO in the launcher,

```bash
export LAUNCHER="NCCL_DEBUG=INFO torchrun --nproc_per_node=$GPUS_PER_NODE --nnodes=$NNODES --node_rank=\$SLURM_PROCID --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT "
```

but it raises the same error.

What kind of GPUs are these?
They are 40GB A100s, in nodes of 8
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hi, I have the same problem. Did you find any way to fix it?
System Info
Information
Tasks
An officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)

Reproduction
I'm attempting to train a model with multi-node training, using the SLURM scheduler. I am launching the job on 2 nodes with 8 GPUs each. My training script runs fine in a single-node environment with FSDP, and it starts fine in the multi-node setting, right up until actual communication is required between the nodes.
When the script gets to the part that actually initializes multi-node training, the processes appear to have trouble communicating across nodes. I can see the logging output from all 16 processes, the data is loaded, etc., but the script fails at accelerator.prepare(). Specifically, the stack trace contains these lines (the complete stack trace is below):
Note that it is possible I have misconfigured the accelerate config or the SLURM settings (task/node counts, etc.), but based on the example here, with the corresponding FSDP config here, this seems to be set up correctly to me.
Any thoughts would be appreciated. I've tried lots of different configurations and tinkered with the environment to make sure the versions of PyTorch/NCCL/accelerate are all compatible.
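For example, the versions on each node can be compared with a few generic commands like these (nothing here is specific to my setup):

```bash
# Generic version checks -- run on every node and compare the output
python -c "import torch; print('torch', torch.__version__)"
python -c "import torch; print('nccl', torch.cuda.nccl.version())"
python -c "import accelerate; print('accelerate', accelerate.__version__)"
accelerate env   # accelerate's built-in environment summary
```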
Contents of fsdp_config_base.yaml I am using:
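For context, a representative accelerate FSDP config for a 2-node, 16-process job looks roughly like the following; apart from num_machines and num_processes, the values are illustrative rather than the actual contents of the file:

```yaml
# Representative sketch only -- not the actual fsdp_config_base.yaml from this issue
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
mixed_precision: bf16            # illustrative
num_machines: 2                  # 2 nodes
num_processes: 16                # 8 GPUs per node x 2 nodes
machine_rank: 0
main_training_function: main
use_cpu: false
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP   # illustrative
  fsdp_backward_prefetch_policy: BACKWARD_PRE     # illustrative
  fsdp_offload_params: false
  fsdp_sharding_strategy: 1                       # 1 = FULL_SHARD
  fsdp_state_dict_type: FULL_STATE_DICT
```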
Relevant chunks of the sbatch script I am launching the job with:
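For reference, a typical sbatch layout for this kind of torchrun-per-node launch (consistent with the LAUNCHER line quoted earlier) is sketched below; the #SBATCH values, MASTER_PORT, and the train.py entry point are placeholders rather than the actual script:

```bash
#!/bin/bash
#SBATCH --job-name=fsdp-multinode   # placeholder values throughout
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1         # one launcher task per node; torchrun spawns the 8 per-GPU workers
#SBATCH --gres=gpu:8

export GPUS_PER_NODE=8
export NNODES=$SLURM_NNODES
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500            # placeholder port

# Launcher from the comment above: \$SLURM_PROCID is escaped so each
# srun task resolves its own node rank at run time.
export LAUNCHER="NCCL_DEBUG=INFO torchrun --nproc_per_node=$GPUS_PER_NODE --nnodes=$NNODES --node_rank=\$SLURM_PROCID --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT "

# train.py is a placeholder for the real training entry point
srun bash -c "$LAUNCHER train.py"
```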
Full stack trace:
Expected behavior
I expect training to work in the distributed setting just as it does in the single-node setting.