LLM training fails with MPI

f-hafner commented 1 month ago

When running pretrain.py with 1 or 4 GPUs and the DDPStrategy as described in the docs, I get the following error:

"PyTorch/1.12.0-foss-2022a-CUDA-11.7.0/lib/python3.10/.../torch/distributed/utils.py", line 113, in _sync_module_states
    _sync_params_and_buffers(
"PyTorch/1.12.0-foss-2022a-CUDA-11.7.0/lib/python3.10/.../torch/distributed/utils.py", line 131 in _sync_params_and_buffers
    dist._broadcast_coalesced(
IndexError: map::at
srun: error: tasks 0, 2-3: exited with exit code 1

This is related to MPI: in pretrain.py, we're using the following strategy:

ddp = DDPStrategy(process_group_backend="mpi")

We use the following bash scripts:
https://github.com/odissei-lifecourse/life-sequencing-dutch/blob/main/code/llm/slurm_scripts/pretrain_multigpu.sh
https://github.com/odissei-lifecourse/life-sequencing-dutch/blob/main/code/llm/slurm_scripts/pretrain.sh

f-hafner commented 1 month ago

This definitely seems to be an issue with MPI. When we run with 1 GPU and use the default strategy (i.e., do not specify strategy=DDPStrategy(process_group_backend="mpi")), the code works fine.
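To narrow this down further, the backend could be made switchable so that the same script can be run with "mpi" and with another backend for comparison. The following is only a sketch under assumptions: `choose_backend`, `detect_available` and the `DDP_BACKEND` environment variable are hypothetical names that do not exist in pretrain.py, and the preference order nccl, gloo, mpi is an arbitrary choice for illustration.

```python
import os

# Backend names accepted by torch.distributed. Per the PyTorch docs, "mpi" is
# only available when PyTorch was built from source against an MPI install.
KNOWN_BACKENDS = ("nccl", "gloo", "mpi")


def choose_backend(available, requested=None):
    """Pick a process-group backend name.

    available: dict mapping backend name -> bool
    requested: explicit choice (e.g. from an environment variable), or None
    """
    if requested is not None:
        if requested not in KNOWN_BACKENDS:
            raise ValueError(f"unknown backend: {requested!r}")
        if not available.get(requested, False):
            raise RuntimeError(f"backend {requested!r} is not available")
        return requested
    # No explicit request: fall back to the first available backend.
    for name in KNOWN_BACKENDS:
        if available.get(name, False):
            return name
    raise RuntimeError("no distributed backend available")


def detect_available():
    """Ask torch.distributed which backends this PyTorch build supports."""
    nothing = {name: False for name in KNOWN_BACKENDS}
    try:
        import torch.distributed as dist
    except ImportError:
        return nothing
    if not dist.is_available():
        return nothing
    return {
        "nccl": dist.is_nccl_available(),
        "gloo": dist.is_gloo_available(),
        "mpi": dist.is_mpi_available(),
    }


# Hypothetical use in pretrain.py (DDP_BACKEND is an assumed variable name):
#   backend = choose_backend(detect_available(), os.environ.get("DDP_BACKEND"))
#   ddp = DDPStrategy(process_group_backend=backend)
```

If the run succeeds on multiple GPUs with a non-MPI backend and fails only with "mpi", that would support the suspicion that the MPI build is the culprit rather than the training code.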

f-hafner commented 1 month ago

Question from SURF: why do we not use the 2023 software stack? It's compiled against a more recent version of MPI and should be better supported.

We can try to move to the 2023 stack. @dakota0064, @tanzir5, do you see any reason right now why this might not be possible?

benczaja commented 3 days ago

@f-hafner Do you have publicly available data I can try this out with? I am now trying this on the public Snellius, but the data described in projects/dutch_real/pretrain_cfg.json

{
  "HPARAMS_PATH": "src/new_code/regular_hparams.txt",
  "CHECKPOINT_DIR": "projects/dutch_real/2017_checkpoints_0/",
  "MLM_PATH": "projects/dutch_real/gen_data/mlm_encoded_upto_2017",
  "MAX_EPOCHS": 30,
  "BATCH_SIZE": 192
}

is not accessible outside the OSSC. I would like to try this on my own outside the OSSC.

I can always try this within the OSSC myself, but I don't want to screw up your data.
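For anyone else trying to reproduce this outside the OSSC, a quick check of which configured paths are actually present might look like the sketch below. It assumes that the three path-like keys in pretrain_cfg.json are resolved relative to some root directory (that is a guess, not something confirmed in this thread); `missing_paths` and `load_cfg` are hypothetical helpers, not part of the repository.

```python
import json
from pathlib import Path

# Keys from pretrain_cfg.json that point at files or directories.
PATH_KEYS = ("HPARAMS_PATH", "CHECKPOINT_DIR", "MLM_PATH")


def load_cfg(path):
    """Read the JSON config file into a dict."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def missing_paths(cfg, root="."):
    """Return {key: configured path} for every path key that does not exist.

    Assumption: paths in the config are relative to `root`.
    """
    root = Path(root)
    return {
        key: cfg[key]
        for key in PATH_KEYS
        if key in cfg and not (root / cfg[key]).exists()
    }


# Hypothetical use:
#   cfg = load_cfg("projects/dutch_real/pretrain_cfg.json")
#   for key, value in missing_paths(cfg).items():
#       print(f"{key}: {value} not found")
```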

f-hafner commented 19 hours ago

Hi @benczaja, I don't have any publicly available data at the moment, but I am working on it. I'll let you know when it is ready (~1.5-2 weeks from now). See #40.

LLM training fails with MPI #23