Open · f-hafner opened this issue 1 month ago
This definitely seems to be an issue with MPI. When we run with 1 GPU and use the default strategy (i.e., do not specify strategy=DDPStrategy(process_group_backend="mpi")), the code works fine.
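For context, a minimal sketch of the two configurations being compared, assuming a standard Lightning Trainer (illustrative only, not the exact code in pretrain.py):

import pytorch_lightning as pl
from pytorch_lightning.strategies import DDPStrategy

# Works: single GPU, let Lightning pick the default strategy/backend.
trainer_ok = pl.Trainer(accelerator="gpu", devices=1)

# Fails on our setup: same run, but routing DDP's process group through MPI.
trainer_mpi = pl.Trainer(
    accelerator="gpu",
    devices=1,
    strategy=DDPStrategy(process_group_backend="mpi"),
)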
Question from SURF: why do we not use the 2023 software stack? It's compiled against a more recent version of MPI and should be better supported.
We can try to move to the 2023 stack. @dakota0064, @tanzir5, do you see any reason right now why this might not be possible?
@f-hafner Do you have publicly available data I can try this out with?
I am trying this now on the public Snellius, but the data referenced in projects/dutch_real/pretrain_cfg.json
{
"HPARAMS_PATH": "src/new_code/regular_hparams.txt",
"CHECKPOINT_DIR": "projects/dutch_real/2017_checkpoints_0/",
"MLM_PATH": "projects/dutch_real/gen_data/mlm_encoded_upto_2017",
"MAX_EPOCHS": 30,
"BATCH_SIZE": 192
}
is not accessible outside the OSSC. I would like to try this on my own, outside the OSSC.
I can always try this within the OSSC myself, but I don't want to screw up your data.
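(For reference, a rough sketch of how a config like the one above could be consumed; the key names come from the JSON, but the loading code is an assumption, not the actual pretrain.py logic.)

import json

# Hypothetical loader for the config shown above.
with open("projects/dutch_real/pretrain_cfg.json") as f:
    cfg = json.load(f)

hparams_path = cfg["HPARAMS_PATH"]      # hyperparameter file
checkpoint_dir = cfg["CHECKPOINT_DIR"]  # where checkpoints are written
mlm_path = cfg["MLM_PATH"]              # encoded training data; OSSC-only
max_epochs = cfg["MAX_EPOCHS"]          # 30
batch_size = cfg["BATCH_SIZE"]          # 192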
Hi @benczaja, I don't have any publicly available data at the moment, but I am working on it. I'll let you know when it is ready (~1.5-2 weeks from now). See #40.
When running pretrain.py with 1 or 4 GPUs and the DDPStrategy as described in the docs, I get the following error.

This is related to MPI: in pretrain.py, we're using the following strategy:

We use the following bash scripts:
https://github.com/odissei-lifecourse/life-sequencing-dutch/blob/main/code/llm/slurm_scripts/pretrain_multigpu.sh
https://github.com/odissei-lifecourse/life-sequencing-dutch/blob/main/code/llm/slurm_scripts/pretrain.sh
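For readers without repo access, a sketch of what the multi-GPU setup described above roughly looks like, following the pattern from the Lightning docs; the exact arguments in pretrain.py may differ:

import pytorch_lightning as pl
from pytorch_lightning.strategies import DDPStrategy

# Multi-GPU DDP with the MPI process-group backend, as referenced above.
# devices=4 matches the 4-GPU runs; the same error also appears with devices=1.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,
    strategy=DDPStrategy(process_group_backend="mpi"),
)
# trainer.fit(model, datamodule=data)  # model and data come from pretrain.py

The linked bash scripts then submit this job through SLURM (presumably via sbatch/srun).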