I tried the slurm pretrain script you provided as-is. The only differences are that I reduced the model parameters and replaced `srun` with `python`. It works fine in a single-GPU environment (`num_devices=1`). However, errors occur intermittently during training in a multi-GPU environment.