kuleshov-group / caduceus

Bi-Directional Equivariant Long-Range DNA Sequence Modeling
Apache License 2.0

"Unexpected segmentation fault encountered in worker" in multi-gpu training #31

Closed yangzhao1230 closed 1 month ago

yangzhao1230 commented 1 month ago

I ran the Slurm pretraining script you provided as is. The only differences are that I reduced the model parameters and replaced srun with python. It works fine in a single-GPU environment (num_devices=1). However, errors occasionally occur during training in a multi-GPU environment.