DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
[BUG] [ERROR] [launch.py:321:sigkill_handler] [xxx] exits with return code = -9 #4890
Open
xinbingzhe opened 6 months ago
Describe the bug
[ERROR] [launch.py:321:sigkill_handler] [xxx] exits with return code = -9
My script trains the Hugging Face Transformers LLaMA model without problems, but it fails when I replace the MLP layer with a DeepSpeed MoE layer.
The error occurs after 5 or 6 training steps.
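For anyone reading the return code: Python's `subprocess` module reports a child that was terminated by a signal as a negative number, so `-9` corresponds to signal 9 (SIGKILL). The helper below is a small sketch for decoding such codes; `describe_return_code` is a hypothetical name used only for illustration and is not part of DeepSpeed. A SIGKILL during training is often sent by the kernel's out-of-memory killer when host RAM runs out, but that is an assumption here, since nothing in this report confirms the cause.

```python
import signal


def describe_return_code(code: int) -> str:
    """Translate a subprocess-style return code into a readable description.

    Hypothetical helper for illustration; not a DeepSpeed function.
    """
    if code < 0:
        # subprocess reports "terminated by signal N" as the return code -N.
        return f"killed by signal {signal.Signals(-code).name}"
    return f"exited with status {code}"


print(describe_return_code(-9))
```

If the decoded signal is SIGKILL, checking the kernel log (for example with `dmesg`) around the time of the crash is one way to see whether the OOM killer was involved.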
error info
my train script
ds_report output
System info: