microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] exits with return code = -11 #3989

Status: Closed (KeepAndWin closed 1 year ago)

KeepAndWin commented 1 year ago

Describe the bug

The code runs successfully on a machine with four 1080 Ti GPUs. However, when I run the same code on a machine with two 3090 GPUs, DeepSpeed reports "Killing subprocess" right after the log line "[real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)". The run then ends with "exits with return code = -11".

ds_report output

(image not preserved)

Screenshots

[2023-07-19 10:43:05,393] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-19 10:43:06,114] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-07-19 10:43:06,115] [INFO] [runner.py:555:main] cmd = /home/hitwh2021/anaconda3/envs/bit/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None src/clip_fine_tune_deepspeed.py --dataset CIRR --api-key HoWzEpTy4klumwh44YcBem6Ia --workspace keepandwin --experiment-name general --num-epoch 2 --clip-model-name RN50x4 --encoder both --learning-rate 2e-6 --batch-size 128 --transform targetpad --target-ratio 1.25 --save-training --save-best --validation-frequency 1 --deepspeed-config ./ds_config.json
[2023-07-19 10:43:06,881] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-19 10:43:07,262] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2023-07-19 10:43:07,262] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=2, node_rank=0
[2023-07-19 10:43:07,262] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2023-07-19 10:43:07,262] [INFO] [launch.py:163:main] dist_world_size=2
[2023-07-19 10:43:07,262] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2023-07-19 10:43:08,434] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-19 10:43:08,442] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-19 10:43:09,278] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 32047
[2023-07-19 10:43:09,278] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 32048
[2023-07-19 10:43:09,286] [ERROR] [launch.py:321:sigkill_handler] ['/home/hitwh2021/anaconda3/envs/bit/bin/python', '-u', 'src/clip_fine_tune_deepspeed.py', '--local_rank=1', '--dataset', 'CIRR', '--api-key', 'HoWzEpTy4klumwh44YcBem6Ia', '--workspace', 'keepandwin', '--experiment-name', 'general', '--num-epoch', '2', '--clip-model-name', 'RN50x4', '--encoder', 'both', '--learning-rate', '2e-6', '--batch-size', '128', '--transform', 'targetpad', '--target-ratio', '1.25', '--save-training', '--save-best', '--validation-frequency', '1', '--deepspeed-config', './ds_config.json'] exits with return code = -11

System info (please complete the following information):

jomayeri commented 1 year ago

@KeepAndWin, unfortunately I cannot reproduce this error because I do not have access to those specific GPU types. Most likely something is incompatible between CUDA or the built-in ops and the GPUs on the second machine. I suggest trying a basic PyTorch + CUDA script first to make sure that works.
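A basic PyTorch + CUDA check of the kind suggested here could look like the sketch below. It is only an illustration, not a script from the DeepSpeed maintainers: the function name `check_cuda_basics` and the report keys are made up for this example. It runs outside the DeepSpeed launcher, lists the GPU architectures the installed PyTorch binary was compiled for, and launches one small matmul on each visible GPU. If the Python process itself crashes during this check, the problem is likely in the PyTorch/CUDA installation rather than in DeepSpeed.

```python
import importlib.util
import math


def check_cuda_basics():
    """Collect basic facts about the plain PyTorch + CUDA setup, without DeepSpeed."""
    report = {"torch_installed": importlib.util.find_spec("torch") is not None}
    if not report["torch_installed"]:
        return report

    import torch  # imported lazily so the script still runs without PyTorch

    report["torch_version"] = torch.__version__
    report["torch_cuda_build"] = torch.version.cuda  # None for CPU-only builds
    report["cuda_available"] = torch.cuda.is_available()
    if not report["cuda_available"]:
        return report

    # GPU architectures this PyTorch binary was compiled for, e.g. "sm_86".
    report["compiled_archs"] = torch.cuda.get_arch_list()

    devices = []
    for idx in range(torch.cuda.device_count()):
        major, minor = torch.cuda.get_device_capability(idx)
        # A small matmul forces a real kernel launch on this device.
        x = torch.randn(256, 256, device=f"cuda:{idx}")
        total = (x @ x).sum().item()
        devices.append(
            {
                "index": idx,
                "name": torch.cuda.get_device_name(idx),
                "capability": f"sm_{major}{minor}",
                "matmul_ok": math.isfinite(total),
            }
        )
    report["devices"] = devices
    return report


if __name__ == "__main__":
    for key, value in check_cuda_basics().items():
        print(f"{key}: {value}")
```

Comparing each device's `capability` against `compiled_archs` is one way to spot a PyTorch build that predates the GPU. Running the script once per GPU with `CUDA_VISIBLE_DEVICES=0` and then `CUDA_VISIBLE_DEVICES=1` can also help isolate a single faulty device.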

jeffra commented 1 year ago

@KeepAndWin please see this thread for latest discussion on this: https://github.com/microsoft/DeepSpeed/issues/4002

jeffra commented 1 year ago

It seems both return codes -7 and -11 are related to shared-memory issues with Docker. Please see this reply, which has fixed other people's recent issues: https://github.com/microsoft/DeepSpeed/issues/4002#issuecomment-1644268195
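To make the numbers concrete: in Python's `subprocess` module, a negative return code -N means the child process was terminated by signal N. Assuming the launcher reports that value unchanged, -11 is SIGSEGV, and on Linux -7 is SIGBUS. The sketch below decodes such codes and reports how large `/dev/shm` is inside the current environment. The helper names `describe_return_code` and `shm_usage_mib` are invented for this example, and the linked reply is not reproduced here.

```python
import os
import shutil
import signal


def describe_return_code(code):
    """Translate a subprocess return code into a readable description."""
    if code >= 0:
        return f"exited with status {code}"
    try:
        # A negative code -N means the process was terminated by signal N.
        return f"killed by {signal.Signals(-code).name}"
    except ValueError:
        return f"killed by unknown signal {-code}"


def shm_usage_mib(path="/dev/shm"):
    """Return total/used/free space of the shared-memory mount in MiB, or None if absent."""
    if not os.path.isdir(path):
        return None
    usage = shutil.disk_usage(path)
    mib = 1024 * 1024
    return {
        "total_mib": usage.total // mib,
        "used_mib": usage.used // mib,
        "free_mib": usage.free // mib,
    }


if __name__ == "__main__":
    # Signal numbers are platform-specific; the codes below are the ones from this thread.
    for code in (-7, -11):
        print(code, "->", describe_return_code(code))
    print("/dev/shm:", shm_usage_mib())
```

Inside a Docker container, `/dev/shm` defaults to 64 MiB unless the container is started with a larger `--shm-size`, so a very small `total_mib` here would be consistent with the shared-memory explanation. It does not prove it, since a segmentation fault can have other causes.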

jomayeri commented 1 year ago

Closing for now.

Anonymousplendid commented 1 year ago

> Closing for now.

So how do you solve the issue?

Coderella-z commented 1 year ago

> > Closing for now.
>
> So how do you solve the issue?

Have you solved this problem? I've run into it too.