microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] exit with code -11 #4264

Closed: Anonymousplendid closed this issue 1 year ago

Anonymousplendid commented 1 year ago

Running the LLaMA Efficient Tuning PPO script to train an LLM of only 560M parameters with DeepSpeed on a single A100 (only for testing the pipeline). Without DeepSpeed the code runs fine, but with DeepSpeed it fails with an unexpected error. The training log is as follows:

[2023-09-05 17:50:36,040] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-05 17:50:38,699] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0]}
[2023-09-05 17:50:38,700] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-09-05 17:50:38,700] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-09-05 17:50:38,700] [INFO] [launch.py:163:main] dist_world_size=1
[2023-09-05 17:50:38,700] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
[2023-09-05 17:50:42,387] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-05 17:51:00,420] [INFO] [comm.py:631:init_distributed] cdb=None
[2023-09-05 17:51:00,420] [INFO] [comm.py:662:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
09/05/2023 17:51:00 - INFO - llmtuner.tuner.core.parser - Process rank: 0, device: cuda:0, n_gpu: 1 distributed training: True, compute dtype: torch.float16
input_ids: [1, 36, 44799, 5299, 267, 99579, 5579, 530, 660, 48763, 64225, 103800, 17, 1387, 103800, 19502, 66799, 15, 53180, 15, 530, 214804, 41259, 427, 368, 88331, 11732, 17, 189, 114330, 29, 121045, 8603, 63211, 613, 135576, 70349, 336, 9096, 61339, 29, 210]
inputs: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. Human: Give three tips for staying healthy. Assistant:
[2023-09-05 17:51:14,424] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.10.1, git-hash=unknown, git-branch=unknown
[2023-09-05 17:51:17,740] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1800012
[2023-09-05 17:51:17,740] [ERROR] [launch.py:321:sigkill_handler] ['/python3.1', '-u', 'src/train_bash.py', '--local_rank=0', '--deepspeed', 'ds_config.json', '--stage', 'ppo', '--model_name_or_path', 'bloomz-560m', '--do_train', '--dataset', 'alpaca_gpt4_en', '--template', 'default', '--finetuning_type', 'full', '--lora_target', 'q_proj,v_proj', '--resume_lora_training', 'False', '--checkpoint_dir', 'bloomz-560m', '--reward_model', './reward_model', '--output_dir', 'PPO', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '4', '--lr_scheduler_type', 'cosine', '--logging_steps', '10', '--save_steps', '1000', '--learning_rate', '1e-5', '--num_train_epochs', '1.0', '--plot_loss', '--fp16'] exits with return code = -11

It shows that the code fails while running inference. Since the code is not running in a Docker environment, the other (Docker-related) issues do not seem relevant. The ds_config is as follows:

{
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "initial_scale_power": 16,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "overlap_comm": false,
    "contiguous_gradients": true
  }
}

ds_report is as follows:

[2023-09-05 18:02:29,117] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)

DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible

 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/home/zhwggroup/xuedy/anaconda3/envs/llm/lib/python3.10/site-packages/torch']
torch version .................... 2.0.1+cu117
deepspeed install path ........... ['/home/zhwggroup/xuedy/anaconda3/envs/llm/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.10.1, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 12.0
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.7
shared memory (/dev/shm) size .... 503.77 GB

loadams commented 1 year ago

Hi @Anonymousplendid - the "exits with return code = -11" means you are encountering a segfault. It's not immediately obvious what might be causing this - are you seeing it with every inference run? Are you able to provide a smaller repro script for us to test with?
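
If it helps narrow things down, here are a couple of generic ways to get more detail on a segfault like this (not specific to DeepSpeed; the training command below is just a placeholder for whatever you are launching):

$ dmesg -T | grep -i -E 'segfault|sigsegv'                      # kernel log usually names the faulting shared library
$ PYTHONFAULTHANDLER=1 deepspeed src/train_bash.py ...          # dump a Python traceback if the crash is visible from Python
$ gdb -ex run -ex bt --args python -u src/train_bash.py ...     # native backtrace at the crash site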

Anonymousplendid commented 1 year ago

Well, sure, I understand that there is a segfault. Based on my testing, the problem lies not with the code or the config - the code runs fine on other devices, and also runs fine on the same device without DeepSpeed - but with some conflict between the environment and the DeepSpeed setup. However, ds_report shows that the environment looks fine, which is really confusing.

I believe the problem is difficult to reproduce elsewhere, but I am quite sure any DeepSpeed application will trigger it on my device. To give a simpler example, I reproduced the problem with DeepSpeed-Chat. The log reads as follows:

[2023-09-08 08:38:52,754] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-08 08:38:55,582] [WARNING] [runner.py:201:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-09-08 08:38:55,582] [INFO] [runner.py:567:main] cmd = /home/zhwggroup/xuedy/anaconda3/envs/llm/bin/python3.1 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None main.py --model_name_or_path facebook/opt-1.3b --gradient_accumulation_steps 8 --lora_dim 128 --zero_stage 2 --enable_tensorboard --tensorboard_path ./output --deepspeed --output_dir ./output
[2023-09-08 08:38:58,175] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-08 08:39:00,956] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0]}
[2023-09-08 08:39:00,956] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-09-08 08:39:00,956] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-09-08 08:39:00,956] [INFO] [launch.py:163:main] dist_world_size=1
[2023-09-08 08:39:00,956] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
[2023-09-08 08:39:04,172] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-08 08:39:07,746] [INFO] [comm.py:631:init_distributed] cdb=None
[2023-09-08 08:39:07,746] [INFO] [comm.py:662:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-09-08 08:39:08,966] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3237727
[2023-09-08 08:39:08,966] [ERROR] [launch.py:321:sigkill_handler] ['/home/zhwggroup/xuedy/anaconda3/envs/llm/bin/python3.1', '-u', 'main.py', '--local_rank=0', '--model_name_or_path', 'facebook/opt-1.3b', '--gradient_accumulation_steps', '8', '--lora_dim', '128', '--zero_stage', '2', '--enable_tensorboard', '--tensorboard_path', './output', '--deepspeed', '--output_dir', './output'] exits with return code = -11

and ds_report reads as follows: [screenshot of ds_report output]

By comparing with results from another machine, I found that the problem occurs in this code: [screenshot of the code] (run with the same settings as the DeepSpeed-Chat script). The original code is in DeepSpeed-Chat's step 1 supervised training main.py.

Ch3nYe commented 1 year ago

Same problem here.

shubhanjan99 commented 1 year ago

I'm running into the same problem, but my ds_report is missing the shared memory (/dev/shm) size line; it looks like:

$ ds_report
[2023-09-12 22:08:53,408] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/azureuser/data/.pip/torch']
torch version .................... 2.0.1+cu117
deepspeed install path ........... ['/home/azureuser/data/.pip/deepspeed']
deepspeed info ................... 0.10.0, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.7

Ch3nYe commented 1 year ago

https://github.com/microsoft/DeepSpeedExamples/issues/542#issuecomment-1621345542 was useful for me.
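
For anyone landing here later: as far as I can tell, that workaround amounts to disabling NCCL's InfiniBand / peer-to-peer transports before launching. The exact variables below are my reading of that thread, not something verified here:

$ export NCCL_IB_DISABLE=1     # skip the InfiniBand transport
$ export NCCL_P2P_DISABLE=1    # optionally also disable GPU peer-to-peer
$ deepspeed src/train_bash.py --deepspeed ds_config.json ...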

Anonymousplendid commented 1 year ago

@Ch3nYe Thanks for your help. It works.

loadams commented 1 year ago

@Ch3nYe and @Anonymousplendid - glad that works for you. It might be interesting to understand why disabling IB helps resolve this, but I will try to follow up offline.

@shubhanjan99 - you will need a newer version of DeepSpeed for ds_report to list the shm size, but that is more relevant for issues where the Docker shared memory is set too low.
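
For reference, when the problem really is a too-small /dev/shm inside Docker, the usual fix is to raise it when starting the container - a generic example, not specific to this issue:

$ docker run --gpus all --shm-size=8g -it <your-image>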

yiyangh-ps commented 3 months ago

I had the same -11 exit code today with the latest versions of torch, accelerate, and deepspeed, and I was able to solve it by downgrading everything to these earlier versions:

numpy==1.26.4
torch==2.0.1
tokenizers==0.14.0
transformers==4.35.0
accelerate==0.24.1
deepspeed==0.12.2
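
If it helps, pinning those versions in one go (assuming a plain pip environment) looks like:

$ pip install numpy==1.26.4 torch==2.0.1 tokenizers==0.14.0 transformers==4.35.0 accelerate==0.24.1 deepspeed==0.12.2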