microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] terminate called after throwing an instance of 'std::bad_alloc' #3126

Open · shisi-cc opened this issue 1 year ago

shisi-cc commented 1 year ago

Describe the bug
When I run RLHF code with trlx using DeepSpeed across two nodes, I hit a strange error: "terminate called after throwing an instance of 'std::bad_alloc'". Neither system memory nor GPU memory is anywhere near exhausted. Running on a single machine works fine, but the error occurs as soon as two nodes are used. The problem appears when I run inside a Docker container, but not when I run without a container. In addition, I am using an Anaconda environment.
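Since the error appears only once a second node is involved and before any model-specific work finishes, one way to narrow it down is to launch a bare-bones distributed script with the same hostfile and launcher, so only the DeepSpeed/NCCL setup is exercised without trlx. A minimal sketch, assuming the same environment; the file name dist_check.py is made up for illustration:

```python
# dist_check.py -- hypothetical minimal script; launch it exactly like the
# training run, e.g.:  deepspeed --hostfile=../../hostfile dist_check.py
# If this also aborts with std::bad_alloc, the problem lies in the launcher/
# NCCL/container setup rather than in the trlx training code.
import torch
import deepspeed

deepspeed.init_distributed()            # same initialization the launcher logs above show
rank = torch.distributed.get_rank()
world = torch.distributed.get_world_size()

# The launcher sets CUDA_VISIBLE_DEVICES per process, so device 0 is this rank's GPU.
device = torch.device("cuda", 0)
x = torch.ones(1, device=device)
torch.distributed.all_reduce(x)         # forces a real NCCL collective across the two nodes
print(f"rank {rank}/{world}: all_reduce -> {x.item()}")
```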

ds_report output
(trlx_env) root@9a3cd98dd64f:/data/work/trlx_rlhf/sft# deepspeed --hostfile=../../hostfile train_gptj_summarize.py
[2023-04-03 10:49:33,397] [INFO] [runner.py:454:main] Using IP address of 10.0.128.5 for node localhost
[2023-04-03 10:49:33,398] [INFO] [multinode_runner.py:65:get_cmd] Running on the following workers: localhost,deepspeed-18
[2023-04-03 10:49:33,398] [INFO] [runner.py:548:main] cmd = pdsh -S -f 1024 -w localhost,deepspeed-18 export PYTHONPATH=/data/work/trlx_rlhf/sft; cd /data/work/trlx_rlhf/sft; /root/mambaforge/envs/trlx_env/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF0sICJkZWVwc3BlZWQtMTgiOiBbMF19 --node_rank=%n --master_addr=10.0.128.5 --master_port=29500 train_gptj_summarize.py
deepspeed-18: [2023-04-03 10:49:35,192] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0], 'deepspeed-18': [0]}
deepspeed-18: [2023-04-03 10:49:35,192] [INFO] [launch.py:148:main] nnodes=2, num_local_procs=1, node_rank=1
deepspeed-18: [2023-04-03 10:49:35,192] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0], 'deepspeed-18': [1]})
deepspeed-18: [2023-04-03 10:49:35,192] [INFO] [launch.py:162:main] dist_world_size=2
deepspeed-18: [2023-04-03 10:49:35,192] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
localhost: [2023-04-03 10:49:35,240] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0], 'deepspeed-18': [0]}
localhost: [2023-04-03 10:49:35,240] [INFO] [launch.py:148:main] nnodes=2, num_local_procs=1, node_rank=0
localhost: [2023-04-03 10:49:35,240] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0], 'deepspeed-18': [1]})
localhost: [2023-04-03 10:49:35,240] [INFO] [launch.py:162:main] dist_world_size=2
localhost: [2023-04-03 10:49:35,240] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
deepspeed-18: Tokenizer loaded!
localhost: Tokenizer loaded!
deepspeed-18: Model loaded!
deepspeed-18: Downloading and preparing dataset parquet/openai_summarize_tldr to /root/.cache/huggingface/datasets/parquet/openai_summarize_tldr-bed27f7b4c8f201f/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec...
localhost: Model loaded!
localhost: Found cached dataset parquet (/root/.cache/huggingface/datasets/parquet/openai_summarize_tldr-0275f923823d6c0b/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
localhost: Found cached dataset parquet (/root/.cache/huggingface/datasets/parquet/openai_summarize_tldr-0275f923823d6c0b/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
localhost: Dataset loaded!
localhost: [2023-04-03 10:50:46,311] [INFO] [comm.py:657:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Downloading data files: 100%|██████████| 3/3 [00:00<00:00, 10941.66it/s]
Extracting data files: 100%|██████████| 3/3 [00:00<00:00, 1896.44it/s]
deepspeed-18: Dataset parquet downloaded and prepared to /root/.cache/huggingface/datasets/parquet/openai_summarize_tldr-bed27f7b4c8f201f/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec. Subsequent calls will reuse this data.
Found cached dataset parquet (/root/.cache/huggingface/datasets/parquet/openai_summarize_tldr-bed27f7b4c8f201f/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
deepspeed-18: Dataset loaded!
Downloading builder script: 100%|██████████| 6.27k/6.27k [00:00<00:00, 26.9kB/s]
deepspeed-18: terminate called after throwing an instance of 'std::bad_alloc'
deepspeed-18: what(): std::bad_alloc
deepspeed-18: [2023-04-03 10:51:15,307] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 1231493
deepspeed-18: [2023-04-03 10:51:15,308] [ERROR] [launch.py:324:sigkill_handler] ['/root/mambaforge/envs/trlx_env/bin/python', '-u', 'train_gptj_summarize.py', '--local_rank=0'] exits with return code = -6
pdsh@9a3cd98dd64f: deepspeed-18: ssh exited with exit code 250

Hostfile
localhost slots=1
deepspeed-18 slots=1

Launcher context
deepspeed --hostfile=../../hostfile train_gptj_summarize.py

Docker context
This problem occurs when I run inside a Docker container, but not when I run without a container.
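Since the failure only shows up inside containers, one thing worth ruling out is the container's shared-memory allocation: NCCL's shared-memory transport and PyTorch's data-loader workers both allocate from /dev/shm, and Docker's default of 64 MiB is far smaller than what a bare-metal host provides, which can surface as allocation failures that never happen outside the container. A small check to run in each container (a sketch; the script name and the 1 GiB threshold are only illustrative, not a DeepSpeed requirement):

```python
# shm_check.py -- hypothetical helper; run once inside each container.
# Reports the size of /dev/shm, which NCCL's shared-memory transport and
# PyTorch data-loader workers draw from. Docker's default is 64 MiB unless
# the container is started with --shm-size=... or --ipc=host.
import shutil

total, used, free = shutil.disk_usage("/dev/shm")
gib = 1024 ** 3
print(f"/dev/shm: total={total / gib:.2f} GiB, free={free / gib:.2f} GiB")

# The 1 GiB threshold below is only an illustrative guess, not a hard requirement.
if total < gib:
    print("Warning: /dev/shm looks small for NCCL; consider restarting the "
          "container with a larger --shm-size or with --ipc=host.")
```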

tjruwase commented 1 year ago

Please see the recent DeepSpeed Chat release (#3186): https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-chat

tjruwase commented 1 year ago

@shisi-cc, did the link above help? Can this issue be closed? Thanks!

Ancrilin commented 1 year ago

Hi, I have encountered the same issue. I created a Docker container on each of two machines and ran DeepSpeed-Chat/training/step1_supervised_finetuning/muti_node/run_66b.sh, but hit the same error.

hostfile
node1 slots=8
node2 slots=8

ds_report_output
[2023-04-26 06:34:22,975] [INFO] [multinode_runner.py:70:get_cmd] Running on the following workers: node1,node2
[2023-04-26 06:34:22,975] [INFO] [runner.py:540:main] cmd = pdsh -S -f 1024 -w node1,node2 export NCCL_VERSION=2.12.10-1; export PYTHONPATH=/workspace/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning; cd /workspace/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning; /opt/conda/bin/python -u -m deepspeed.launcher.launch --world_info=eyJub2RlMSI6IFszLCA1XSwgIm5vZGUyIjogWzAsIDFdfQ== --node_rank=%n --master_addr=10.176.50.36 --master_port=32783 main.py --data_path 'Dahoas/rm-static' --data_split '2,4,4' --model_name_or_path '/workspace/models/opt-1.3b' --per_device_train_batch_size '1' --per_device_eval_batch_size '1' --max_seq_len '512' --learning_rate '9.65e-6' --weight_decay '0.1' --num_train_epochs '2' --gradient_accumulation_steps '1' --lr_scheduler_type 'cosine' --num_warmup_steps '0' --seed '1234' --zero_stage '3' --deepspeed --output_dir './output'
node1: [2023-04-26 06:34:26,299] [INFO] [launch.py:222:main] 0 NCCL_VERSION=2.12.10-1
node1: [2023-04-26 06:34:26,299] [INFO] [launch.py:229:main] WORLD INFO DICT: {'node1': [3, 5], 'node2': [0, 1]}
node1: [2023-04-26 06:34:26,299] [INFO] [launch.py:235:main] nnodes=2, num_local_procs=2, node_rank=0
node1: [2023-04-26 06:34:26,299] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'node1': [0, 1], 'node2': [2, 3]})
node1: [2023-04-26 06:34:26,299] [INFO] [launch.py:247:main] dist_world_size=4
node1: [2023-04-26 06:34:26,299] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=3,5
node2: [2023-04-26 06:34:28,198] [INFO] [launch.py:222:main] 1 NCCL_VERSION=2.12.10-1
node2: [2023-04-26 06:34:28,198] [INFO] [launch.py:229:main] WORLD INFO DICT: {'node1': [3, 5], 'node2': [0, 1]}
node2: [2023-04-26 06:34:28,198] [INFO] [launch.py:235:main] nnodes=2, num_local_procs=2, node_rank=1
node2: [2023-04-26 06:34:28,198] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'node1': [0, 1], 'node2': [2, 3]})
node2: [2023-04-26 06:34:28,199] [INFO] [launch.py:247:main] dist_world_size=4
node2: [2023-04-26 06:34:28,199] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0,1
node1: [2023-04-26 06:34:30,128] [INFO] [comm.py:586:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
node1: Traceback (most recent call last):
node1:   File "/workspace/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py", line 343, in <module>
node1:     main()
node1:   File "/workspace/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py", line 185, in main
node1:     deepspeed.init_distributed()
node1:   File "/opt/conda/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 588, in init_distributed
node1:     cdb = TorchBackend(dist_backend, timeout, init_method, rank, world_size)
node1:   File "/opt/conda/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 32, in __init__
node1:     self.init_process_group(backend, timeout, init_method, rank, world_size)
node1:   File "/opt/conda/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 58, in init_process_group
node1:     torch.distributed.init_process_group(backend,
node1:   File "/opt/conda/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 754, in init_process_group
node1:     store, rank, world_size = next(rendezvous_iterator)
node1:   File "/opt/conda/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 246, in _env_rendezvous_handler
node1:     store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
node1:   File "/opt/conda/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 177, in _create_c10d_store
node1:     return TCPStore(
node1: RuntimeError: Stop_waiting response is expected
node1: terminate called after throwing an instance of 'std::bad_alloc'
node1: what(): std::bad_alloc
node1: [2023-04-26 06:34:31,325] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 1846
node1: [2023-04-26 06:34:31,328] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 1847
node1: [2023-04-26 06:34:31,328] [ERROR] [launch.py:434:sigkill_handler] ['/opt/conda/bin/python', '-u', 'main.py', '--local_rank=1', '--data_path', 'Dahoas/rm-static', '--data_split', '2,4,4', '--model_name_or_path', '/workspace/models/opt-1.3b', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '1', '--max_seq_len', '512', '--learning_rate', '9.65e-6', '--weight_decay', '0.1', '--num_train_epochs', '2', '--gradient_accumulation_steps', '1', '--lr_scheduler_type', 'cosine', '--num_warmup_steps', '0', '--seed', '1234', '--zero_stage', '3', '--deepspeed', '--output_dir', './output'] exits with return code = -6
pdsh@4e68f64d7185: node1: ssh exited with exit code 250
node2: terminate called after throwing an instance of 'std::bad_alloc'
node2: what(): std::bad_alloc
node2: terminate called after throwing an instance of 'std::bad_alloc'
node2: what(): std::bad_alloc
node2: [2023-04-26 06:34:37,245] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 1395
node2: [2023-04-26 06:34:37,247] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 1396
node2: [2023-04-26 06:34:37,247] [ERROR] [launch.py:434:sigkill_handler] ['/opt/conda/bin/python', '-u', 'main.py', '--local_rank=1', '--data_path', 'Dahoas/rm-static', '--data_split', '2,4,4', '--model_name_or_path', '/workspace/models/opt-1.3b', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '1', '--max_seq_len', '512', '--learning_rate', '9.65e-6', '--weight_decay', '0.1', '--num_train_epochs', '2', '--gradient_accumulation_steps', '1', '--lr_scheduler_type', 'cosine', '--num_warmup_steps', '0', '--seed', '1234', '--zero_stage', '3', '--deepspeed', '--output_dir', './output'] exits with return code = -6
pdsh@4e68f64d7185: node2: ssh exited with exit code 250

Launcher context
The container shared-memory (ShmSize) size is 10G.
deepspeed --hostfile=hostfile \
    --master_port xxx --master_addr xxx \
    main.py ....

Both of my nodes can communicate with each other, and they are running inside docker containers. Have you found a solution to this issue yet?
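For what it's worth, the node1 traceback above dies inside torch.distributed's TCPStore rendezvous ("RuntimeError: Stop_waiting response is expected") before any NCCL or GPU work starts, so it may help to test the rendezvous by itself, outside DeepSpeed. A minimal sketch, assuming the master address and port used by the launcher are reachable from both containers; the script name is made up, and the literal address and port below are placeholders to replace with your own --master_addr/--master_port values:

```python
# store_check.py -- hypothetical standalone rendezvous test; no GPUs involved.
# Run "python store_check.py master" inside the node1 container and
# "python store_check.py worker" inside the node2 container.
import sys
from datetime import timedelta

from torch.distributed import TCPStore

MASTER_ADDR = "10.176.50.36"   # placeholder: the --master_addr the launcher uses
MASTER_PORT = 29501            # placeholder: the --master_port the launcher uses

is_master = sys.argv[1] == "master"

# The master binds the port and waits for world_size participants (itself plus
# one worker); the worker connects to the master. If this hangs or raises, the
# rendezvous itself is broken between the two containers.
store = TCPStore(MASTER_ADDR, MASTER_PORT, world_size=2, is_master=is_master,
                 timeout=timedelta(seconds=60))

if is_master:
    store.set("ping", "ok")
    print("master: store created; worker connected")
else:
    print("worker: got", store.get("ping"))
```

If this minimal test fails the same way, the problem is most likely in the container networking (for example, the master address or port not being reachable from inside the other container) rather than in DeepSpeed-Chat itself.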

xiyue961 commented 8 months ago

Any updates? I encountered the same problem when fine-tuning Whisper with DeepSpeed across multiple nodes.