When I train the model or run inference in the Docker container, the process runs forever and appears to deadlock. All four GPUs show 100% utilization, but each uses only about 1200 MB of GPU memory. Do you have any idea why it deadlocks? I suspect a multiprocessing issue.
sudo docker run --gpus all --rm -it -p 8020:8020 -v ${CHECKPOINT_DIR}:/model_checkpoint -v ${CONFIG_DIR}:/config hpcaitech/energon-ai:latest
++ dirname /config/server.sh
cd /config
export BASE=/config
BASE=/config
export PYTHONPATH=/config
PYTHONPATH=/config
energonai service init --config_file=/config/opt_config.py
Energon Init Configurations:
opt_30B : <function opt_30B at 0x7fa870a60550>
opt_125M : <function opt_125M at 0x7fa870a604c0>
opt_175B : <function opt_175B at 0x7fa870a60670>
launch_engine : <function launch_engine at 0x7fa85e379b80>
model_class : <function opt_125M at 0x7fa870a604c0>
model_type : gpt
host : 127.0.0.1
port : 29402
half : True
checkpoint : home/susu/opt_metaseq_125m/model/restored.pt
backend : nccl
tp_init_size : 4
pp_init_size : 1
engine_server : <function launch_engine at 0x7fa85e379b80>
tokenizer_path : facebook/opt-30b
server_host : 0.0.0.0
server_port : 8020
log_level : info
allow_cors : True
executor_max_batch_size : 16
cache_size : 50
cache_list_size : 2
timeout_keep_alive : 180
executor_max_queue_size : 0
fixed_cache_keys : [('Question: What is the name of the largest continent on earth?\nAnswer: Asia\n\nQuestion: What is at the center of the solar system?\nAnswer:', 64), ('A chat between a salesman and a student.\n\nSalesman: Hi boy, are you looking for a new phone?\nStudent: Yes, my phone is not functioning well.\nSalesman: What is your budget? \nStudent: I have received my scholarship so I am fine with any phone.\nSalesman: Great, then perhaps this latest flagship phone is just right for you.', 64), ("English: I am happy today.\nChinese: 我今天很开心。\n\nEnglish: I am going to play basketball.\nChinese: 我一会去打篮球。\n\nEnglish: Let's celebrate our anniversary.\nChinese:", 64)]
max_batch_size : 32
dtype : torch.float16
rm_padding : False
seed : 1024
verbose : True
trt_sample : None
Downloading vocab.json: 100%|██████████| 878k/878k [00:00<00:00, 20.4MB/s]
Downloading merges.txt: 100%|██████████| 446k/446k [00:00<00:00, 10.6MB/s]
Downloading special_tokens_map.json: 100%|██████████| 221/221 [00:00<00:00, 149kB/s]
Downloading tokenizer_config.json: 100%|██████████| 685/685 [00:00<00:00, 522kB/s]
Downloading config.json: 100%|██████████| 651/651 [00:00<00:00, 412kB/s]
[09/08/22 13:59:19] INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:1 to store for rank: 1
[09/08/22 13:59:19] INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:1 to store for rank: 2
[09/08/22 13:59:19] INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:1 to store for rank: 3
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:1
with 4 nodes.
[09/08/22 13:59:19] INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:1 to store for rank: 0
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:1
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:1
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:1
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:2 to store for rank: 0
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:2 to store for rank: 1
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:2 to store for rank: 2
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:2 to store for rank: 3
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:2
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:2
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:2
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:2
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:3 to store for rank: 3
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:3 to store for rank: 2
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:3
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:3
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:3 to store for rank: 0
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:3 to store for rank: 1
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:4 to store for rank: 2
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:4 to store for rank: 3
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:3
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:3
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:4 to store for rank: 1
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:4 to store for rank: 0
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:4
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:4
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:5 to store for rank: 1
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:5 to store for rank: 0
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:4
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:4
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:5 to store for rank: 3
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:5
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:5 to store for rank: 2
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:6 to store for rank: 3
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:5
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:5
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:5
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:6 to store for rank: 0
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:6 to store for rank: 1
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:6 to store for rank: 2
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:6
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:6
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:6
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:7 to store for rank: 2
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:7 to store for rank: 0
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:7 to store for rank: 1
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:6
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:7 to store for rank: 3
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:7
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:8 to store for rank: 3
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:7
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:7
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:7
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:8 to store for rank: 0
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:8 to store for rank: 2
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:8 to store for rank: 1
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:8
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:9 to store for rank: 0
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:8
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:8
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:9 to store for rank: 2
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:9 to store for rank: 1
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:8
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:9 to store for rank: 3
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:9
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:10 to store for rank: 3
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:9
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:10 to store for rank: 0
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:9
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:9
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:10 to store for rank: 2
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:10 to store for rank: 1
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:10
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:10
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:11 to store for rank: 1
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:10
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:11 to store for rank: 3
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:11 to store for rank: 2
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:10
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:11 to store for rank: 0
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:11
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:11
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:11
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:11
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:12 to store for rank: 0
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:12 to store for rank: 1
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:12 to store for rank: 2
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:12 to store for rank: 3
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:12
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:12
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:12
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:12
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:13 to store for rank: 0
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:13 to store for rank: 2
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:13 to store for rank: 3
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:13 to store for rank: 1
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:13
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:13
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:13
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:13
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:14 to store for rank: 0
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:14 to store for rank: 1
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:14 to store for rank: 2
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:14 to store for rank: 3
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:14
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:14
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:14
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:14
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:15 to store for rank: 3
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:15 to store for rank: 0
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:15 to store for rank: 2
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:15 to store for rank: 1
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:15
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:15
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:15
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:15
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:16 to store for rank: 0
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:16 to store for rank: 1
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:16 to store for rank: 3
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:16 to store for rank: 2
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:16
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:16
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:16
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:16
with 4 nodes.
INFO colossalai - colossalai - INFO: /opt/conda/lib/python3.9/site-packages/colossalai/context/parallel_context.py:521 set_device
INFO colossalai - colossalai - INFO: process rank 3 is bound to device 3
INFO colossalai - colossalai - INFO: /opt/conda/lib/python3.9/site-packages/colossalai/context/parallel_context.py:521 set_device
INFO colossalai - colossalai - INFO: process rank 1 is bound to device 1
[09/08/22 13:59:20] INFO colossalai - colossalai - INFO: /opt/conda/lib/python3.9/site-packages/colossalai/context/parallel_context.py:521 set_device
INFO colossalai - colossalai - INFO: process rank 0 is bound to device 0
[09/08/22 13:59:20] INFO colossalai - colossalai - INFO: /opt/conda/lib/python3.9/site-packages/colossalai/context/parallel_context.py:521 set_device
INFO colossalai - colossalai - INFO: process rank 2 is bound to device 2
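To narrow down where the hang begins, the log can be tallied per rank. In the output above, each of the four ranks completes barrier key 16 and is then bound to its device, which would suggest that the stall starts after the process groups are set up rather than during rendezvous. The following is a minimal sketch of that tally; the function name `summarize` is hypothetical, and the patterns assume the log format shown above.

```python
import re
from collections import defaultdict

# Patterns for the two kinds of log lines of interest (format taken from the log above).
COMPLETED = re.compile(
    r"Rank (\d+): Completed store-based barrier for key:store_based_barrier_key:(\d+)"
)
BOUND = re.compile(r"process rank (\d+) is bound to device (\d+)")


def summarize(log_text):
    """Return (highest completed barrier key per rank, rank -> device binding)."""
    last_barrier = defaultdict(int)
    devices = {}
    for line in log_text.splitlines():
        match = COMPLETED.search(line)
        if match:
            rank, key = int(match.group(1)), int(match.group(2))
            last_barrier[rank] = max(last_barrier[rank], key)
        match = BOUND.search(line)
        if match:
            devices[int(match.group(1))] = int(match.group(2))
    return dict(last_barrier), devices


# A short excerpt in the same shape as the log above, used only for illustration.
EXCERPT = """\
INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:15
with 4 nodes.
INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:16
with 4 nodes.
INFO: process rank 1 is bound to device 1
"""
print(summarize(EXCERPT))  # prints ({1: 16}, {1: 1})
```

A rank whose highest barrier key is lower than the others, or that has no device binding, would be the first place to look.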
Environment
I'm using the Docker container provided by this GitHub repo on a cluster with four RTX A6000 GPUs.
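One way to gather more information would be to rerun the same container with NCCL diagnostics enabled. The sketch below only builds the command line. `build_debug_command` and the example paths are hypothetical, and the added switches (`--ipc=host`, `NCCL_DEBUG=INFO`, `NCCL_P2P_DISABLE=1`) are generic NCCL troubleshooting options, not a confirmed fix for this hang.

```python
import shlex


def build_debug_command(checkpoint_dir, config_dir,
                        image="hpcaitech/energon-ai:latest"):
    """Build the docker run command from this report plus NCCL diagnostics."""
    return [
        "sudo", "docker", "run", "--gpus", "all", "--rm", "-it",
        "-p", "8020:8020",
        # Assumption: sharing the host IPC namespace rules out a too-small
        # /dev/shm inside the container as a factor.
        "--ipc=host",
        # Ask NCCL to log its initialization and transport selection.
        "-e", "NCCL_DEBUG=INFO",
        # Assumption: disabling peer-to-peer transport tests whether P2P
        # communication between the GPUs is involved in the hang.
        "-e", "NCCL_P2P_DISABLE=1",
        "-v", f"{checkpoint_dir}:/model_checkpoint",
        "-v", f"{config_dir}:/config",
        image,
    ]


# Placeholder paths; substitute the real CHECKPOINT_DIR and CONFIG_DIR.
print(shlex.join(build_debug_command("/path/to/checkpoints", "/path/to/config")))
```

If the run proceeds with `NCCL_P2P_DISABLE=1`, that would point toward GPU peer-to-peer communication rather than Python multiprocessing.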
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:12
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:13 to store for rank: 0
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:13 to store for rank: 2
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:13 to store for rank: 3
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:13 to store for rank: 1
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:13
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:13
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:13
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:13
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:14 to store for rank: 0
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:14 to store for rank: 1
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:14 to store for rank: 2
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:14 to store for rank: 3
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:14
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:14
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:14
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:14
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:15 to store for rank: 3
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:15 to store for rank: 0
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:15 to store for rank: 2
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:15 to store for rank: 1
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:15
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:15
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:15
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:15
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:16 to store for rank: 0
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:16 to store for rank: 1
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:16 to store for rank: 3
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:16 to store for rank: 2
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:16
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:16
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:16
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:16
with 4 nodes.
INFO colossalai - colossalai - INFO: /opt/conda/lib/python3.9/site-packages/colossalai/context/parallel_context.py:521 set_device
INFO colossalai - colossalai - INFO: process rank 3 is bound to device 3
INFO colossalai - colossalai - INFO: /opt/conda/lib/python3.9/site-packages/colossalai/context/parallel_context.py:521 set_device
INFO colossalai - colossalai - INFO: process rank 1 is bound to device 1
[09/08/22 13:59:20] INFO colossalai - colossalai - INFO: /opt/conda/lib/python3.9/site-packages/colossalai/context/parallel_context.py:521 set_device
INFO colossalai - colossalai - INFO: process rank 0 is bound to device 0
[09/08/22 13:59:20] INFO colossalai - colossalai - INFO: /opt/conda/lib/python3.9/site-packages/colossalai/context/parallel_context.py:521 set_device
INFO colossalai - colossalai - INFO: process rank 2 is bound to device 2
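In the excerpt above, barriers 9 through 16 each show all four ranks completing, and every rank is then bound to its device, so the hang seems to come after process-group setup rather than during it. Below is a minimal sketch for checking this on a saved copy of the log. The function name `incomplete_barriers` and the regular expression are my own assumptions based on the line format shown here, not part of Energon-AI or Colossal-AI.

```python
import re
from collections import defaultdict

# Matches lines such as:
# "... Rank 3: Completed store-based barrier for key:store_based_barrier_key:9"
# The pattern is an assumption derived from the log format pasted above.
COMPLETED = re.compile(
    r"Rank (\d+): Completed store-based barrier for key:store_based_barrier_key:(\d+)"
)


def incomplete_barriers(log_text, world_size):
    """Return {barrier_key: missing_ranks} for barriers that some rank did not complete.

    Only barriers that at least one rank completed are considered; a barrier
    that no rank finished leaves no "Completed" line and is not reported.
    """
    done = defaultdict(set)
    for rank, key in COMPLETED.findall(log_text):
        done[int(key)].add(int(rank))

    expected = set(range(world_size))
    return {
        key: sorted(expected - ranks)
        for key, ranks in sorted(done.items())
        if expected - ranks
    }


if __name__ == "__main__":
    import sys

    # Usage: python check_barriers.py server.log 4
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        print(incomplete_barriers(fh.read(), int(sys.argv[2])))
```

An empty dictionary means every barrier that appears in the log was completed by all ranks, which would point away from the rendezvous step as the place where the processes get stuck.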
Environment
I'm using the Docker container provided by this GitHub repo on a cluster with four RTX A6000 GPUs.