Closed SusuXu closed 1 year ago
Hi, we are refactoring the code. The server and the inference engine currently preempt the CPU, which may cause lag. This will be fixed soon.
Has the issue been fixed?
Yes, it has been fixed. We have made many updates since then; see https://github.com/hpcaitech/EnergonAI. This issue was closed due to inactivity. Thanks.
When I train the model or run inference in a Docker container, the process runs forever and appears to deadlock. It occupies all four GPUs at 100% GPU utilization, but each GPU uses only around 1200 MB of memory. Do you have any idea why it deadlocks? I suspect a multiprocessing issue.
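One way to test the multiprocessing suspicion is to have each worker process periodically dump its Python stack traces, so a stalled call shows up in the output. The following is a minimal sketch using the standard-library `faulthandler` module; the helper names `enable_hang_dump` and `disable_hang_dump` and the 60-second interval are illustrative assumptions, not part of EnergonAI or Colossal-AI.

```python
import faulthandler
import sys


def enable_hang_dump(timeout_s: float = 60.0, stream=None) -> None:
    """Dump the stacks of all threads every `timeout_s` seconds until cancelled.

    `stream` must be a file object with a real file descriptor;
    it defaults to the current sys.stderr.
    """
    target = stream if stream is not None else sys.stderr
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=target)


def disable_hang_dump() -> None:
    """Stop the periodic dumps started by enable_hang_dump()."""
    faulthandler.cancel_dump_traceback_later()
```

Calling `enable_hang_dump()` near the start of each worker would print where the Python code is blocked if the process stalls. It only shows Python frames, so a hang inside a native communication call would appear as the Python line that invoked it.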
sudo docker run --gpus all --rm -it -p 8020:8020 -v ${CHECKPOINT_DIR}:/model_checkpoint -v ${CONFIG_DIR}:/config hpcaitech/energon-ai:latest
++ dirname /config/server.sh
[09/08/22 13:59:19] INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:1 to store for rank: 2
[09/08/22 13:59:19] INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:1 to store for rank: 3
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:1
with 4 nodes.
[09/08/22 13:59:19] INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:1 to store for rank: 0
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:1
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:1
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:1
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:2 to store for rank: 0
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:2 to store for rank: 1
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:2 to store for rank: 2
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:2 to store for rank: 3
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:2
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:2
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:2
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:2
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:3 to store for rank: 3
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:3 to store for rank: 2
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:3
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:3
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:3 to store for rank: 0
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:3 to store for rank: 1
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:4 to store for rank: 2
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:4 to store for rank: 3
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:3
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:3
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:4 to store for rank: 1
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:4 to store for rank: 0
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:4
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:4
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:5 to store for rank: 1
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:5 to store for rank: 0
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:4
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:4
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:5 to store for rank: 3
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:5
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:5 to store for rank: 2
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:6 to store for rank: 3
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:5
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:5
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:5
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:6 to store for rank: 0
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:6 to store for rank: 1
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:6 to store for rank: 2
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:6
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:6
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:6
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:7 to store for rank: 2
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:7 to store for rank: 0
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:7 to store for rank: 1
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:6
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:7 to store for rank: 3
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:7
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:8 to store for rank: 3
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:7
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:7
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:7
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:8 to store for rank: 0
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:8 to store for rank: 2
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:8 to store for rank: 1
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:8
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:9 to store for rank: 0
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:8
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:8
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:9 to store for rank: 2
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:9 to store for rank: 1
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:8
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:9 to store for rank: 3
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:9
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:10 to store for rank: 3
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:9
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:10 to store for rank: 0
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:9
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:9
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:10 to store for rank: 2
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:10 to store for rank: 1
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:10
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:10
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:11 to store for rank: 1
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:10
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:11 to store for rank: 3
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:11 to store for rank: 2
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:10
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:11 to store for rank: 0
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:11
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:11
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:11
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:11
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:12 to store for rank: 0
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:12 to store for rank: 1
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:12 to store for rank: 2
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:12 to store for rank: 3
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:12
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:12
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:12
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:12
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:13 to store for rank: 0
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:13 to store for rank: 2
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:13 to store for rank: 3
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:13 to store for rank: 1
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:13
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:13
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:13
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:13
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:14 to store for rank: 0
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:14 to store for rank: 1
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:14 to store for rank: 2
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:14 to store for rank: 3
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:14
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:14
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:14
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:14
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:15 to store for rank: 3
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:15 to store for rank: 0
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:15 to store for rank: 2
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:15 to store for rank: 1
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:15
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:15
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:15
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:15
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:16 to store for rank: 0
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:16 to store for rank: 1
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:16 to store for rank: 3
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:16 to store for rank: 2
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:16
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:16
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:16
with 4 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:16
with 4 nodes.
INFO colossalai - colossalai - INFO: /opt/conda/lib/python3.9/site-packages/colossalai/context/parallel_context.py:521 set_device
INFO colossalai - colossalai - INFO: process rank 3 is bound to device 3
INFO colossalai - colossalai - INFO: /opt/conda/lib/python3.9/site-packages/colossalai/context/parallel_context.py:521 set_device
INFO colossalai - colossalai - INFO: process rank 1 is bound to device 1
[09/08/22 13:59:20] INFO colossalai - colossalai - INFO: /opt/conda/lib/python3.9/site-packages/colossalai/context/parallel_context.py:521 set_device
INFO colossalai - colossalai - INFO: process rank 0 is bound to device 0
[09/08/22 13:59:20] INFO colossalai - colossalai - INFO: /opt/conda/lib/python3.9/site-packages/colossalai/context/parallel_context.py:521 set_device
INFO colossalai - colossalai - INFO: process rank 2 is bound to device 2
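To check whether the stall happens during process-group setup, the log above can be scanned for any store-based barrier that not every rank completed. This is a minimal sketch that assumes the message wording shown in the log; the function name `incomplete_barriers` is hypothetical, and the world size of 4 is taken from the "with 4 nodes" lines.

```python
import re
from collections import defaultdict
from typing import Dict, Set

# Patterns for the two message types that appear in the log above.
ADDED = re.compile(r"Added key: store_based_barrier_key:(\d+) to store for rank: (\d+)")
COMPLETED = re.compile(
    r"Rank (\d+): Completed store-based barrier for key:store_based_barrier_key:(\d+)"
)


def incomplete_barriers(log_text: str, world_size: int) -> Dict[int, Set[int]]:
    """Map each barrier key seen in the log to the ranks that never completed it.

    Barriers that every rank completed are omitted from the result.
    """
    seen_keys = set()
    done = defaultdict(set)
    for key, _rank in ADDED.findall(log_text):
        seen_keys.add(int(key))
    for rank, key in COMPLETED.findall(log_text):
        seen_keys.add(int(key))
        done[int(key)].add(int(rank))

    expected = set(range(world_size))
    result = {}
    for key in sorted(seen_keys):
        missing = expected - done[key]
        if missing:
            result[key] = missing
    return result
```

In the log excerpt above, all four ranks report completing barriers 1 through 16 and are then bound to their devices, so `incomplete_barriers(log_text, 4)` would return an empty dict for it. That suggests the hang occurs after process-group initialization rather than inside one of these barriers.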