hpcaitech / ColossalAI

Making large AI models cheaper, faster and more accessible
https://www.colossalai.org
Apache License 2.0

Deadlock when starting in docker container #1579

Closed. SusuXu closed this issue 2 years ago

SusuXu commented 2 years ago

๐Ÿ› Describe the bug

When I train the model or run inference in the Docker container, the process runs forever and appears to deadlock. It occupies all four GPUs at 100% GPU utilization, but uses only around 1200 MB of GPU memory on each GPU. Do you have any idea why it deadlocks? I suspect a multiprocessing issue.

sudo docker run --gpus all --rm -it -p 8020:8020 -v ${CHECKPOINT_DIR}:/model_checkpoint -v ${CONFIG_DIR}:/config hpcaitech/energon-ai:latest
++ dirname /config/server.sh

Environment

I'm using the Docker container provided by this GitHub repo on a cluster with four RTX A6000 GPUs.
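
One way to narrow down where a hung Python process is stuck is to have it dump its thread stacks after a timeout. The following is only a minimal sketch using the standard library's faulthandler module. The helper name run_with_hang_dump and its parameters are hypothetical, and how it would be wired into the container's server entry point is an assumption, since that code is not shown in this issue.

```python
import faulthandler
import sys


def run_with_hang_dump(fn, timeout_s, log_file=None):
    """Call fn(); if it has not returned after timeout_s seconds,
    write the stack of every Python thread to log_file.

    Hypothetical helper for illustration. log_file must be a real
    file object with a file descriptor (defaults to sys.stderr).
    """
    if log_file is None:
        log_file = sys.stderr
    # Arm a watchdog: after timeout_s seconds, dump all thread stacks
    # once, but keep the process alive (exit=False).
    faulthandler.dump_traceback_later(timeout_s, exit=False, file=log_file)
    try:
        return fn()
    finally:
        # Disarm the watchdog if fn() returned (or raised) in time.
        faulthandler.cancel_dump_traceback_later()
```

Wrapping the call that never returns (for example the model launch or the first inference request) in such a helper in each worker process should show which Python frame each process is blocked in, which may help tell a multiprocessing problem apart from a hang inside a GPU communication call.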

EricLingRui commented 1 year ago

@SusuXu How did you solve it, please?