hpcaitech / ColossalAI

Making large AI models cheaper, faster and more accessible
https://www.colossalai.org
Apache License 2.0

Run into deadlocks during training or inference #1580

Closed: SusuXu closed this issue 1 year ago

SusuXu commented 2 years ago

When I train the model or run inference in the Docker container, the model just runs forever and falls into a deadlock. It occupies all four GPUs at 100% utilization but only around 1200 MB of GPU memory on each GPU. Do you have any idea why it deadlocks? I suspect it is a multiprocessing issue.

```bash
sudo docker run --gpus all --rm -it -p 8020:8020 \
  -v ${CHECKPOINT_DIR}:/model_checkpoint \
  -v ${CONFIG_DIR}:/config \
  hpcaitech/energon-ai:latest
++ dirname /config/server.sh
```
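
For anyone hitting the same hang, a hedged diagnostic sketch follows. The container ID, the worker PID, and the availability of `ps`, `pip`, and `py-spy` inside the image are assumptions, not details from the report above; they only illustrate one way to see where the worker processes are blocked.

```bash
# Hypothetical diagnostic sketch: placeholders must be filled in, and py-spy
# needs ptrace permission (the container may have to be started with
# --cap-add=SYS_PTRACE for the dump to work).
sudo docker ps                                    # find the running container ID
sudo docker exec -it <container_id> ps aux        # note the Python worker PIDs
sudo docker exec -it <container_id> pip install py-spy
sudo docker exec -it <container_id> py-spy dump --pid <worker_pid>   # show where the worker is stuck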

ver217 commented 2 years ago

Hi, we are refactoring the code. At the moment the server and the inference engine preempt each other on the CPU, which can cause the kind of lag you are seeing. This will be fixed soon.
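
While waiting for that refactor, here is a hedged workaround sketch under the assumption that CPU contention is indeed the cause: reserve a dedicated set of cores for the container with Docker's `--cpuset-cpus` flag, so the HTTP server and the inference workers are less likely to starve each other. The core range below is a placeholder, not a maintainer recommendation.

```bash
# Hypothetical mitigation sketch: same run command as above, but pinned to a
# fixed set of CPU cores (adjust "0-15" to the cores available on your host).
sudo docker run --gpus all --rm -it -p 8020:8020 \
  --cpuset-cpus="0-15" \
  -v ${CHECKPOINT_DIR}:/model_checkpoint \
  -v ${CONFIG_DIR}:/config \
  hpcaitech/energon-ai:latest
```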

semal commented 1 year ago

Has the issue been fixed?

binmakeswell commented 1 year ago

> Has the issue been fixed?

Yes, it has been fixed. We have updated a lot since then; see https://github.com/hpcaitech/EnergonAI. This issue was closed due to inactivity. Thanks.