thangvubk / SoftGroup

[CVPR 2022 Oral] SoftGroup for Instance Segmentation on 3D Point Clouds
MIT License

I don't know how to kill a hanging process #200

Closed Rurouni-z closed 8 months ago

Rurouni-z commented 8 months ago

After running with --dist, I got an error saying the port was already in use. After the program crashed, I tried kill and kill -9, but neither killed the process. Is there a way to avoid or fix this? It has been an hour and the process is still alive.

CUDA_VISIBLE_DEVICES=0 nohup ./tools/dist_train.sh configs/softgroup++/backbone4.yaml 1 > output/semantic/output4.log 2>&1 &
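For the "port already in use" part, one common workaround is to ask the operating system for a port that is free right now and launch with that one. The helper below is only a sketch: `find_free_port` is a hypothetical name, and whether tools/dist_train.sh accepts a port override (for example through a `PORT` environment variable) is an assumption that should be checked in the script itself.

```python
import socket


def find_free_port(host="127.0.0.1"):
    """Ask the OS for a currently unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))  # port 0 lets the kernel pick any free port
        return s.getsockname()[1]


if __name__ == "__main__":
    # Print the port so a shell script or launcher can pick it up.
    print(find_free_port())
```

The port is released as soon as the function returns, so another program could still grab it before training starts. The window is small, but this lowers the chance of a collision rather than removing it.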

I'm looking for a clean way to shut down this kind of distributed training. BTW, I am using a custom dataset.

Rurouni-z commented 8 months ago

New finding: when I force-close the running program (Ctrl+C two or three times in a row), the process is adopted by process No. 1 (PID 1), cannot be killed, and keeps using the CPU.

./tools/dist_train.sh configs/softgroup++/backbone4.yaml 4

Update:

I found that this may be caused by the DataLoader, in the loop that enumerates batches. If I kill the program while it is loading data, I have to restart my computer.

for i, batch in enumerate(train_loader, start=1):
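When a process ignores kill -9, it can help to look at its state before rebooting. On Linux, a process in state D (uninterruptible sleep, usually waiting on I/O) does not act on any signal, SIGKILL included, until it leaves that state, and a process in state Z (zombie) is already dead and only waits for its parent to reap it. Whether either applies here is not known from the thread. The sketch below is Linux-only, and `process_state` is a hypothetical helper name.

```python
import os


def process_state(pid):
    """Return the one-letter Linux state of pid (e.g. R, S, D, Z), or None if it is gone."""
    try:
        with open(f"/proc/{pid}/status") as f:
            for line in f:
                if line.startswith("State:"):
                    # The line looks like "State:\tS (sleeping)".
                    return line.split()[1]
    except (FileNotFoundError, ProcessLookupError):
        return None  # no such process, or no /proc on this system
    return None


if __name__ == "__main__":
    print(process_state(os.getpid()))
```

A D state that never clears usually points at the storage or driver the process is waiting on rather than at the training code.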

thangvubk commented 8 months ago

You can use pkill -9 -f tools/train.py to kill the DDP processes.
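As an alternative to matching by command line, the launcher can be started in its own process group so that the script, the DDP ranks, and the DataLoader workers can all be signalled together. This is a minimal sketch, not something from the SoftGroup repository: `launch_in_new_group` and `stop_group` are hypothetical names, it is POSIX-only, and it assumes the children stay in the group they inherit.

```python
import os
import signal
import subprocess


def launch_in_new_group(cmd):
    """Start cmd as the leader of a new session, so its children share one process group."""
    return subprocess.Popen(cmd, start_new_session=True)


def stop_group(proc, grace_seconds=10.0):
    """SIGTERM the whole group, wait briefly, then SIGKILL whatever is left."""
    pgid = proc.pid  # with start_new_session=True the child leads its own group
    try:
        os.killpg(pgid, signal.SIGTERM)
    except ProcessLookupError:
        return proc.wait()  # the group is already gone
    try:
        proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        pass
    try:
        os.killpg(pgid, signal.SIGKILL)  # sweep any workers still alive
    except ProcessLookupError:
        pass
    return proc.wait()


# Example with the command from this thread:
# proc = launch_in_new_group(
#     ["./tools/dist_train.sh", "configs/softgroup++/backbone4.yaml", "1"])
# ...
# stop_group(proc)
```

Sending SIGTERM first gives the ranks a chance to exit on their own before SIGKILL. A process stuck in uninterruptible sleep will still not respond until it leaves that state.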