Closed Rurouni-z closed 8 months ago
New finding: when I force-close the running program (Ctrl+C two or three times in a row), the leftover processes are adopted by process No. 1 (PID 1), cannot be killed, and keep occupying the CPU.
./tools/dist_train.sh configs/softgroup++/backbone4.yaml 4
Update:
I found this may be caused by the DataLoader, while it is enumerating batches. If I kill the program while it is loading data, I have to restart my computer.
for i, batch in enumerate(train_loader, start=1):
You can use pkill -9 -f tools/train.py
to kill the DDP processes.
After I used --dist, I got an error saying the port number was already in use. Then, after the program crashed, I tried kill and kill -9 but could not kill the process. Is there any way to avoid or solve this problem? It has been 1 hour now and the process still has not been killed.
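For the occupied-port error, it may help to check the port before launching. Below is a minimal sketch, assuming plain TCP on localhost. `find_free_port` and `port_in_use` are hypothetical helper names, not part of this repo. How the chosen port reaches the launcher (for example through a `MASTER_PORT` environment variable) depends on the launch script, so treat that part as an assumption.

```python
import socket


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for a currently unused TCP port (hypothetical helper)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))  # port 0 lets the OS pick any free port
        return s.getsockname()[1]


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.connect_ex((host, port)) == 0  # 0 means the connect succeeded


if __name__ == "__main__":
    port = find_free_port()
    # Assumption: the launcher reads the rendezvous port from the environment.
    print(f"MASTER_PORT={port}")
```

Note that another process can still grab the port between the check and the actual launch, so this only narrows the window and does not remove the race.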
I'm trying to find an elegant way to shut down this kind of distributed training properly. BTW, I am using a custom dataset.
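One common pattern for a cleaner shutdown is to catch SIGINT/SIGTERM, set a flag, and leave the loop after the current batch instead of interrupting the data loader mid-fetch. This is a minimal sketch, not this repo's method. `StopFlag`, `install_handlers`, and `run_epoch` are hypothetical names, and `step_fn` stands in for whatever the training step does. It cannot help with kill -9, because SIGKILL cannot be caught.

```python
import signal


class StopFlag:
    """Remembers that a stop signal arrived (hypothetical helper)."""

    def __init__(self):
        self.requested = False

    def handle(self, signum, frame):
        # Only record the request; the loop decides when it is safe to stop.
        self.requested = True


def install_handlers(flag, signals=(signal.SIGINT, signal.SIGTERM)):
    """Route the given signals to the flag. Must run in the main thread."""
    for sig in signals:
        signal.signal(sig, flag.handle)


def run_epoch(train_loader, step_fn, flag):
    """Run batches until the loader is exhausted or a stop was requested."""
    done = 0
    for i, batch in enumerate(train_loader, start=1):
        step_fn(batch)
        done = i
        if flag.requested:
            # Assumption: cleanup such as saving a checkpoint or calling
            # torch.distributed.destroy_process_group() would go here.
            break
    return done


if __name__ == "__main__":
    flag = StopFlag()
    install_handlers(flag)
    finished = run_epoch(range(5), lambda batch: None, flag)
    print(f"finished {finished} batches")
```

With this in place a single Ctrl+C lets every rank finish its current batch and exit on its own, which avoids pressing Ctrl+C repeatedly. In a multi-process run each rank would need to install the handlers itself.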