thangvubk / SoftGroup

[CVPR 2022 Oral] SoftGroup for Instance Segmentation on 3D Point Clouds
MIT License

Low metrics during inference and resumed trainings (GPU-SHARED SERVERS) #173

Closed LinoComesana closed 1 year ago

LinoComesana commented 1 year ago

Hello, thanks for sharing this awesome model @thangvubk. I built my own custom dataset and trained SoftGroup with it, achieving good IoU metrics for each class of my scenes. Currently, I am trying to use the checkpoint saved at a given epoch (epoch_999.pth) to run inference. However, when I execute the ./tools/dist_test.sh $CONFIG_FILE $CHECKPOINT $NUM_GPU command, the metrics are far lower than those reached at the 999th epoch of the training stage.

In other words, during the 999th epoch of the training stage I had a mIoU value close to 60%, but when I ran inference with the 'epoch_999.pth' checkpoint, the mIoU score was ~4%.

The target cloud that I used for that inference run is a copy of a point cloud used in previous validation stages, so it should output much better metrics.

I do not know why this is happening; it seems that the model was either not saved properly or is not being loaded correctly during inference.

Thanks in advance for your attention

LinoComesana commented 1 year ago

Solved!!!

I will explain all the steps I followed and the logic behind them, in case future readers need it. As I said in the original post, everything went well during the training stage, and the model weights were saved successfully at every epoch. However, the problem was in the load_checkpoint function. While testing the model, I ran a few training sessions with several break points so I could resume them later, and the classification metrics evolved well up to each break point. But when I tried to resume those trainings from a checkpoint using the flag --resume $RESUME in dist_train.sh (not implemented in the original code; see the sketch after this paragraph), the validation metrics seemed to start from scratch instead of continuing from the performance reached before each break point.
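
For reference, a minimal sketch of how such a flag can be exposed in the training entry point. The argument names and surrounding structure here are assumptions for illustration, not the repository's actual tools/train.py:

```python
# Hypothetical sketch of a --resume option for the training entry point.
# Argument names and help text are assumptions, not SoftGroup's actual code.
import argparse


def parse_args():
    parser = argparse.ArgumentParser(description='SoftGroup training')
    parser.add_argument('config', help='path to the config file')
    parser.add_argument('--resume', type=str, default=None,
                        help='checkpoint (.pth) to resume training from')
    return parser.parse_args()


if __name__ == '__main__':
    args = parse_args()
    if args.resume is not None:
        # Here the model, optimizer and epoch counter would be restored from
        # args.resume before the training loop starts, so validation metrics
        # continue from the interrupted run instead of from scratch.
        print(f'Resuming from {args.resume}')
```

dist_train.sh then only needs to forward the flag, e.g. by appending --resume $RESUME to the python command it launches.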

I was running SoftGroup on a multi-user server, and the load_checkpoint function (line 115) was not working properly, because torch.load restores tensors directly onto the GPU(s) they were saved from; in my case those GPUs are shared with other users, and I suspect the data they hold may be cleared between user sessions. Following some discussions on the PyTorch Forums (check this for more information) and issue #9139 of the official PyTorch repository, I replaced line 115 in softgroup.utils.py with the following: state_dict = torch.load(checkpoint, map_location='cpu'), which forces the tensors to be mapped to the CPU instead of the GPU, as done by default in the SoftGroup code.
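
For completeness, a minimal sketch of what the patched loading path can look like; the function signature and the checkpoint keys below are assumptions for illustration, and the actual fix is only the map_location='cpu' argument:

```python
# Minimal sketch of the patched checkpoint loading. The signature and the
# checkpoint keys ('net', 'optimizer', 'epoch') are assumptions; the actual
# change is only the map_location='cpu' argument to torch.load.
import torch


def load_checkpoint(checkpoint_path, model, optimizer=None):
    # Map every tensor to the CPU first. By default torch.load restores
    # tensors onto the device they were saved from, which can misbehave on a
    # shared GPU server where that device is not available to your session.
    state_dict = torch.load(checkpoint_path, map_location='cpu')

    # Assumed checkpoint layout: either a raw state dict or a dict with keys.
    model.load_state_dict(state_dict.get('net', state_dict))
    if optimizer is not None and 'optimizer' in state_dict:
        optimizer.load_state_dict(state_dict['optimizer'])

    # Move the model back to the GPU once the weights are safely loaded.
    if torch.cuda.is_available():
        model.cuda()
    return state_dict.get('epoch', 0)
```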