microsoft / Semi-supervised-learning

A Unified Semi-Supervised Learning Codebase (NeurIPS'22)
https://usb.readthedocs.io
MIT License

Regarding the bug that model_best.pth fails to save #152

Closed Betty-J closed 1 year ago

Betty-J commented 1 year ago

Bug

Reproduce the Bug

Hi, I recently tried to train with SoftMatch and ran into the following situation while debugging. When the code reaches after_train_step in CheckpointHook during the initial stage of training, algorithm.it != algorithm.best_it (for example, algorithm.it = 20 while algorithm.best_it = 19), so model_best.pth is never saved. When I later try to load that checkpoint, I get FileNotFoundError: [Errno 2] No such file or directory: './model_best.pth'. I could not determine the cause of this, and I wonder whether you have seen this behavior before. Would it make sense to add an extra condition in CheckpointHook so that model_best.pth is guaranteed to be saved during the initial stage (see the sketch below)? If you have any solutions or suggestions, please let me know. Thank you very much.
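For reference, this is a minimal sketch of the kind of extra condition I mean, written against the attribute names mentioned above (algorithm.it, algorithm.best_it). The class itself and names such as num_eval_iter, save_dir, save_name, and save_model are only my assumptions about the hook's surroundings, not necessarily the repository's actual CheckpointHook:

```python
import os

class CheckpointHookSketch:
    """Illustrative hook only; attribute names follow the issue text,
    not necessarily the repository's actual CheckpointHook."""

    def after_train_step(self, algorithm):
        # run on the same schedule as evaluation (assumed attribute)
        if algorithm.it % algorithm.num_eval_iter != 0:
            return
        save_path = os.path.join(algorithm.save_dir, algorithm.save_name)

        # always refresh the latest checkpoint
        algorithm.save_model('latest_model.pth', save_path)

        # original behavior: save model_best.pth only when the current step
        # is the best step; the extra "file does not exist yet" guard makes
        # sure the file is created even when best_it lags one step behind it,
        # which is the situation reported above
        best_ckpt = os.path.join(save_path, 'model_best.pth')
        if algorithm.it == algorithm.best_it or not os.path.exists(best_ckpt):
            algorithm.save_model('model_best.pth', save_path)
```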

Error Messages and Logs

(Screenshots of the error log attached: 2023-08-30 15:40:45 and 2023-08-30 15:41:20.)
Hhhhhhao commented 1 year ago

Hi, I never encountered this problem before. It might be caused by your process stopping right before saving the best checkpoint and right after saving the latest one, which is relatively rare. Can you check whether this is the case?

Betty-J commented 1 year ago

> Hi, I never encountered this problem before. It might be caused by your process stopping right before saving the best checkpoint and right after saving the latest one, which is relatively rare. Can you check whether this is the case?

Hi, thank you for your advice. While debugging, I ran into another issue: when training with Distributed Data Parallel (DDP), for example on 4 GPUs, three additional processes appear on the first GPU, each using some GPU memory. Do you know what causes this? I have verified that map_location is set to CPU when the pre-trained model is loaded during model initialization. (Screenshot attached: 2023-09-07 11:34:37.)
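In case it helps narrow this down, here is a minimal sketch of the per-worker setup I would expect to avoid extra CUDA contexts on GPU 0 under DDP; local_rank, the checkpoint path, and the model passed in are placeholders for illustration, not names from this repository:

```python
import torch
import torch.distributed as dist

def setup_worker(local_rank: int, ckpt_path: str, model: torch.nn.Module):
    # pin this process to its own GPU *before* any CUDA call, so an implicit
    # .cuda() does not create a context (and allocate memory) on GPU 0
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend='nccl')

    # load pretrained weights onto CPU, then move the model explicitly
    state_dict = torch.load(ckpt_path, map_location='cpu')
    model.load_state_dict(state_dict)
    model = model.cuda(local_rank)

    # wrap with DDP bound to this device only
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
    return model
```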

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] commented 1 year ago

This issue was closed because it has been stalled for 5 days with no activity.