Closed Betty-J closed 1 year ago
Hi, I have never encountered this problem before. It might be caused by the fact that your process stops right after saving the latest checkpoint but before saving the best one, which is relatively rare. Can you check if this is the case?
Hi, thank you for your advice. While debugging, I ran into a separate issue: when training with Distributed Data Parallel (DDP) on, for example, 4 GPUs, three additional processes appear on the first GPU and consume some GPU memory. Do you know what causes this? I have verified that the pre-trained model is loaded with map_location set to CPU during model initialization.
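For reference, this is a minimal sketch of how I am checking the device placement; the helper names (`setup_rank_device`, `load_pretrained`, the `LOCAL_RANK` launch variable) are my own and not from the repository, and the pattern just illustrates pinning each DDP process to its own GPU before any CUDA call so that no process creates a stray context on GPU 0:

```python
import os
import torch
import torch.distributed as dist

def setup_rank_device():
    # Hypothetical helper: pin each DDP process to its own GPU *before* any
    # CUDA call, so none of them silently allocates a context on cuda:0.
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")
    return torch.device(f"cuda:{local_rank}")

def load_pretrained(path):
    # Loading to CPU first avoids the common pitfall where torch.load
    # restores CUDA tensors onto the GPU they were saved from (usually
    # cuda:0); the model is moved to the per-rank device afterwards.
    return torch.load(path, map_location="cpu")
```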
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
This issue was closed because it has been stalled for 5 days with no activity.
Bug
Reproduce the Bug
Hi, I recently tried to use SoftMatch for training but ran into the following situation while debugging. When the code reaches after_train_step in CheckpointHook, the check algorithm.it != algorithm.best_it already fails in the initial stage (at that point algorithm.it = 20 and algorithm.best_it = 19), so model_best.pth is never saved. When the model is loaded later, this raises FileNotFoundError: [Errno 2] No such file or directory: './model_best.pth'. I could not determine the cause and wonder whether you have encountered this behavior. Would it make sense to add a condition to CheckpointHook so that model_best.pth is guaranteed to be saved during the initial stage (see the sketch below)? If you have any solutions or suggestions, please let me know. Thank you very much.
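The kind of guard I have in mind is sketched below. This is only an illustration, not the actual hook from the repository: the attribute names (`save_dir`, `num_eval_iter`) and the `save_model` signature are assumptions, and the only substantive change is the extra `os.path.exists` check so an early `it != best_it` mismatch cannot leave the run without a model_best.pth to load later:

```python
import os

class CheckpointHook:
    # Minimal sketch of the proposed guard; the real hook has more logic
    # (distributed rank checks, configurable save intervals, etc.).
    def after_train_step(self, algorithm):
        save_dir = algorithm.save_dir                       # assumed attribute
        if algorithm.it % algorithm.num_eval_iter == 0:     # assumed save interval
            algorithm.save_model("latest_model.pth", save_dir)
            best_path = os.path.join(save_dir, "model_best.pth")
            # Save model_best.pth when this is the best step OR when the file
            # does not exist yet, so the initial stage always produces one.
            if algorithm.it == algorithm.best_it or not os.path.exists(best_path):
                algorithm.save_model("model_best.pth", save_dir)
```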
Error Messages and Logs