[Closed] malinamanolache closed this issue 1 year ago
Did you solve the problem? For your information, we trained our model for the high-resolution configuration on 8 Titan RTX GPUs with 1 sample per device, for a total batch size of 8. We also tried using a single GPU and did not run into any trouble like the above.
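For readers trying to match this setup on fewer GPUs, one common workaround (my suggestion, not something the repository necessarily provides) is gradient accumulation: run several 1-sample forward/backward passes and apply one optimizer step per group, so the effective batch size still matches the original 8. A minimal sketch with plain floats standing in for gradients:

```python
# Illustration only (not the authors' code): matching an effective batch size
# of 8 on a single GPU via gradient accumulation.
EFFECTIVE_BATCH = 8   # 8 GPUs x 1 sample each in the original setup
PER_STEP_BATCH = 1    # what fits on one GPU
ACCUM_STEPS = EFFECTIVE_BATCH // PER_STEP_BATCH

def accumulate(per_sample_grads):
    """Average per-sample gradients in groups of ACCUM_STEPS,
    emitting one 'optimizer step' per completed group."""
    steps = []
    for i in range(0, len(per_sample_grads), ACCUM_STEPS):
        group = per_sample_grads[i:i + ACCUM_STEPS]
        steps.append(sum(group) / len(group))
    return steps

grads = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
print(accumulate(grads))  # one averaged update from 8 samples
```

In a real PyTorch loop the same idea means calling `loss.backward()` every step but `optimizer.step()` / `optimizer.zero_grad()` only every `ACCUM_STEPS` steps.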
Also, I keep noticing that some people have training issues that I cannot reproduce myself. I cannot be sure, but the problem might come from the CUDA version, the PyTorch version, accidentally using half precision, a dataset problem, or something else.
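Since mismatched CUDA/PyTorch versions and accidental half precision are among the suspected culprits, it can help to log the environment at the start of every run and compare it between machines. A small sketch (the helper name is my own; it degrades gracefully if torch is absent):

```python
import importlib.util

def report_environment():
    """Collect the settings that most often explain run-to-run differences:
    PyTorch version, CUDA build version, and the default floating dtype."""
    lines = []
    if importlib.util.find_spec("torch") is None:
        lines.append("torch: not installed")
    else:
        import torch
        lines.append(f"torch: {torch.__version__}")
        lines.append(f"cuda build: {torch.version.cuda}")            # None on CPU-only builds
        lines.append(f"default dtype: {torch.get_default_dtype()}")  # should be float32
    return lines

for line in report_environment():
    print(line)
```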
I would like to ask you not to modify a single script in our repository, and to first try to reproduce the basic model to make sure our code works fine on your machine. The configuration to use is InSPyReNet_SwinB.yaml. Please do not change any code and just use DUTS-TR for training. Then evaluate on the other benchmarks, including UHRSD-TE. If you can reproduce our results, then something you changed is likely causing the problem.
I also would like to mention that I did not train our model many times to produce the best result for the paper. I trained it once, tested it on various GPU servers, and verified that our method consistently produced almost identical results. So if you solve the problem above, I can guarantee that you will get the results you expect. Don't give up on your project, and I'll help as much as I can.
Hello and thank you for the great work.
While working with this project I came across a few problems, and I hope you can give me some suggestions.
1. Unable to reproduce models
Firstly I tried reproducing one of the LR+HR trainings, InSPyReNet_SwinB_HU (HRSOD-TR and UHRSD-TR), but I do not obtain the same results. I gathered the results in the following table:
Although the metrics are quite close, the quality of the predictions from the model I trained is far inferior to that of the provided model. I also tried training the PlusUltraHR model and I am experiencing the same thing. Why could this happen? Why can I not reproduce the model?
2. Increasing loss during validation
Additionally, I added validation to the training script in order to monitor the model's performance during training:
InSPyReNet_SwinB_HU training & validation
For InSPyReNet_SwinB_HU training, the validation set I used is DUTS-TE. The training loss is constantly decreasing but the validation loss starts increasing after some epochs:
My assumptions were the following: either the model is overfitting, or the data distribution between the training sets and the test set is too different.
Overfitting Check
To check whether overfitting is the problem, I trained an LR model (using the Plus_Ultra_LR config) on 43K samples ('MSRA-10K', 'HRSOD-TR', 'HRSOD-TE', 'ECSSD', 'HKU-IS', 'PASCAL-S', 'DAVIS', 'UHRSD-TR', 'UHRSD-TE', 'FSS-1000', 'DIS5K') and validated the model after each epoch on 300 images from DUTS-TE. I chose an LR model and only a subset of DUTS-TE for faster training. The validation loss still increases:
I know that overfitting typically occurs when the training set is small or the model is too complex. After this experiment with 43K images, I doubt that overfitting is responsible for the increase in validation loss.
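Whatever the cause of a validation curve that dips and then rises, a simple guard is to keep the checkpoint from the best validation epoch and stop once the loss has failed to improve for a few epochs. A minimal sketch (the class and the patience value are my own choices, not part of the InSPyReNet code):

```python
# Minimal early-stopping tracker: stop when validation loss has not improved
# for `patience` consecutive epochs, remembering the best epoch for evaluation.
class EarlyStopping:
    def __init__(self, patience=5):
        self.patience = patience
        self.best_loss = float("inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        """Record one epoch's validation loss; return True to stop training."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Example: a curve that dips and then rises, as described above.
losses = [0.9, 0.7, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9]
stopper = EarlyStopping(patience=3)
for epoch, loss in enumerate(losses):
    if stopper.step(epoch, loss):
        print(f"stop at epoch {epoch}, best epoch {stopper.best_epoch}")
        break
```

In practice you would save a checkpoint whenever `best_epoch` updates and evaluate that checkpoint, not the final one.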
Data Distribution Check
I was also thinking that the difference in data distribution between the training sets might be too big, so the model struggles to find an optimum that accommodates all cases, making it hard to generalize. To test this, I trained an LR model on UHRSD2K-TR only for 150 epochs and validated it on several test sets:
I expected the loss to decrease for UHRSD2K-TE and increase for HRSOD-TE and PASCAL-S, but the validation loss increases for all test sets. Across the experiments mentioned above, I have trained InSPyReNet with different configurations and datasets, and in every case the validation loss increases. What could be the problem? Why is the validation loss always increasing?
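One possibility worth checking (an assumption on my part, not a confirmed diagnosis of this repository): with pixel-wise BCE-style losses, validation loss can rise even while thresholded metrics stay flat, because the network's outputs saturate as training progresses, and a handful of confidently wrong pixels (e.g. ambiguous object boundaries) contribute exploding log-loss terms. A toy illustration in plain Python:

```python
import math

def bce(p, y, eps=1e-7):
    """Binary cross-entropy for a single pixel with prediction p and label y."""
    p = min(max(p, eps), 1 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def mean_bce(preds, labels):
    return sum(bce(p, y) for p, y in zip(preds, labels)) / len(labels)

def thresholded_error(preds, labels, t=0.5):
    """Fraction of pixels misclassified after thresholding at t."""
    return sum((p > t) != bool(y) for p, y in zip(preds, labels)) / len(labels)

labels = [1, 1, 1, 1, 0]                 # last pixel is ambiguous/mislabeled
early  = [0.7, 0.7, 0.7, 0.7, 0.6]       # soft predictions early in training
late   = [0.99, 0.99, 0.99, 0.99, 0.95]  # saturated predictions late in training

# Thresholded error is identical (one wrong pixel in both cases)...
print(thresholded_error(early, labels), thresholded_error(late, labels))
# ...but the mean loss grows, driven entirely by the one confidently wrong pixel.
print(mean_bce(early, labels), mean_bce(late, labels))
```

If this is what is happening, benchmark metrics (MAE, F-measure, S-measure) would stay close to the reported ones even as the validation loss curve climbs, so comparing metric curves rather than loss curves may be more informative here.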