imatge-upc / rvos

RVOS: End-to-End Recurrent Network for Video Object Segmentation (CVPR 2019)
https://imatge-upc.github.io/rvos/

Retrained zero shot results are inferior to the public scores #15


JanySunny commented 5 years ago

I retrained the zero-shot model using train_zero_shot_youtube.sh with the given settings, ran inference with eval_zero_shot_youtube.sh, and then prepared the submission with prepare_results_submission.py for the official YouTube-VOS challenge website. However, the test results on YouTube-VOS do not match the public scores. Are there any other settings or tricks used during training or testing? I noticed that data augmentation is used in training but not in testing, and I followed the public settings exactly. The models were trained for 50 epochs on a single TitanX GPU (batch_size=4, clips=5). The retrained results are as follows (the exact sequence I ran is sketched at the end of this comment):

retrain-RVOS-T: 33.87, 18.37, 38.62, 22.23

retrain-RVOS-S: 38.52, 18.72, 41.70, 22.59

retrain-RVOS-ST: 41.56, 21.46, 45.00, 24.52

Besides, I also tested the public zero-shot YouTube model on YouTube-VOS and got the following scores: pub-RVOS-ST: 43.39, 21.10, 45.30, 24.32.

Since the public model reproduces scores close to the published ones under my test settings, the inferior retrained results do not seem to be due to the test settings, but I do not know what else could cause them. Can you help me?
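For completeness, this is the sequence I ran, as a sketch (the script names are from the repo; working directory and any extra arguments are simplified here):

```python
import subprocess

# 1. Retrain the zero-shot model with the given settings.
subprocess.run(["bash", "train_zero_shot_youtube.sh"], check=True)

# 2. Run inference on YouTube-VOS with the retrained checkpoint.
subprocess.run(["bash", "eval_zero_shot_youtube.sh"], check=True)

# 3. Convert the predictions into the challenge submission format.
subprocess.run(["python", "prepare_results_submission.py"], check=True)
```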

carlesventura commented 5 years ago

There could be at least two reasons for the different results when retraining:

  1. A new training run will see the images and the instances in a different order, and the data augmentation applied will also differ. As a result, the model can be slightly better or worse than the one we trained and released (see the seeding sketch after this list).

  2. For the zero-shot case, we trained the model for 40 epochs. Even if the validation loss (obtained with the train-val subset) of a model trained for 50 epochs were better, that does not mean its results on the validation set would be better than those of the released model.
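As an aside, the run-to-run variation in point 1 can be reduced by pinning the RNG seeds before training, so that data ordering and augmentation repeat across runs. A minimal sketch, assuming a standard PyTorch setup (the released training code does not necessarily do this):

```python
import random

import numpy as np
import torch


def set_seed(seed: int = 123) -> None:
    """Pin the RNG seeds so data ordering and augmentation repeat across runs.

    Bit-exact GPU determinism additionally needs the cudnn flags below,
    usually at some speed cost.
    """
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```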

Best regards,

Carles

JanySunny commented 5 years ago

@carlesventura Thanks for your kind answer. Then, how should the final model be chosen after one training run (e.g., 40 or more epochs)? Should I take the one with the best validation loss (obtained with the train-val subset), even though that might overfit to the train-val subset? Or should I test several (or all) checkpoints on the test set? That seems inadvisable. Thank you.
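For example, something like this is what I had in mind; a minimal sketch, where `val_loss_history.json` and the checkpoint naming are my own placeholders, not files from this repo:

```python
import json

# Hypothetical log written during training, mapping epoch -> train-val loss,
# e.g. {"1": 0.42, "2": 0.37, ...}; the repo may log this differently.
with open("val_loss_history.json") as f:
    val_loss = json.load(f)

# Pick the epoch whose loss on the train-val subset is lowest.
best_epoch = min(val_loss, key=val_loss.get)

# Hypothetical checkpoint naming scheme.
best_ckpt = f"checkpoints/epoch_{best_epoch}.pth"
print(f"best epoch: {best_epoch} (loss {val_loss[best_epoch]:.4f}) -> {best_ckpt}")
```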

Best regards