Selection of validation dataset

jakeret / tf_unet

Generic U-Net Tensorflow implementation for image segmentation

GNU General Public License v3.0


Selection of validation dataset #163

Open pity2003 opened 6 years ago

pity2003 commented 6 years ago

I noticed that both the validation data and the mini-batch training data are selected from the same set of training images, so the validation and training sets may overlap. Will this lead to over-fitting?

Thanks.

jakeret commented 6 years ago

That's correct, and it is a flaw of the current implementation: by default, the validation is not optimal. This could lead to overfitting, or at least to a biased estimate of the net's performance. I'm not sure whether this can be changed in a way that keeps the implementation backward compatible.

pity2003 commented 6 years ago

My solution is to add a parameter to the "__call__" method of "BaseDataProvider" that indicates whether the current load is for validation or for training/prediction. If the load is for validation, the corresponding files are removed from "self.data_files" and "self.file_idx" is reset to -1. Any mini-batch loaded afterwards then draws only from the remaining files. The drawback is that the validation set has to be loaded before any training data.
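The idea above can be illustrated with a minimal sketch. It does not use tf_unet itself: the class "FileCyclingProvider", its "_next_file" helper, and the "validation" keyword are hypothetical names chosen for illustration, and only the "data_files" and "file_idx" attributes mirror the ones mentioned in the comment. The provider returns file names rather than loaded images to keep the example short.

```python
class FileCyclingProvider:
    """Hypothetical stand-in for a file-based data provider (not tf_unet's class)."""

    def __init__(self, data_files):
        self.data_files = list(data_files)
        self.file_idx = -1

    def _next_file(self):
        # Cycle through the currently available files in order.
        self.file_idx = (self.file_idx + 1) % len(self.data_files)
        return self.data_files[self.file_idx]

    def __call__(self, n, validation=False):
        if validation and n >= len(self.data_files):
            raise ValueError("validation batch would leave no files for training")
        batch = [self._next_file() for _ in range(n)]
        if validation:
            # Hold the validation files out of every later batch.
            held_out = set(batch)
            self.data_files = [f for f in self.data_files if f not in held_out]
            self.file_idx = -1
        return batch


provider = FileCyclingProvider([f"img_{i}.tif" for i in range(6)])
val_batch = provider(2, validation=True)   # must be requested first
train_batch = provider(8)                  # cycles over the remaining 4 files
```

As the comment notes, this only works if the validation batch is requested before any training batch; otherwise files already used for training could end up in the validation set.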

Maybe there is a better solution than mine.

Thanks.

jakeret commented 6 years ago

I was thinking about an approach similar to Keras, where one has to provide two data providers, one for training and one for validation. This would allow for a clean separation.
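A rough sketch of what such a separation could look like, assuming the file list is split up front and each part is wrapped in its own provider. The helpers "split_files" and "make_provider" are hypothetical and not part of tf_unet; how a trainer would accept the second provider is left open, as it is in the discussion.

```python
import random


def split_files(files, val_fraction=0.2, seed=0):
    """Split a file list into disjoint (training, validation) lists."""
    if not 0.0 < val_fraction < 1.0:
        raise ValueError("val_fraction must be between 0 and 1")
    files = sorted(files)
    # Seeded shuffle so the split is reproducible across runs.
    random.Random(seed).shuffle(files)
    n_val = max(1, int(round(len(files) * val_fraction)))
    if n_val >= len(files):
        raise ValueError("not enough files left for training")
    return files[n_val:], files[:n_val]


def make_provider(files):
    """Return a callable that yields n file names per call, cycling over `files`."""
    idx = -1

    def provider(n):
        nonlocal idx
        batch = []
        for _ in range(n):
            idx = (idx + 1) % len(files)
            batch.append(files[idx])
        return batch

    return provider


all_files = [f"img_{i}.tif" for i in range(10)]
train_files, val_files = split_files(all_files, val_fraction=0.2)
train_provider = make_provider(train_files)
val_provider = make_provider(val_files)
```

Because the two providers never share a file, the order in which training and validation batches are requested no longer matters.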