jocicmarko / ultrasound-nerve-segmentation

Deep Learning Tutorial for Kaggle Ultrasound Nerve Segmentation competition, using Keras
MIT License

Normalizing data and its effect on validation data #55

Closed SorourMo closed 6 years ago

SorourMo commented 6 years ago

Hi there, thank you so much for sharing your code. I have a question regarding the normalization of the data. You gather all of the training data into one array (imgs_train), calculate its mean and std, normalize the whole of imgs_train, and feed it to the network. However, imgs_train is later split into two parts: a training set and a validation set. When you normalize the data, you shouldn't include the validation data in the statistics. You should set the validation data apart, normalize the rest (let's call it subtraining = training - validation), and then, before feeding the validation data to the network, normalize it with the mean and std of subtraining. It isn't fair to validate the network's progress on data that has already influenced training: the validation data becomes biased.

I'd like to know your opinion about this issue and thank you again for this valuable code.
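The leakage-free procedure described above could be sketched like this (a minimal NumPy sketch; `split_and_normalize` is a hypothetical helper for illustration, not part of the repo's code):

```python
import numpy as np

def split_and_normalize(imgs, split=0.2, seed=42):
    """Split imgs into sub-training and validation sets, then normalize
    BOTH using statistics computed on the sub-training set only,
    so no validation information leaks into the normalization."""
    rng = np.random.RandomState(seed)
    idx = rng.permutation(len(imgs))
    n_val = int(len(imgs) * split)
    val, sub = imgs[idx[:n_val]], imgs[idx[n_val:]]

    # Statistics come from the sub-training set alone.
    mean, std = sub.mean(), sub.std()
    sub = (sub - mean) / std
    val = (val - mean) / std  # validation reuses sub-training stats

    return sub, val, mean, std
```

The same `mean` and `std` would also be applied to the test images at inference time, for consistency with training.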

jocicmarko commented 6 years ago

Hi @Altosm, thanks for your input. You are 100% right! This was an oversight: the initial code I released didn't have a validation phase at all, so normalizing on the whole dataset was completely fine back then.

Cheers and happy training! :) Marko

SorourMo commented 6 years ago

Great. So instead of using validation_split=0.2 in model.fit, it might be better to use something like train_test_split from the sklearn package to split the training data into subtraining and validation sets up front, and then do the normalization on just the subtraining data.
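That workflow might look like the following sketch (the arrays here are random stand-ins for imgs_train / imgs_mask_train, and the model.fit call is shown only as a comment since it assumes a compiled Keras model):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for imgs_train / imgs_mask_train.
X = np.random.rand(100, 64, 80, 1).astype('float32')
y = (np.random.rand(100, 64, 80, 1) > 0.5).astype('float32')

# Split first, instead of relying on validation_split=0.2 inside model.fit.
X_sub, X_val, y_sub, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Normalize with sub-training statistics only; apply them to validation too.
mean, std = X_sub.mean(), X_sub.std()
X_sub = (X_sub - mean) / std
X_val = (X_val - mean) / std

# Then train with an explicit validation set, e.g.:
# model.fit(X_sub, y_sub, validation_data=(X_val, y_val), ...)
```

Passing validation_data explicitly also makes it obvious in the code that the validation set never contributes to any preprocessing statistics.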

jocicmarko commented 6 years ago

Exactly! :)