Using validation data

seg / 2016-ml-contest

Machine learning contest - October 2016 TLE

Apache License 2.0


It is usually a bad idea to let the validation data affect the trained model, for example by including the validation data when computing the mean for standardization, as this can lead to overly optimistic validation scores.

If there are already lots of boreholes drilled with log data but no core facies classification, and the goal is simply to classify these wells, I could imagine potentially including all of the log data when standardizing the data. If the goal is instead to make something which can be applied to the existing wells, and also any future boreholes that are drilled, then I think it would be unwise.
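To make the two scenarios concrete, here is a minimal sketch using only the Python standard library. The log values and the names `fit_standardizer` and `standardize` are made up for illustration and are not taken from the contest code. It only shows that the same validation readings get different standardized values depending on whether the validation logs were included when computing the mean and standard deviation.

```python
from statistics import mean, pstdev

def fit_standardizer(values):
    """Return (mean, population std) computed from the given values only."""
    return mean(values), pstdev(values)

def standardize(values, mu, sigma):
    """Apply a previously fitted mean and std to a list of values."""
    return [(v - mu) / sigma for v in values]

# Hypothetical readings from a single log curve; the numbers are invented.
train_logs = [60.0, 70.0, 80.0, 90.0]
validation_logs = [100.0, 120.0]

# Scenario 2 (future boreholes): statistics come from the training wells only.
mu_train, sigma_train = fit_standardizer(train_logs)
val_blind = standardize(validation_logs, mu_train, sigma_train)

# Scenario 1 (wells already logged): statistics also see the validation logs,
# though never the validation facies labels.
mu_all, sigma_all = fit_standardizer(train_logs + validation_logs)
val_with_validation = standardize(validation_logs, mu_all, sigma_all)

print(val_blind)
print(val_with_validation)
```

In the second case the validation wells have pulled the mean and spread towards themselves, so their standardized values sit closer to zero than they would for a truly unseen well.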

Should we make this against the rules of the contest, or is it permissible in this case?

Alan, thanks for raising this, it's a good point.

Given that the data are already in the repo, we concluded earlier that we wouldn't try to tell people how to use any of the given data. So we've basically said that it's like the first scenario in your second paragraph.

My take, FWIW: in such a small dataset, it's quite likely that the training data do not represent the entire variance in all the logs... in a larger dataset, it would seem more likely that you could predict completely blind.
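One crude way to probe this, sketched below under assumptions: count how many readings in a new well fall outside the minimum and maximum seen in the training wells. The helper `fraction_outside_range` and the numbers are hypothetical, and range coverage is only a rough proxy for whether the training data represent the variance in the logs.

```python
def fraction_outside_range(train_values, new_values):
    """Fraction of new readings that lie outside the training min/max."""
    lo, hi = min(train_values), max(train_values)
    outside = [v for v in new_values if v < lo or v > hi]
    return len(outside) / len(new_values)

# Hypothetical readings from a single log curve; the numbers are invented.
train_logs = [60.0, 70.0, 80.0, 90.0]
new_well_logs = [65.0, 85.0, 100.0, 120.0]

uncovered = fraction_outside_range(train_logs, new_well_logs)
print(uncovered)
```

A high fraction would suggest the training wells do not span the conditions in the new well, which is the situation where standardizing on training data alone is most likely to hurt.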

(BTW, with KGS we are building a larger, open-access version of this dataset. When this dataset is released, I think this would make an interesting research question.)

Using validation data #126