giuliaripellino / GOAR-ML-Project


hyper-parameter scan #27

Open olgeet opened 8 months ago

olgeet commented 8 months ago

- finestResSideLength should be decreased until the figure stops changing
- minimumCountLimit should be set so that we get a smooth histogram ("blocky", in your words @gallenaxel)
- I don't think the two remaining parameters should affect the output, so we need to understand why that happens.

Let's come back to it after the min/max normalisation and the finestResSideLength convergence.
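A minimal sketch of what the finestResSideLength convergence scan could look like. Note that `estimate_density`, `eval_points`, and the convergence tolerance are placeholders and assumptions for illustration, not the project's actual API:

```python
import numpy as np

def scan_finest_res_side_length(data, eval_points, side_lengths, estimate_density,
                                tolerance=1e-3):
    """Decrease finestResSideLength until the estimated density stops changing.

    `estimate_density(data, finestResSideLength=s)` is assumed to return a callable
    that evaluates the density estimate on a fixed grid, so successive estimates
    can be compared point by point.
    """
    previous = None
    for side_length in sorted(side_lengths, reverse=True):   # scan from coarse to fine
        density = estimate_density(data, finestResSideLength=side_length)
        values = density(eval_points)
        if previous is not None:
            change = np.max(np.abs(values - previous))
            print(f"finestResSideLength={side_length}: max change = {change:.3g}")
            if change < tolerance:
                return side_length        # the figure has stopped changing
        previous = values
    return min(side_lengths)              # no convergence within the scanned range
```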

gallenaxel commented 8 months ago

Pretty sure that the "changes" in the plots when varying the two last variables originate from the random nature of the train/val splitting. I've implemented a "fixed seed" here

olgeet commented 8 months ago

I'm not sure that is the way to go. For the method to be generalisable, it needs to work no matter what the split is.

gallenaxel commented 8 months ago

Sorry, I didn't mean that the actual percentage split was the issue, but the fact that the seed wasn't fixed. When developing with an un-fixed seed, you can run the same code n times and get n different plots, where the only thing that has changed is which values end up in the train and validation datasets. With the fixed seed, you get the same plot every time. The fact that the seed is fixed is very important for the hyper-parameter scanning. Will make this as clear as possible in a presentation this week.
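For what it's worth, a minimal illustration of the difference, assuming the split is done with scikit-learn's train_test_split (the repo's actual splitting code may look different):

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.random.default_rng(0).normal(size=(1000, 2))  # toy stand-in for the real dataset

# Un-fixed seed: every run shuffles differently, so every run produces a different figure.
train_a, val_a = train_test_split(data, test_size=0.2)

# Fixed seed: the same rows land in train/validation on every run,
# so only the hyper-parameters can change the figure during the scan.
train_b, val_b = train_test_split(data, test_size=0.2, random_state=42)
```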

olgeet commented 8 months ago

Yes, but this is what I'm saying. The hyper-parameter scan needs to be valid for every random seed, otherwise we are overtraining by construction. You can absolutely fix the random seed to begin with to remove the randomised nature, but in the end you need to make sure that the conclusions you draw from the scan hold for every randomisation, because the generalisability of the method lies precisely in its random nature.

The method we use estimates the unknown density from which our samples have been drawn, and since all of our datapoints are drawn from the same distribution, we should always get the same density estimate, no matter which points we use for training and validation. Of course, there will be variations due to the limited statistics, but it's important that we see some convergence.
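One possible way to check that convergence across splits, once a fixed-seed scan has settled on a hyper-parameter set. Here `estimate_density` and the seed range are placeholders for illustration, not the project's actual functions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

def check_split_robustness(data, eval_points, estimate_density, n_seeds=10):
    """Re-fit the density estimate for several train/val splits and report the spread.

    If the method generalises, the estimates evaluated on a common grid should
    agree up to statistical fluctuations, whatever the random seed.
    """
    estimates = []
    for seed in range(n_seeds):
        train, _val = train_test_split(data, test_size=0.2, random_state=seed)
        density = estimate_density(train)      # placeholder for the project's estimator
        estimates.append(density(eval_points))
    estimates = np.array(estimates)
    # Relative point-wise spread of the density estimates across seeds.
    spread = estimates.std(axis=0) / (np.abs(estimates.mean(axis=0)) + 1e-12)
    print(f"max relative spread across {n_seeds} seeds: {spread.max():.3g}")
    return spread
```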

I'm not saying this all needs to be done at once, but it's important that we check this before we publish our method.