Create fully deterministic validation generator

matchms / ms2deepscore

Deep learning similarity measure for comparing MS/MS spectra with respect to their chemical similarity

Apache License 2.0


Create fully deterministic validation generator #172

Closed florian-huber closed 8 months ago

florian-huber commented 8 months ago

The current PyTorch data generator already has a validation mode (use_fixed_set = True), but this only guarantees that every epoch sees the same validation data. The validation set still changes with every initialization, so results are not directly comparable across model trainings.

What makes this a bit difficult:

there is randomness in the data generator (could maybe be fixed with a seed)
there is also randomness from the side of SelectedCompoundPairs!
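As a minimal sketch of the "fixed with a seed" idea: the class below is a hypothetical stand-in, not the ms2deepscore generator or SelectedCompoundPairs. It only illustrates that giving the sampler its own seeded NumPy Generator makes two separate initializations produce the same pairs.

```python
import numpy as np


class SeededPairSampler:
    """Hypothetical pair sampler; names and behavior are illustrative only."""

    def __init__(self, spectrum_ids, n_pairs, seed=None):
        self.spectrum_ids = list(spectrum_ids)
        self.n_pairs = n_pairs
        # A private Generator keeps sampling independent of global RNG state.
        self.rng = np.random.default_rng(seed)

    def sample_pairs(self):
        pairs = []
        for _ in range(self.n_pairs):
            # Draw two distinct indices for each pair.
            i, j = self.rng.choice(len(self.spectrum_ids), size=2, replace=False)
            pairs.append((self.spectrum_ids[i], self.spectrum_ids[j]))
        return pairs


ids = [f"spec_{k}" for k in range(20)]
# Two independent initializations with the same seed give identical pairs.
pairs_a = SeededPairSampler(ids, n_pairs=5, seed=42).sample_pairs()
pairs_b = SeededPairSampler(ids, n_pairs=5, seed=42).sample_pairs()
```

This only covers randomness that flows through the seeded Generator. Any other source of randomness in the pipeline would need the same treatment.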

florian-huber commented 8 months ago

I would probably simply make sure that we generate one fixed set of spectrum pairs which we store and re-use.

florian-huber commented 8 months ago

The DataGeneratorPytorch can now work deterministically.

But it is too difficult to make the same happen for select_compound_pairs_wrapper. One reason is that compute_jaccard_similarity_per_bin was parallelized, and we cannot control the exact order of thread execution. There are workarounds, but they seem too complicated, and I might still overlook something in the end.
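To illustrate the ordering problem in general terms, here is a sketch using plain Python threads. It is an assumption-laden toy, not how compute_jaccard_similarity_per_bin is actually parallelized: score_bin and collect_in_fixed_order are hypothetical names. Results arrive in completion order, and one possible workaround is to tag each result with its bin index and sort afterwards.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def score_bin(bin_index):
    # Hypothetical per-bin work standing in for an expensive computation.
    return bin_index, [bin_index * 10 + k for k in range(3)]


def collect_in_fixed_order(n_bins, max_workers=4):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(score_bin, b) for b in range(n_bins)]
        # as_completed yields in completion order, which can differ between runs.
        results = [future.result() for future in as_completed(futures)]
    # Sorting by bin index removes the dependence on thread scheduling.
    results.sort(key=lambda item: item[0])
    return results


ordered = collect_in_fixed_order(8)
```

Sorting fixes the order of the collected results, but it does not help if the random draws themselves happen inside the parallel section, which is part of why storing a fixed set of pairs is the more reliable route.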

So the best (that is, most reliable) option seems to be to create one set of pairs once and store the spectrum IDs and scores.
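A minimal sketch of that idea, assuming each pair is a (spectrum ID, spectrum ID, score) tuple and that JSON is an acceptable storage format. The function names, the file name, and the example IDs and scores are all hypothetical and not part of ms2deepscore.

```python
import json
import os
import tempfile


def save_validation_pairs(pairs, path):
    # Store each pair as a small record so the file stays human-readable.
    records = [{"id_1": a, "id_2": b, "score": s} for a, b, s in pairs]
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(records, handle, indent=2)


def load_validation_pairs(path):
    with open(path, encoding="utf-8") as handle:
        records = json.load(handle)
    return [(r["id_1"], r["id_2"], r["score"]) for r in records]


# Made-up example pairs: (spectrum ID, spectrum ID, similarity score).
pairs = [
    ("spec_001", "spec_017", 0.82),
    ("spec_004", "spec_120", 0.15),
    ("spec_033", "spec_034", 0.97),
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "validation_pairs.json")
    save_validation_pairs(pairs, path)
    reloaded = load_validation_pairs(path)
```

Every training run that loads the same file then validates on exactly the same pairs, regardless of seeds or thread scheduling.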

florian-huber commented 8 months ago

We went a different way --> #177

This means, however, that we need to use a fixed set of validation spectra to guarantee consistent, reproducible results.