Closed florian-huber closed 8 months ago
I would probably simply make sure that we generate one fixed set of spectrum pairs which we store and re-use.
The `DataGeneratorPytorch` can now work deterministically. But it is too difficult to make the same happen for `select_compound_pairs_wrapper`. One of the reasons is that `compute_jaccard_similarity_per_bin` was parallelized and we cannot control the exact order of the thread execution. There are workarounds, but that seems a bit too complicated. And I might still overlook things at the end.
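For illustration, one generic workaround for order-dependent parallel results is to tag every result with its input index and sort afterwards. This is only a minimal sketch using the standard library: `score_bin` is a hypothetical stand-in and not the actual `compute_jaccard_similarity_per_bin`, and it does not cover every source of non-determinism (e.g. order-dependent floating-point accumulation).

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def score_bin(bin_index):
    """Hypothetical per-bin computation; returns (bin_index, result)."""
    return bin_index, sum(range(bin_index + 1))


def scores_in_canonical_order(n_bins, max_workers=4):
    """Run the per-bin work in threads, then restore a fixed order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(score_bin, i) for i in range(n_bins)]
        # Completion order depends on thread scheduling and may vary per run.
        results = [future.result() for future in as_completed(futures)]
    # Sorting by bin index removes the dependence on thread scheduling.
    return sorted(results)
```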
So the best option (the most reliable one, that is) seems to be to just create one set of pairs and store the spectrum IDs and scores.
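A rough sketch of what "create once, store, re-use" could look like. Everything here is an assumption for illustration: `generate_pairs` is a dummy stand-in for the real pair selection, and the CSV layout and column names are made up.

```python
import csv
import random


def generate_pairs(spectrum_ids, n_pairs, seed):
    """Hypothetical stand-in for pair selection: random ID pairs with a dummy score."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        id_a, id_b = rng.sample(spectrum_ids, 2)
        pairs.append((id_a, id_b, round(rng.random(), 4)))
    return pairs


def save_pairs(pairs, path):
    """Write (spectrum_id_1, spectrum_id_2, score) rows to a CSV file."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["spectrum_id_1", "spectrum_id_2", "score"])
        writer.writerows(pairs)


def load_pairs(path):
    """Read the stored pairs back, converting the score to float."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        return [
            (row["spectrum_id_1"], row["spectrum_id_2"], float(row["score"]))
            for row in reader
        ]
```

Every training run would then call `load_pairs` on the same file instead of re-running the pair selection, so the order of thread execution no longer matters.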
We went a different way --> #177
This means, however, that we need to use a fixed set of validation spectra to guarantee consistent, reproducible results.
While the current PyTorch data generator already has a validation mode (`use_fixed_set = True`), this only guarantees that every epoch will see the same validation data. But it changes for every initialization and hence is not well comparable across model trainings.

What makes this a bit difficult is:
`SelectedCompoundPairs`!
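To illustrate the difference between "same per epoch" and "same per initialization": a selection that depends only on the spectrum IDs and an explicit seed gives the same validation set for every model training. A minimal sketch, assuming spectra can be addressed by string IDs; `select_fixed_validation_ids` is a hypothetical helper and not part of the existing generator.

```python
import random


def select_fixed_validation_ids(spectrum_ids, n_validation, seed=42):
    """Pick validation IDs that depend only on the ID set and the seed."""
    # A local RNG keeps the choice independent of the global random state.
    rng = random.Random(seed)
    # Sorting first makes the result independent of the input order.
    return sorted(rng.sample(sorted(spectrum_ids), n_validation))
```

Storing the returned IDs alongside the model would additionally protect against changes in the underlying spectrum collection.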