First: sorry, this got a bit out of hand due to running isort, which means that most of the changed files merely contain import linting.
The actual main part is the addition of functions to compute future training pairs in a way that will close #144, #127, and #107.
A new generator (`DataGeneratorCherrypicked` ... name is up for discussion ;) ) based on new functions which are called in the following way:

```python
from ms2deepscore.spectrum_pair_selection import (compute_spectrum_pairs,
                                                  SelectedCompoundPairs)

scores_selected, inchikeys14 = compute_spectrum_pairs(spectrums)
scp = SelectedCompoundPairs(scores_selected, inchikeys14)
scp.next_pair_for_inchikey("... here: some inchikey")  # class contains a counter for each inchikey
```
The underlying function `compute_jaccard_similarity_matrix_cherrypicking` will pick (on average) `max_pairs_per_bin` pairs for each score bin in `selection_bins`. Say we use 20 pairs per bin and our default 10 bins between 0 and 1; then the function would output 200 scores plus 200 row and 200 column values per compound/fingerprint/inchikey.
This is much less than the all-vs-all approach we used so far! (e.g. from 25,000 × 25,000 down to 25,000 × 600)
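To illustrate the idea (not the actual implementation in this PR), here is a minimal sketch of per-bin cherrypicking: for each compound (row of a similarity matrix), keep at most `max_pairs_per_bin` partners per score bin instead of all N partners. The function name `cherrypick_pairs` and the dense-matrix input are assumptions for this sketch only.

```python
import numpy as np

def cherrypick_pairs(similarities, selection_bins, max_pairs_per_bin, seed=0):
    """Sketch: return (scores, rows, cols) with at most
    max_pairs_per_bin entries per score bin per compound."""
    rng = np.random.default_rng(seed)
    scores, rows, cols = [], [], []
    for i in range(similarities.shape[0]):
        row = similarities[i]
        for low, high in selection_bins:
            # candidates whose similarity falls into this (low, high] bin,
            # excluding the trivial self-pair
            candidates = np.where((row > low) & (row <= high))[0]
            candidates = candidates[candidates != i]
            if len(candidates) > max_pairs_per_bin:
                candidates = rng.choice(candidates, max_pairs_per_bin,
                                        replace=False)
            for j in candidates:
                scores.append(row[j])
                rows.append(i)
                cols.append(j)
    return np.array(scores), np.array(rows), np.array(cols)

# 10 equal bins between 0 and 1, 20 pairs per bin, as in the example above
bins = [(i / 10, (i + 1) / 10) for i in range(10)]
sim = np.random.default_rng(42).random((50, 50))
sim = (sim + sim.T) / 2  # make it symmetric, like a real similarity matrix
scores, rows, cols = cherrypick_pairs(sim, bins, max_pairs_per_bin=20)
# each compound now contributes at most 10 bins * 20 pairs = 200 entries
```

The sparse (scores, rows, cols) triplet is what makes the memory savings possible: storage grows with `n_compounds * n_bins * max_pairs_per_bin` instead of `n_compounds ** 2`.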