matchms / ms2deepscore

Deep learning similarity measure for comparing MS/MS spectra with respect to their chemical similarity
Apache License 2.0

New pair generation #145

Closed florian-huber closed 11 months ago

florian-huber commented 11 months ago

First: sorry, this got a bit out of hand because I ran isort, which means that most of the changed files merely contain import-order changes.

The actual main part is the addition of functions to compute future training pairs in a way that will close #144, #127, and #107.

There is a new generator (DataGeneratorCherrypicked ... name is up for discussion ;) ) based on new functions, which are called in the following way:

from ms2deepscore.spectrum_pair_selection import (compute_spectrum_pairs,
    SelectedCompoundPairs)

scores_selected, inchikeys14 = compute_spectrum_pairs(spectrums)
scp = SelectedCompoundPairs(scores_selected, inchikeys14)
scp.next_pair_for_inchikey("... here: some inchikey")  # class contains a counter for each inchikey
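To illustrate what "a counter for each inchikey" could mean in practice, here is a minimal toy sketch. It is not the SelectedCompoundPairs implementation: the class name PairCycler, its method, the placeholder keys, and the wrap-around behaviour once all pairs of a key have been served are all assumptions made for illustration.

```python
from collections import defaultdict


class PairCycler:
    """Hypothetical stand-in: serves the preselected partners of a key one by one."""

    def __init__(self, pairs_by_key):
        # pairs_by_key maps a key to a list of (partner_key, score) tuples
        self._pairs = pairs_by_key
        # One independent counter per key, starting at 0
        self._counter = defaultdict(int)

    def next_pair_for_key(self, key):
        candidates = self._pairs[key]
        # Assumption: start over from the first pair once the list is exhausted
        idx = self._counter[key] % len(candidates)
        self._counter[key] += 1
        return candidates[idx]


# Placeholder 14-character keys, not real InChIKeys
pairs = {
    "AAAAAAAAAAAAAA": [("BBBBBBBBBBBBBB", 0.9), ("CCCCCCCCCCCCCC", 0.1)],
    "BBBBBBBBBBBBBB": [("AAAAAAAAAAAAAA", 0.9)],
}
cycler = PairCycler(pairs)
first = cycler.next_pair_for_key("AAAAAAAAAAAAAA")
second = cycler.next_pair_for_key("AAAAAAAAAAAAAA")
third = cycler.next_pair_for_key("AAAAAAAAAAAAAA")  # wraps around to the first pair
```

Because each key has its own counter, repeated calls for one key walk through its preselected pairs without affecting the position of any other key.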

The underlying function compute_jaccard_similarity_matrix_cherrypicking will pick (on average) max_pairs_per_bin pairs for each score bin in selection_bins. Say we use 20 pairs per bin and our default 10 bins between 0 and 1, then the function would output 200 scores + 200 row and 200 column values (600 values in total) per compound/fingerprint/inchikey. This is far less than the all-vs-all approach we used so far! (e.g. from 25,000 x 25,000 to 25,000 x 600)
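The per-bin selection idea can be sketched for a single row of a score matrix as follows. This is a rough illustration only, not the code from this PR: the helper select_pairs_per_bin, the random subsampling within a bin, the handling of bin edges, and the uniformly random toy scores are all assumptions.

```python
import numpy as np


def select_pairs_per_bin(scores_row, selection_bins, max_pairs_per_bin, rng):
    """Hypothetical helper: pick up to max_pairs_per_bin column indices per score bin."""
    picked_cols = []
    for i, (low, high) in enumerate(selection_bins):
        if i == 0:
            # First bin includes its lower edge so that a score of exactly 0 is not lost
            in_bin = np.where((scores_row >= low) & (scores_row <= high))[0]
        else:
            in_bin = np.where((scores_row > low) & (scores_row <= high))[0]
        if len(in_bin) > max_pairs_per_bin:
            # Assumption: subsample at random when a bin holds too many candidates
            in_bin = rng.choice(in_bin, size=max_pairs_per_bin, replace=False)
        picked_cols.extend(in_bin.tolist())
    cols = np.array(picked_cols, dtype=int)
    return cols, scores_row[cols]


# 10 equal-width bins between 0 and 1, 20 pairs per bin
edges = np.linspace(0, 1, 11)
selection_bins = list(zip(edges[:-1], edges[1:]))

rng = np.random.default_rng(0)
scores_row = rng.random(25_000)  # toy scores of one compound against 25,000 others

cols, scores = select_pairs_per_bin(scores_row, selection_bins, 20, rng)
# Stored per compound: one score, one row index, and one column index per selected pair
values_per_compound = 3 * len(cols)
```

With enough candidates in every bin, this keeps 10 x 20 = 200 pairs for the row, i.e. 600 stored values, instead of all 25,000 scores.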