matchms / ms2deepscore

Deep learning similarity measure for comparing MS/MS spectra with respect to their chemical similarity
Apache License 2.0

Change cherry picking algorithm #146

Closed niekdejonge closed 11 months ago

niekdejonge commented 11 months ago

There is a risk of large fluctuations in the number of pairs per bin from one inchikey to the next.

If you have 5 spectra in a row with no inchikey between 0.9 and 1.0, and the next spectrum has more than 100, this implementation will result in 100 pairs for this inchikey, even if the next inchikey also has more matches in this bin.

The behaviour I would prefer is that it increases the max per bin for all other inchikeys (including the ones already calculated). One option would be to first store up to 2x max_pairs_per_bin pairs per inchikey, if available, and then, after calculating all Tanimoto scores, determine how high the per-inchikey maximum should be so that the average matches the defined max_pairs_per_bin.

This will be a bit more complex to implement and will add some overhead and intermediate storage, but I think it is still doable, and it reduces the risk of introducing other biases, like oversampling clusters with many similar inchikeys. It would also make the resulting max_pairs_global always 0 (unless the distribution is very extreme).
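
A rough sketch of the two-pass idea described above, not the actual ms2deepscore code. The function names (`collect_candidates`, `find_cap`, `trim_candidates`), the `oversample` parameter, and the dense `scores` matrix (one row per inchikey) are all hypothetical and only serve as illustration. Self-pairs on the diagonal are not treated specially here.

```python
import numpy as np


def collect_candidates(scores, low, high, max_pairs_per_bin, oversample=2):
    """First pass (hypothetical): for each inchikey (row), keep up to
    oversample * max_pairs_per_bin column indices with low < score <= high."""
    limit = oversample * max_pairs_per_bin
    candidates = []
    for row in np.asarray(scores):
        idx = np.nonzero((row > low) & (row <= high))[0]
        candidates.append(idx[:limit])
    return candidates


def find_cap(counts, target_average):
    """Largest integer cap such that the mean of min(count, cap) over all
    inchikeys does not exceed target_average."""
    counts = np.asarray(counts, dtype=int)
    if counts.size == 0:
        return 0
    best = 0
    # The capped total never decreases as cap grows, so we can stop
    # at the first cap that overshoots the target.
    for cap in range(int(counts.max()) + 1):
        if np.minimum(counts, cap).sum() <= target_average * counts.size:
            best = cap
        else:
            break
    return best


def trim_candidates(candidates, max_pairs_per_bin):
    """Second pass (hypothetical): apply one shared cap to every inchikey,
    including the ones that were processed first."""
    cap = find_cap([len(c) for c in candidates], max_pairs_per_bin)
    return [c[:cap] for c in candidates], cap
```

With stored counts of `[0, 5, 20, 20, 20]` and a target average of 10, `find_cap` picks a cap of 15, so inchikeys with many matches compensate for the ones with few, but all of them share the same limit. If even the stored 2x candidates cannot reach the target average, the cap simply ends up at the largest stored count.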

florian-huber commented 11 months ago

Sorry @niekdejonge, my commits in this PR were supposed to end up in #147 ... My goal was to make the code easier to read. Feel free to change/suggest edits!