Currently, the spectrum pairs are selected based on an array with ALL reference scores (e.g. Tanimoto scores), i.e. one score for every combination of spectra.
We then try to counteract the strong bias towards low scores by sampling according to `same_prob_bins`, which defines score bins that are drawn from with equal probability.
This mostly works. It drastically reduces the bias in the scores, though not fully (for many spectra there is simply no pair with a particular score in the data). The biggest concern is that it is very memory-inefficient (see also #127).
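To make the discussion concrete, here is a minimal sketch of the bin-based selection described above. It is not the current implementation: the function name `select_pair_binned`, the treatment of bin edges, and the fallback to another bin are assumptions for illustration. It also shows where the remaining bias comes from: when the chosen bin is empty for a spectrum, the draw has to fall back to a different bin. Self-pairs are not excluded here.

```python
import numpy as np


def select_pair_binned(scores_row, same_prob_bins, rng):
    """Pick a partner index for one spectrum via equal-probability score bins.

    scores_row: 1D array with the scores of one spectrum against all spectra.
    same_prob_bins: list of (low, high) tuples, assumed sorted in ascending order.
    Returns None if no bin contains any candidate.
    """
    n_bins = len(same_prob_bins)
    # A random bin order means every bin is equally likely to be tried first.
    for bin_idx in rng.permutation(n_bins):
        low, high = same_prob_bins[bin_idx]
        if bin_idx == n_bins - 1:
            # Assumption: the last bin also includes its upper edge.
            mask = (scores_row >= low) & (scores_row <= high)
        else:
            mask = (scores_row >= low) & (scores_row < high)
        candidates = np.flatnonzero(mask)
        if candidates.size > 0:
            return int(rng.choice(candidates))
        # Empty bin for this spectrum: fall through to the next bin.
        # This fallback is the source of the residual bias.
    return None


rng = np.random.default_rng(0)
scores_row = np.array([0.05, 0.1, 0.02, 0.15, 0.95])
bins = [(0.0, 0.5), (0.5, 1.0)]
partner = select_pair_binned(scores_row, bins, rng)
```

With these toy scores, index 4 (the only high-scoring partner) should be drawn about half of the time, even though it is only one of five candidates.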
I think both things should be reconsidered together. This likely means we also need to provide a full routine that starts from matchms Spectrum objects (with or without fingerprints) and does not yet expect a precomputed score array...
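One possible direction, sketched under assumptions: compute the reference scores one row at a time from the fingerprints instead of holding the full matrix. A dense float64 matrix for n spectra needs 8·n² bytes (about 80 GB for 100,000 spectra, an illustrative number, not one from our data), while a single row only needs 8·n bytes. The helper `tanimoto_row` below is hypothetical, and it assumes the fingerprints have already been collected into a boolean 2D NumPy array; how they are obtained from the Spectrum objects is left out here.

```python
import numpy as np


def tanimoto_row(fingerprints, i):
    """Tanimoto scores between fingerprint i and all fingerprints.

    fingerprints: boolean array of shape (n_spectra, n_bits).
    Returns a float array of shape (n_spectra,).
    """
    intersection = (fingerprints & fingerprints[i]).sum(axis=1)
    union = (fingerprints | fingerprints[i]).sum(axis=1)
    # Assumption: the score is defined as 0 where both fingerprints are empty.
    scores = np.zeros(len(fingerprints), dtype=float)
    np.divide(intersection, union, out=scores, where=union > 0)
    return scores


# Toy fingerprints; the last one is empty.
fingerprints = np.array(
    [[1, 1, 0, 0],
     [1, 0, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 0, 0]],
    dtype=bool,
)

for i in range(len(fingerprints)):
    row = tanimoto_row(fingerprints, i)
    # A pair for spectrum i could be selected from `row` here,
    # after which the row can be discarded again.
```

The trade-off is that scores are recomputed whenever a spectrum is revisited (e.g. every epoch), so this exchanges memory for compute.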