matchms / ms2deepscore

Deep learning similarity measure for comparing MS/MS spectra with respect to their chemical similarity

Calculate tanimoto scores during training #127

Open niekdejonge opened 1 year ago

niekdejonge commented 1 year ago

For training we currently need a full Tanimoto score matrix. For the current GNPS library (25,000 InChIKeys) this is calculated relatively quickly, but it requires a file of more than 5 GB. Since the matrix scales quadratically with the number of InChIKeys, this does not seem to be a suitable solution for ever larger libraries. I already got issues reported for MS2Query where the server did not have enough RAM to calculate this for 100,000 vs 100,000 spectra. It might be possible to include the calculation in the data generators instead. This would make training more straightforward for the user and scale better.

The fingerprints should of course still be precalculated and stored, but this scales linearly and is therefore not such a big issue. My expectation is that this will not slow down training by much, and it may even improve training time for larger datasets, since not all Tanimoto score combinations will have to be calculated. A minimal sketch of the idea follows below.
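To make the idea concrete, here is a minimal sketch in plain NumPy (not the ms2deepscore API; `tanimoto` and `pair_scores` are hypothetical helpers) of scoring only the pairs a generator actually draws, from binary fingerprints that were precomputed once:

```python
import numpy as np

def tanimoto(fp_a: np.ndarray, fp_b: np.ndarray) -> float:
    """Tanimoto score of two binary fingerprint vectors."""
    intersection = np.logical_and(fp_a, fp_b).sum()
    union = np.logical_or(fp_a, fp_b).sum()
    return intersection / union

def pair_scores(fingerprints: np.ndarray, pair_indices):
    """Score pairs lazily. `fingerprints` has shape (n_inchikeys, n_bits),
    dtype=bool, and is precomputed once (linear in library size)."""
    for i, j in pair_indices:
        yield i, j, tanimoto(fingerprints[i], fingerprints[j])
```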

I realize that one of the steps in the pair selection process is to pick a match from a Tanimoto bin. On the fly, this would mean the generator keeps calculating Tanimoto scores until it finds a match in that bin, which might be time consuming (see the sketch below). Do you think this would be an issue?
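For illustration, a sketch of that rejection loop under the same assumptions as above (`sample_pair_in_bin` is a hypothetical helper, not existing ms2deepscore code):

```python
import numpy as np

rng = np.random.default_rng(0)

def tanimoto(fp_a, fp_b):
    """Tanimoto score of two binary fingerprint vectors."""
    return np.logical_and(fp_a, fp_b).sum() / np.logical_or(fp_a, fp_b).sum()

def sample_pair_in_bin(fingerprints, anchor, low, high, max_tries=10_000):
    """Draw random partners for `anchor` until one lands in the requested
    Tanimoto bin [low, high). For sparsely populated bins this rejection
    loop can need thousands of draws, which is the concern raised above."""
    for _ in range(max_tries):
        partner = int(rng.integers(len(fingerprints)))
        score = tanimoto(fingerprints[anchor], fingerprints[partner])
        if low <= score < high:
            return partner, score
    return None  # give up: this bin may be (nearly) empty for this anchor
```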

@florian-huber What do you think?

florian-huber commented 1 year ago

Sorry, coming back to this with some delay...

I think that computing the scores on the fly during training is not a good option. We often search for rare cases (say, a Tanimoto score between 0.8 and 0.9), and that could mean that we need to compute many thousands of scores before getting to the "right" pair.

In my opinion there are more performant options, in particular because we will never train a model on all pairs anyway (100,000 x 100,000 pairs is a lot).

1) We can switch to a sparse Tanimoto score array that only contains a fraction of the scores. This could, for instance, be done by aiming at class balance, i.e. similar proportions of pairs for all Tanimoto score bins. For example: if we have 1e8 scores between 0.2 and 0.3, but only 1e5 scores between 0.8 and 0.9, we could decide to drop nearly all of the pairs between 0.2 and 0.3 (or at least reduce them by a very large amount); a sketch follows the pros and cons below.

Advantages: Fairly easy to implement, since it all happens during preprocessing. The rest of the training pipeline can remain untouched.

Cons: We won't make use of the full variety of the large dataset. And, maybe even more critically, we have to be even more careful about how we select the pairs we keep (which biases do we take into account? Which do we not?)
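A minimal sketch of such class-balanced downsampling into a sparse matrix (assuming NumPy/SciPy; `balance_pairs` is a hypothetical helper, not part of ms2deepscore):

```python
import numpy as np
from scipy.sparse import coo_matrix

rng = np.random.default_rng(0)

def balance_pairs(pairs, scores, n_inchikeys, per_bin=100_000,
                  bins=np.linspace(0.0, 1.0, 11)):
    """Subsample (i, j) pairs so that each Tanimoto bin keeps at most
    `per_bin` entries; abundant bins (e.g. 0.2-0.3) get heavily reduced.
    Returns the kept scores as a sparse matrix."""
    bin_ids = np.digitize(scores, bins)
    kept = []
    for b in np.unique(bin_ids):
        idx = np.flatnonzero(bin_ids == b)
        if len(idx) > per_bin:
            idx = rng.choice(idx, size=per_bin, replace=False)
        kept.extend(idx.tolist())
    kept = np.asarray(kept)
    return coo_matrix((scores[kept], (pairs[kept, 0], pairs[kept, 1])),
                      shape=(n_inchikeys, n_inchikeys))
```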

2) We do a full redesign of the training process. There are very different options for what we could do. For instance: pre-compute all pairs before starting the training, which could even speed up training by a lot (but requires lengthy preprocessing). Or: we switch to a more "intelligent" storage system for similarity scores --> a graph (or graph database). That would likely require a different logic for how we select pairs. One idea: we keep the closest n spectra to each spectrum as edges in the graph, and during selection we randomly choose how far we "hop" through the graph. Another idea: we simply keep a fixed number of edges for each spectrum (say 100 or 1000), which would reduce the amount of data by a lot. This resembles option (1), but it would still allow taking pairs that are several steps away from each other in the graph (see the sketch below).
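For a rough idea of the graph variant, here is a hypothetical sketch (`build_graph` and `sample_by_hopping` are illustrative names, not a proposal for the actual storage backend):

```python
import numpy as np

rng = np.random.default_rng(0)

def build_graph(score_fn, n_nodes, edges_per_node=100):
    """Keep a fixed number of scored edges per spectrum instead of the
    full all-vs-all matrix; storage now grows linearly with library size."""
    graph = {}
    for node in range(n_nodes):
        partners = rng.choice(n_nodes, size=edges_per_node, replace=False)
        graph[node] = [(int(p), score_fn(node, int(p)))
                       for p in partners if int(p) != node]
    return graph

def sample_by_hopping(graph, start, max_hops=3):
    """Random walk from `start`: pairs several hops apart can still be
    drawn, which is the extra flexibility compared to option (1)."""
    node = start
    for _ in range(int(rng.integers(1, max_hops + 1))):
        node, _score = graph[node][int(rng.integers(len(graph[node])))]
    return start, node
```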

florian-huber commented 1 year ago

By the way: graph options will likely also become interesting for matchms in the near future, simply because they are a better way to store the relevant information than the all-vs-all arrays we operate with right now.

niekdejonge commented 1 year ago

Yes, I agree, generating a sparse matrix (or graph) would solve the scalability issue without repeatedly calculating the same scores. Before switching to new ways of generating pairs, I hope to get a more intuitive understanding of their effect, since to me the benefit of DataGeneratorsSpectrums over DataGeneratorInchikeys is still not fully intuitive.

florian-huber commented 11 months ago

I think this will be fixed with #145.