matchms / ms2deepscore

Deep learning similarity measure for comparing MS/MS spectra with respect to their chemical similarity
Apache License 2.0

mass shift data augmentation #133

Closed niekdejonge closed 1 year ago

niekdejonge commented 1 year ago

The current data augmentation does:

What I would like to add: mass shift augmentation. Currently each peak is fixed to a specific bin. By randomly moving peaks within their mass accuracy, you get a better representation of real peaks. This also somewhat mitigates the binning issue. For instance, randomly change each mass by up to 5 ppm. That way, peaks on the border of a bin will occasionally be assigned to a neighboring bin, which helps the model learn during training that these bins can hold the same peak.
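A minimal sketch of what I have in mind, assuming we still have the original m/z values at augmentation time. The function names, the uniform error distribution, and the 1 Da bin width are illustrative assumptions only, not the current ms2deepscore code:

```python
import random

def shift_mz_ppm(mz_values, max_ppm=5.0, rng=None):
    """Return a copy of mz_values, each shifted by a random relative
    offset drawn uniformly from [-max_ppm, +max_ppm] (assumed error model)."""
    rng = rng or random.Random()
    return [mz * (1.0 + rng.uniform(-max_ppm, max_ppm) * 1e-6) for mz in mz_values]

def bin_index(mz, bin_width=1.0, mz_min=0.0):
    """Hypothetical fixed-width binning: index of the bin that contains mz."""
    return int((mz - mz_min) // bin_width)

rng = random.Random(0)  # seeded only to make the example reproducible
peaks = [110.0002, 110.5, 250.9999]
shifted = shift_mz_ppm(peaks, max_ppm=5.0, rng=rng)

# Peaks close to a bin border may land in a neighboring bin after the shift;
# the peak in the middle of its bin (110.5) cannot.
bins_before = [bin_index(mz) for mz in peaks]
bins_after = [bin_index(mz) for mz in shifted]
```

The shift is relative, so a 5 ppm shift at m/z 110 is only about 0.00055 Da. Only peaks that close to a border can change bin.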

@florian-huber What do you think?

niekdejonge commented 1 year ago

I now realize that this is not easy to do in the current implementation, since binned spectra do not contain the original peaks but only the bin indexes. A rewrite is possible where the original data is stored in a binnedspectrum and can be used.

An alternative is to simply increase or decrease the bin index by one at random. This would be very easy to implement, but the downside is that it does not distinguish between peaks near the lower edge and peaks near the upper edge of a bin. For example, a value of 110.01 will go into the 109 bin as often as into the 111 bin, whereas with the first implementation it will often end up in the 109 bin but never in the 111 bin.
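Roughly, the simple alternative could look like the sketch below. It assumes a binned spectrum is available as a plain list of bin indexes; the function name and the `shift_probability` parameter are made up for illustration:

```python
import random

def shift_bins_randomly(bin_indexes, n_bins, shift_probability=0.1, rng=None):
    """Move each bin index up or down by one with probability shift_probability.

    Direction is chosen with equal chance, so this ignores where inside
    the bin the original peak was (the downside described above).
    """
    rng = rng or random.Random()
    shifted = []
    for idx in bin_indexes:
        if rng.random() < shift_probability:
            idx += rng.choice((-1, 1))
            # Clamp so shifted indexes stay inside the valid bin range.
            idx = min(max(idx, 0), n_bins - 1)
        shifted.append(idx)
    return shifted

augmented = shift_bins_randomly([0, 110, 251, 999], n_bins=1000,
                                shift_probability=0.5, rng=random.Random(1))
```

This sketch does not handle two peaks ending up in the same bin after the shift; their intensities would still need to be merged in some way.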

justinjjvanderhooft commented 1 year ago

What would be the purpose of mass shift augmentation? Introducing (random) noise? In that case, randomly increasing or decreasing the bin label would work, right? I agree that it would be neater to make it jump to the nearest bin, but that then only works well for values near the bin end/start....

niekdejonge commented 1 year ago

Yes, exactly, and the goal of the random noise would be better generalization to real data.

You also only want it to work for values near the start and the end of a bin, since for real data we do not expect the measurement to be so inaccurate that a peak skips an entire bin. That is the downside of just going up or down one bin at random: it would be less representative of the noise you would see in real data.

justinjjvanderhooft commented 1 year ago

Well, I guess it depends on what kind of noise we want to simulate. The binning is not ideal, but gives a workable solution. Indeed, to simulate binning-related noise, your described approach would work best, but it comes with the downside of having to store more information to do this mapping correctly. I know Kevin displays all original values of the mass fragments in a bin in the fragment plots he created, so you may want to talk to him....

florian-huber commented 1 year ago

@niekdejonge I would be skeptical about a mass shift augmentation. As far as I remember, the augmentations we already have had some effect, but nothing very drastic. A mass shift augmentation would probably not add that much to the story. Plus, it will be a small effect if we keep it realistic (i.e. within typical measurement variation), because only very rarely will a peak jump to a neighboring bin.
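As a rough back-of-the-envelope check, here is a small sketch under simplifying assumptions: peak positions are uniformly distributed within a bin, the mass error is uniform within ±`max_ppm`, and the maximum shift is smaller than the bin width. Under those assumptions the chance that a peak crosses a bin border is δ / (2w), with δ the maximum absolute shift and w the bin width. The function name and the example values are illustrative only:

```python
def bin_jump_probability(mz, max_ppm, bin_width):
    """Approximate probability that a peak changes bin after a random shift.

    Assumes a uniform peak position inside the bin and a uniform shift in
    [-delta, +delta], where delta = mz * max_ppm * 1e-6 < bin_width.
    """
    delta = mz * max_ppm * 1e-6
    if delta >= bin_width:
        raise ValueError("approximation only valid for delta < bin_width")
    return delta / (2.0 * bin_width)

p_wide = bin_jump_probability(mz=500.0, max_ppm=5.0, bin_width=1.0)
p_narrow = bin_jump_probability(mz=500.0, max_ppm=5.0, bin_width=0.1)
```

For m/z 500 and 5 ppm this gives roughly 0.1% of peaks for 1 Da bins and roughly 1% for 0.1 Da bins. Real fragment masses are not uniformly spread within a bin, so these numbers are only an order-of-magnitude indication.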

If we want to expand the data augmentation (which is reasonable), we could maybe think of ways that are closer to actual measurement practice. Maybe we could to some extent mimic the effect of changing collision energies, which leads to systematic shifts in the peak intensities. But that seems already half-way towards synthetic data 😄 .

niekdejonge commented 1 year ago

You are right! I did not fully think through how accurate mass spectrometers are nowadays. In my mind the mass inaccuracies were larger, and therefore switching of mass bins would be more common. Since that is not the case, this would indeed be very uncommon and probably not worth it.

Yes, that is quickly going towards synthetic data indeed. It might be worth trying in a few years, but for now we should maybe still wait a bit.