rformassspectrometry / MetaboCoreUtils

Core utilities for metabolomics.
https://rformassspectrometry.github.io/MetaboCoreUtils/index.html

isotopologue functions updates #44

Closed andreavicini closed 2 years ago

andreavicini commented 2 years ago

Since there are problems with the closest function when duplicates = "closest", I was wondering whether, until the problem is solved, I should use duplicates = "keep" and then remove the duplicates afterwards in the code.

andreavicini commented 2 years ago

It seems to me that with duplicates = "keep" different substitutions can be matched to the same peak. If the intensity check succeeds for more than one substitution, the same peak may be repeated multiple times in the same group. But I can take unique() of the elements in the group to avoid that. Actually, maybe it's better this way, because every substitution gets its intensity checked, not just one.

jorainer commented 2 years ago

Regarding duplicates = "keep": if I'm not wrong, you are calculating all expected isotopologue m/z for a given monoisotopic m/z (e.g. if you have 10 isotopologues you will end up with 10 such m/z values) and you then compare these against all other m/z of peaks in the spectrum in one closest call, something like idx <- closest(mzd, mz_spectrum, duplicates = "keep"). As a result you will always have idx of length 10, no? It will be, for each mzd, the index of the peak in mz_spectrum with the most similar m/z value. And if you use duplicates = "keep" you would still get only one hit for each mzd. Or maybe I'm completely wrong...

andreavicini commented 2 years ago

Yes, that's true. With duplicates = "keep" it can happen that different mzd are matched to the same peak in mz_spectrum. The intensity is checked for each of these matches, and if two matches to the same peak in mz_spectrum happen to pass that check, the index of that peak is added twice to the same group (but by taking unique() everything should be OK). The results are the same when we use a low ppm. With a higher ppm, I noticed on the test examples that the version with duplicates = "keep" tends to form more unnecessary groups than duplicates = "closest". I guess that's because more matches get the chance to have their intensity tested too.
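A minimal sketch of the deduplication step described above, written in plain Python only to illustrate the idea. The function name `build_group`, the per-substitution ratio bounds `lower`/`upper`, and the example numbers are all hypothetical, and indices are 0-based here (R is 1-based). It is not the package's implementation.

```python
def build_group(mono_idx, matches, intensity, lower, upper):
    """Collect peaks whose intensity ratio to the monoisotopic peak is in bounds.

    matches[k] is the peak index matched to substitution k (None = no match).
    With duplicates = "keep" the same peak index can appear several times.
    """
    group = [mono_idx]
    mono_int = intensity[mono_idx]
    for k, j in enumerate(matches):
        if j is None:
            continue
        ratio = intensity[j] / mono_int
        # Hypothetical intensity check: ratio must lie within the bounds
        # defined for substitution k.
        if lower[k] <= ratio <= upper[k]:
            group.append(j)
    # Equivalent of unique(): drop repeated indices, keep first-seen order.
    return list(dict.fromkeys(group))


intensity = [1000.0, 120.0, 5.0]
matches = [1, 1, None]          # two substitutions matched to the same peak
lower = [0.05, 0.10, 0.0]
upper = [0.20, 0.30, 0.1]
group = build_group(0, matches, intensity, lower, upper)
print(group)
```

Both substitutions pass the check against peak 1 in this toy example, so without the final deduplication the peak would be listed twice in the group.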

jorainer commented 2 years ago

Ah, now I think I understood. With duplicates = "closest" a peak from the spectrum would only be matched to a single mzd, while with duplicates = "keep" the same peak from the spectrum could be reported for several isotopologue mzd, correct? I completely forgot how closest works :(
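The difference between the two settings can be sketched with a simplified pure-Python emulation. This is an assumption-laden illustration of the behaviour discussed in this thread, not the MsCoreUtils code: the name `closest_sketch`, the way the ppm limit is computed, and the example m/z values are made up, the table is assumed to be non-empty, and indices are 0-based.

```python
def closest_sketch(x, table, ppm=0.0, tolerance=0.0, duplicates="keep"):
    """For each value in x return the index of the nearest value in table,
    or None if the nearest value is outside the accepted difference."""
    idx = []
    for xi in x:
        j = min(range(len(table)), key=lambda k: abs(table[k] - xi))
        limit = tolerance + xi * ppm * 1e-6  # assumed form of the m/z limit
        idx.append(j if abs(table[j] - xi) <= limit else None)
    if duplicates == "closest":
        # Keep, for every table element, only the x value that is nearest to it.
        best = {}
        for i, j in enumerate(idx):
            if j is None:
                continue
            d = abs(table[j] - x[i])
            if j not in best or d < best[j][1]:
                best[j] = (i, d)
        idx = [j if j is not None and best[j][0] == i else None
               for i, j in enumerate(idx)]
    return idx


mzd = [101.0030, 101.0034]            # two expected isotopologue m/z
mz_spectrum = [100.0, 101.0033, 102.0]
keep = closest_sketch(mzd, mz_spectrum, ppm=10, duplicates="keep")
only_closest = closest_sketch(mzd, mz_spectrum, ppm=10, duplicates="closest")
print(keep, only_closest)
```

In this toy case both expected m/z fall within 10 ppm of the same peak: "keep" reports that peak for both, while "closest" reports it only for the nearer one and leaves the other unmatched.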

Well, then I think it would maybe even be better if we used duplicates = "keep", even if that means that we're increasing the number of possibly wrong assignments.

andreavicini commented 2 years ago

Yes, you are right and I agree, using duplicates = "keep" would be more correct.

andreavicini commented 2 years ago

Yes, from the unit tests it seems so. When the ppm is low, the function identifies all peaks correctly. When it is higher, the function forms several groups of noise peaks, more than before. That surprised me a bit. It is as if the bounds on the intensity are not able to discard many of the peaks matched by m/z.

jorainer commented 2 years ago

Yes, we have rather broad lower and upper intensity limits - that was also why I once asked whether it would be possible to have the bounds include e.g. 90% of the data points instead of 100%.
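One way to picture bounds that cover 90% instead of 100% of the data points is to take central quantiles rather than the minimum and maximum. The sketch below assumes the limits are derived from a collection of observed values (for example intensity ratios); the function name `central_bounds`, the linear interpolation, and the example data are hypothetical and not taken from the package.

```python
def central_bounds(values, coverage=0.9):
    """Return (lower, upper) limits enclosing the central `coverage` fraction
    of `values`, using linear interpolation between sorted data points."""
    s = sorted(values)
    tail = (1 - coverage) / 2

    def quantile(p):
        pos = p * (len(s) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(s) - 1)
        return s[lo] + (s[hi] - s[lo]) * (pos - lo)

    return quantile(tail), quantile(1 - tail)


ratios = [float(i) for i in range(101)]    # made-up data points 0, 1, ..., 100
full = central_bounds(ratios, coverage=1.0)  # same as (min, max)
narrow = central_bounds(ratios, coverage=0.9)
print(full, narrow)
```

With coverage = 1.0 the limits equal the minimum and maximum; lowering the coverage trims the extreme values on both sides, which gives narrower limits and should let the intensity check reject more of the peaks matched by m/z.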