Closed by andreavicini 3 years ago
It seems to me that with `duplicates = "keep"` different substitutions can be matched to the same peak. If the intensity check succeeds for more than one substitution, the same peak may be repeated multiple times in the same group. But I can take `unique()` of the elements in the group to avoid that. Actually, maybe it's better this way, because each substitution gets the chance to have its intensity checked, not just one of them.
Regarding `duplicates = "keep"`: if I'm not wrong, you are calculating all expected isotopologue m/z values for a given monoisotopic m/z (e.g. if you have 10 isotopologues you will end up with 10 such m/z values) and then comparing these against the m/z of all other peaks in the spectrum in one `closest` call, something like `idx <- closest(mzd, mz_spectrum, duplicates = "keep")`. As a result you will always have `idx` of length 10, no? For each `mzd` it will be the index of the peak in `mz_spectrum` with the most similar m/z value. And if you use `duplicates = "keep"` you would still get only one hit for each `mzd`. Or maybe I'm completely wrong...
Yes, that's true. With `duplicates = "keep"` it can happen that different `mzd` are matched to the same peak in `mz_spectrum`. The intensity is checked for each of these matches, and if two matches to the same peak in `mz_spectrum` both pass that check, the index of that peak is added twice to the same group (but taking `unique()` should take care of that). The results are the same when we use a low ppm. With a higher ppm, I noticed on the test examples that the version with `duplicates = "keep"` tends to form more unnecessary groups than the one with `duplicates = "closest"`. I guess that's because more matches get the chance to have their intensity tested too.
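To make the `unique()` step concrete, here is a minimal base R sketch. All values (`idx`, `intensity`, `lower`, `upper`) are made up for illustration and are not taken from the package or its tests; the intensity check is reduced to a simple bounds comparison.

```r
# Hypothetical input: matched peak index for each expected isotopologue m/z,
# as it could look with duplicates kept (peak 2 is matched twice).
idx <- c(2L, 2L, 3L)

# Hypothetical spectrum intensities and per-isotopologue intensity bounds.
intensity <- c(1000, 120, 15)
lower <- c(50, 80, 5)
upper <- c(200, 300, 40)

# Intensity check for every match (NA matches would yield NA here).
ok <- intensity[idx] >= lower & intensity[idx] <= upper

# which() drops NA results; unique() removes the repeated peak index.
group <- unique(idx[which(ok)])
group
```

Both matches to peak 2 pass the check here, so without `unique()` the group would contain that peak twice.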
Ah, now I think I understand. With `duplicates = "closest"` a peak from the spectrum would only be matched to a single `mzd`, while with `duplicates = "keep"` the same peak from the spectrum could be reported for several isotopologue `mzd`, correct? I completely forgot how `closest` works :(
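The difference between the two settings, as described in this thread, can be sketched in base R. `match_nearest()` is a hypothetical stand-in written only for illustration. It is not the `closest` function from the package and may differ from it in details. The m/z values are also made up.

```r
# Hypothetical helper illustrating the two duplicate-handling modes.
match_nearest <- function(x, table, ppm = 0,
                          duplicates = c("keep", "closest")) {
    duplicates <- match.arg(duplicates)
    # For every query m/z, the index of the nearest value in `table`.
    idx <- vapply(x, function(z) which.min(abs(table - z)), integer(1))
    diffs <- abs(table[idx] - x)
    # Discard matches outside the ppm tolerance.
    idx[diffs > x * ppm * 1e-6] <- NA_integer_
    if (duplicates == "closest") {
        # If several queries hit the same peak, keep only the nearest one.
        for (i in unique(idx[!is.na(idx)])) {
            hits <- which(idx == i)
            idx[hits[-which.min(diffs[hits])]] <- NA_integer_
        }
    }
    idx
}

mz_spectrum <- c(100.0000, 101.0030, 102.0100)  # made-up peak m/z values
mzd <- c(101.0020, 101.0045, 102.0100)          # made-up expected m/z values

idx_keep <- match_nearest(mzd, mz_spectrum, ppm = 20, duplicates = "keep")
idx_closest <- match_nearest(mzd, mz_spectrum, ppm = 20,
                             duplicates = "closest")
```

In this toy example the first two expected m/z values both fall within 20 ppm of peak 2. With `"keep"` both report that peak. With `"closest"` only the nearer one does, and the other becomes `NA`, so it never reaches the intensity check.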
Well, then I think it would maybe be even better if we used `duplicates = "keep"`, even if that means that we're increasing the number of possibly wrong assignments.
Yes, you are right and I agree, using `duplicates = "keep"` would be more correct.
Yes, from the unit tests it seems so. When the ppm is low, the function identifies all peaks correctly. When it is higher, the function forms several groups of noise peaks, more than before. That surprised me a bit. It is as if the bounds on the intensity are not able to discard many of the peaks that were matched by m/z.
Yes, we have rather broad lower and upper intensity limits. That was also why I once asked whether it would be possible to have the bounds include e.g. 90% of the data points instead of 100%.
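One possible way to get narrower bounds, sketched under the assumption that the bounds are derived from a set of observed intensity ratios: use the 5% and 95% quantiles instead of the full range. The `ratios` vector is invented for illustration, and this is not how the bounds are currently calculated.

```r
# Made-up isotopologue/monoisotopic intensity ratios with two outliers.
ratios <- c(0.02, 0.09, 0.10, 0.11, 0.10, 0.12, 0.09, 0.11, 0.10, 0.45)

# Bounds covering 100% of the data points: the full range.
bounds_full <- range(ratios)

# Bounds between the 5% and 95% quantiles (the central ~90% of the data).
bounds_90 <- unname(quantile(ratios, probs = c(0.05, 0.95)))

bounds_full
bounds_90
```

The quantile-based interval is pulled in from both outliers, so fewer noise peaks should pass the intensity check, at the cost of possibly rejecting some true isotopologue peaks.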
Since there are problems with the `closest` function with `duplicates = "closest"`, I was wondering whether, until the problem is solved, I should use `duplicates = "keep"` and maybe remove the duplicates afterwards in the code.