Open sneumann opened 7 years ago
This would be a unit test for this case
test.getMoleculeWithPositiveCharge <- function() {
checkEqualsNumeric(852.354928,
getMolecule("C49H56FeN4O6",initializePSE(),z=1)$exactmass,
tolerance = 0.00005)
}
Using the example by Dzmitry Mukha, and exact mass 1563.2434 obtained from http://www.chemcalc.org/analyse?mf=C89H176O16P2&peakWidth=0.001&msResolution=1563243&resolution=0.001&referenceVersion=2013
test.getMoleculeWithLargeMass <- function() {
checkEqualsNumeric(1563.2434,
getMolecule("C89H176O16P2",initializePSE(),z=1)$exactmass,
tolerance = 0.00005)
}
Issue is that Rdisop reports 1564.247
as exact mass, which is the mass of the most abundant
(second) isotope:
> getMolecule("C89H176O16P2",initializePSE(),z=1)$isotopes
[[1]]
[,1] [,2] [,3] [,4] [,5]
[1,] 1563.2433642 1564.2467997 1565.2500834 1.566253e+03 1.567256e+03
[2,] 0.3503048 0.3581471 0.1923895 7.191911e-02 2.093433e-02
http://www.envipat.eawag.ch/index.php also agrees that the monoisotopic mass should be 1563.243363
Comment on the code:
mass of an ComposedElement
is obtained in
https://github.com/sneumann/Rdisop/blob/master/src/disop.cpp#L651
But I am unsure where in the C++ source this is actually calculated.
Well, I have an R-only fix that will likely slow down the function, but works. I am pretty lost in the C code... I am decomposing the molecular formula using a function from the CHNOSZ package, but I am sure this can be ported over. Can't say it is an elegant fix, but this is how the calculations can be done. I will keep looking at the C code to see if I can find where this might be happening, to enable selection there.
getMolecule <- function(formula = "C49H56FeN4O6", elements = NULL, z = 0, maxisotopes=10) {
# Use full PSE unless stated otherwise
if (!is.list(elements) || length(elements)==0 ) {
elements <- initializePSE()
}
# Remember ordering of element names,
# but ensure list of elements is ordered
# by mass
element_order <- sapply(elements, function(x){x$name})
elements <- elements[order(sapply(elements, function(x){x$mass}))]
## MONOISOTOPIC EXTRACTION
mono<-sapply(1:length(elements), FUN = function(x) {
m<-which.max(elements[[x]]$isotope$abundance)
elements[[x]]$mass + (m-1) + elements[[x]]$isotope$mass[m]
}
)
names(mono)<-sapply(1:length(elements), FUN = function(x) {elements[[x]]$name})
# Call imslib to parse formula and calculate
# masses and isotope pattern
molecule <- .Call("getMolecule",
formula, elements, element_order,
z, maxisotopes,
PACKAGE="Rdisop")
## FIX MONOISOTOPIC HERE
library('CHNOSZ')
f<-makeup(formula)
molecule$exactmass <- sum(mono[names(f)]*f)
molecule
}
Some observations:
element.h: does not define getMass function, but line 157 appears to define the MONOISOTOPIC (MONOISOTOPIC is a typedef defined in line 52) mass. Line 169 suggests that the first isotope found with an abundance of > 0.5 is taken as the MONOISOTOPIC. It is unclear to me what happens if there are no isotopes > 0.5 abundance. Line 57 indicates that MONOISOTOPIC is initially set to 0. tests in R suggest that this is not the issue - getMolecule(formula = "C49H56FeN4O6") works, while "C89H176O16P2" does not. This indicates that the unusual Fe isotope patterns is not the source of the problem, but rather that the problem lies in the method used to calculate monoisotopic mass for a molecule It may be that the full isotope pattern is being calculated, then monoisotopic (incorrecly) assigned.
in the disop.cpp file, line 248, decompositions_t appears to be selecting the first index rather than an explicit monoisotopic with this line "decomposer.getDecompositions(masses(0), error);"
in the eispectraanalysis.cpp file, line 249, there is reference to: // updates molecule's isotope distribution to calculate its molecule's monoisotopic mass candidate_molecule.updateIsotopeDistribution(); this updateIsotopeDistribution function could be problematic? I haven't looked into this in depth yet.
also see: "decomposer.getAllDecompositions(integer_mono_mass)" Why is there frequent reference to integer monoisotopic mass? Could this rounding be the problem? Seems highly unlikely, given that we get decimals out.
I was thinking 1. initially, but I don't think that is the problem. Rather, I suspect 2 as the most likely right now.
This is an old discussion. However, I would like to contribute my two cents and make a suggestion to close it. :) Firstly, I dont think that Rdisop does anything wrong here, as 'exact mass' ist not an equivalent term for 'monoisotopic mass'. From the Wikipedia Link @sneumann posted above:
When an exact mass value is given without specifying an isotopic species, it normally refers to the most abundant isotopic species.
Hence, I would suggest to update the documentation to clarify this.
@cbroeckl already suggested a workaround function to compute the monoisotopic mass (MIM). While he used the CHNOSZ
package, I would avoid another dependency and implement a solution to compute the MIM based on the PSE information already present in Rdisop.
Specifically, I would extend Rdisop by a new exported function getMono
which accepts a vector of formulas (or alternatively a list of molecules, to be consistent with the other extractor functions) and returns the true MIMs.
@sneumann let me know if I shall prepare a PR implementing the above solution.
Bests [jan]
Hi Jan, thanks for picking this up ! I am all for clear documentation, and also against growing dependencies where technically not necessary. Yours, Steffen
I created a PR (v1.65.4). Once accepted, #8 could be closed.
Reported by Dzmitry Mukha:
Indeed,
getMass()
is reporting the mass of the most abundant isotope, which is the same as the monoisotopic mass for individual element atoms. However, for molecules, this is not neccessarily the monoisotopic mass, which might be the expected result. For the molecule monoisotopic mass, the sum of the monoisotopic element masses would have to be considered.See also https://en.wikipedia.org/wiki/Mass_(mass_spectrometry)#Monoisotopic_mass for definitions and examples.