sneumann / Rdisop

This is the git repository matching the Bioconductor package Rdisop: Decomposition of Isotopic Patterns
4 stars 9 forks source link

unclear getMass() semantics #8

Open sneumann opened 7 years ago

sneumann commented 7 years ago

Reported by Dzmitry Mukha:

When I used getMass(getMolecule("C89H176O16P2")) in the Rdisop, I have got drastically wrong monoisotopic mass with >1 Da difference. The result is also significantely different from the anticipated average mass (1564.288). You can see below the result of my own function getmass(x) that seems to be correct. I can send you more examples if needed.

Indeed, getMass() is reporting the mass of the most abundant isotope, which is the same as the monoisotopic mass for individual element atoms. However, for molecules, this is not neccessarily the monoisotopic mass, which might be the expected result. For the molecule monoisotopic mass, the sum of the monoisotopic element masses would have to be considered.

See also https://en.wikipedia.org/wiki/Mass_(mass_spectrometry)#Monoisotopic_mass for definitions and examples.

sneumann commented 7 years ago

This would be a unit test for this case

test.getMoleculeWithPositiveCharge <- function() {
     checkEqualsNumeric(852.354928,
                        getMolecule("C49H56FeN4O6",initializePSE(),z=1)$exactmass,
                        tolerance = 0.00005)  
}
sneumann commented 6 years ago

Using the example by Dzmitry Mukha, and exact mass 1563.2434 obtained from http://www.chemcalc.org/analyse?mf=C89H176O16P2&peakWidth=0.001&msResolution=1563243&resolution=0.001&referenceVersion=2013

test.getMoleculeWithLargeMass <- function() {
     checkEqualsNumeric(1563.2434,
                        getMolecule("C89H176O16P2",initializePSE(),z=1)$exactmass,
                        tolerance = 0.00005)  
}

Issue is that Rdisop reports 1564.247 as exact mass, which is the mass of the most abundant (second) isotope:

> getMolecule("C89H176O16P2",initializePSE(),z=1)$isotopes
[[1]]
             [,1]         [,2]         [,3]         [,4]         [,5]
[1,] 1563.2433642 1564.2467997 1565.2500834 1.566253e+03 1.567256e+03
[2,]    0.3503048    0.3581471    0.1923895 7.191911e-02 2.093433e-02

http://www.envipat.eawag.ch/index.php also agrees that the monoisotopic mass should be 1563.243363

sneumann commented 6 years ago

Comment on the code: mass of an ComposedElement is obtained in https://github.com/sneumann/Rdisop/blob/master/src/disop.cpp#L651 But I am unsure where in the C++ source this is actually calculated.

cbroeckl commented 6 years ago

Well, I have an R-only fix that will likely slow down the function, but works. I am pretty lost in the C code... I am decomposing the molecular formula using a function from the CHNOSZ package, but I am sure this can be ported over. Can't say it is an elegant fix, but this is how the calculations can be done. I will keep looking at the C code to see if I can find where this might be happening, to enable selection there.

getMolecule <- function(formula = "C49H56FeN4O6", elements = NULL, z = 0, maxisotopes=10) {
  # Use full PSE unless stated otherwise
  if (!is.list(elements) || length(elements)==0 ) {
    elements <- initializePSE()
  }

  # Remember ordering of element names,
  # but ensure list of elements is ordered
  # by mass
  element_order <- sapply(elements, function(x){x$name})
  elements <- elements[order(sapply(elements, function(x){x$mass}))]

  ## MONOISOTOPIC EXTRACTION
  mono<-sapply(1:length(elements), FUN = function(x) {
    m<-which.max(elements[[x]]$isotope$abundance)
    elements[[x]]$mass + (m-1) + elements[[x]]$isotope$mass[m]
  }
  )
  names(mono)<-sapply(1:length(elements), FUN = function(x) {elements[[x]]$name})

  # Call imslib to parse formula and calculate
  # masses and isotope pattern
  molecule <- .Call("getMolecule",
                    formula, elements, element_order,
                    z, maxisotopes,
                    PACKAGE="Rdisop")

  ## FIX MONOISOTOPIC HERE
  library('CHNOSZ')
  f<-makeup(formula)
  molecule$exactmass <- sum(mono[names(f)]*f)

  molecule
}
cbroeckl commented 6 years ago

Some observations:

  1. element.h: does not define getMass function, but line 157 appears to define the MONOISOTOPIC (MONOISOTOPIC is a typedef defined in line 52) mass. Line 169 suggests that the first isotope found with an abundance of > 0.5 is taken as the MONOISOTOPIC. It is unclear to me what happens if there are no isotopes > 0.5 abundance. Line 57 indicates that MONOISOTOPIC is initially set to 0. tests in R suggest that this is not the issue - getMolecule(formula = "C49H56FeN4O6") works, while "C89H176O16P2" does not. This indicates that the unusual Fe isotope patterns is not the source of the problem, but rather that the problem lies in the method used to calculate monoisotopic mass for a molecule It may be that the full isotope pattern is being calculated, then monoisotopic (incorrecly) assigned.

  2. in the disop.cpp file, line 248, decompositions_t appears to be selecting the first index rather than an explicit monoisotopic with this line "decomposer.getDecompositions(masses(0), error);"

  3. in the eispectraanalysis.cpp file, line 249, there is reference to: // updates molecule's isotope distribution to calculate its molecule's monoisotopic mass candidate_molecule.updateIsotopeDistribution(); this updateIsotopeDistribution function could be problematic? I haven't looked into this in depth yet.

  4. also see: "decomposer.getAllDecompositions(integer_mono_mass)" Why is there frequent reference to integer monoisotopic mass? Could this rounding be the problem? Seems highly unlikely, given that we get decimals out.

I was thinking 1. initially, but I don't think that is the problem. Rather, I suspect 2 as the most likely right now.

janlisec commented 1 month ago

This is an old discussion. However, I would like to contribute my two cents and make a suggestion to close it. :) Firstly, I dont think that Rdisop does anything wrong here, as 'exact mass' ist not an equivalent term for 'monoisotopic mass'. From the Wikipedia Link @sneumann posted above:

When an exact mass value is given without specifying an isotopic species, it normally refers to the most abundant isotopic species.

Hence, I would suggest to update the documentation to clarify this.

@cbroeckl already suggested a workaround function to compute the monoisotopic mass (MIM). While he used the CHNOSZ package, I would avoid another dependency and implement a solution to compute the MIM based on the PSE information already present in Rdisop.

Specifically, I would extend Rdisop by a new exported function getMono which accepts a vector of formulas (or alternatively a list of molecules, to be consistent with the other extractor functions) and returns the true MIMs.

@sneumann let me know if I shall prepare a PR implementing the above solution.

Bests [jan]

sneumann commented 1 month ago

Hi Jan, thanks for picking this up ! I am all for clear documentation, and also against growing dependencies where technically not necessary. Yours, Steffen

janlisec commented 1 month ago

I created a PR (v1.65.4). Once accepted, #8 could be closed.