heavy isotope recognition

cbroeckl commented 2 years ago

Hello all,

i would like to be able to formally use isotopes within this rformassspectrometry environment.

https://iupac.qmul.ac.uk/sectionH/H2.html#2.2

IUPAC has some definitions for this, and it would seem the C6H12O6, with three 13C labels would be written as [13C3]C3H12O6. Would additions of [13C] as an element be feasible? Likewise [2H] - those are the most frequently encountered. I could certainly input accurate mass on my own, when needed, but i always prefer a systematic solution if possible.

michaelwitting commented 2 years ago

Hi Corey,

that should be doable, but we have to think how the regular expression parsing the formula could be tweaked to also find the heavy isotopes.

@jorainer Any idea?

I think we can split the formula into two parts: the first part would be all the heavy isotopes, e.g. with a regex finding all [...], and the remaining part. The remaining part can be treated as we do so far and the heavy isotope part needs some tweaking...

michaelwitting commented 2 years ago

PS: To keep things simple for the moment, I would in a first version only support isotopes often used in metabolomics/lipidomics. That would be, 13C, 15N, 2H, 18O, 34S. Anything else you would need @cbroeckl ?

cbroeckl commented 2 years ago

@michaelwitting - i think simple is fine. The isotopes you list would be a great place to start.

I am a bit surprised that heavy labelled formula are not used. Am i crazy to ask for this? i.e. Pubchem and OpenBabel - they do use heavy labels. i.e. https://pubchem.ncbi.nlm.nih.gov/compound/10241810#section=3D-Conformer . if you download the sdf and read it into R:

`> caf13c <- ChemmineR::read.SDFset('C:/Users/cbroeckl/Downloads/Conformer3D_CID_10241810.sdf')

ChemmineR::exactMassOB(caf13c) CMP1 197.0904 ChemmineR::MF(caf13c) CMP1 "C8H10N4O2" `

so the molecular weight is calculated in recognition that there are 3 13C atoms, but the formula is reported as if there were only natural abundance of 13C. Therefore if the formula becomes isolated from the structure, recalculation from the formula would give a different accurate mass result. I am also not clear that tools like Rdisop or EnviPat can use stable labelled formula to calculate isotope distributions.

I don't want to break all the R chemoinformatic tools out there, but this would seem (to me at least) to start using. If we are building compound libraries for processing these data, it would be useful to ensure the formula for those compounds can be turned into ions in a manner that reflects the fact that they are unnatural isotope distributions.

jorainer commented 2 years ago

It would surely be cool to support also isotopes - let's do it stepwise. First we need to be able to correctly parse the formulas - maybe starting in countElements. @andreavicini can you please check how we could best parse and isolate isotopes in formulas?

Question is how do we count these - at present we're counting elements, maybe we need to treat isotopes as they were their own element? i.e. countElements("[13C3]C3H12O6") would then return:

> countElements("[13C3]C3H12O6")
$`[13C3]C3H12O6`
13C   C   H   O 
  3   3  12   6

Funnily enough counting works already:

> countElements("[13C3]C3H12O6")
$`[13C3]C3H12O6`
 C  C  H  O 
 3  3 12  6

just the element names are not correct.

michaelwitting commented 2 years ago

I already tried a bit around. This is my current solution:

x <- c("[13C3]C3[2H2]H10O6", "C6H12O6")

# split into isotopic and non-isotopic part ------------------------------------
split_pattern <- "(?<Element>\\[.*?\\])"
rx <- gregexpr(pattern = split_pattern, text = x, perl = TRUE)
x_noniso <- gsub(split_pattern, "", x, perl = TRUE)

# count isotopic elements ------------------------------------------------------
## regex pattern to isolate all supported isotopes
isotope_pattern <- paste0(
  "(?<Element>",
  "13C|",
  "2H|",
  "15N|",
  "34S|",
  "18O",
  ")",
  "(?<Number>[0-9]*)"
)

rx <- gregexpr(pattern = isotope_pattern, text = x, perl = TRUE)

isotope <- mapply(function(xx, rr) {
  n <- length(rr)
  start <- attr(rr, "capture.start")
  end <- start + attr(rr, "capture.length") - 1L
  sbstr <- substring(xx, start, end)
  ## set elements without a number in the formula to one
  sbstr[!nchar(sbstr)] <- 1L
  sl <- seq_len(n)
  nm <- sbstr[sl]
  setNames(as.integer(sbstr[n + sl]), nm)
}, xx = x, rr = rx, SIMPLIFY = FALSE, USE.NAMES = TRUE)

# count non-isotopic elements --------------------------------------------------
## regex pattern to isolate all elements
element_pattern <- paste0(
  "(?<Element>",
  "[A][cglmrstu]|",
  "[B][aehikr]?|",
  "[C][adeflmnorsu]?|",
  "[D][bsy]|",
  "[E][rsu]|",
  "[F][elmr]?|",
  "[G][ade]|",
  "[H][efgos]?|",
  "[I][nr]?|",
  "[K][r]?|",
  "[L][airuv]|",
  "[M][cdgnot]|",
  "[N][abdehiop]?|",
  "[O][gs]?|",
  "[P][abdmortu]?|",
  "[R][abefghnu]|",
  "[S][bcegimnr]?|",
  "[T][abcehilms]|",
  "[U]|[V]|[W]|[X][e]|[Y][b]?|[Z][nr]",
  ")",
  "(?<Number>[0-9]*)"
)

rx <- gregexpr(pattern = element_pattern, text = x_noniso, perl = TRUE)

nonisotope <- mapply(function(xx, rr) {
  n <- length(rr)
  start <- attr(rr, "capture.start")
  end <- start + attr(rr, "capture.length") - 1L
  sbstr <- substring(xx, start, end)
  ## set elements without a number in the formula to one
  sbstr[!nchar(sbstr)] <- 1L
  sl <- seq_len(n)
  nm <- sbstr[sl]
  setNames(as.integer(sbstr[n + sl]), nm)
}, xx = x_noniso, rr = rx, SIMPLIFY = FALSE, USE.NAMES = TRUE)

# combine isotope and non-isotope ----------------------------------------------

I think there is a more elegant way of doing it, but it works. We just need to combine the to lists at the end.

jorainer commented 2 years ago

Thanks Micheal! If possible it would however be better (and faster) if all could be done in the same loop. Maybe using e.g. "\[13C" as a pattern?

michaelwitting commented 2 years ago

Might be working. Just for the final list, I think it is cleaner to have 13C instead of [13C]. The brackets are only required in the final formula.

sgibb commented 2 years ago

Isn't prepending [0-9]* enough?

countElements <- function(x) {
    ## regex pattern to isolate all elements
    element_pattern <- paste0(
        "(?<Element>",
            paste0("[0-9]*", c(
                "[A][cglmrstu]|",
                "[B][aehikr]?|",
                "[C][adeflmnorsu]?|",
                "[D][bsy]|",
                "[E][rsu]|",
                "[F][elmr]?|",
                "[G][ade]|",
                "[H][efgos]?|",
                "[I][nr]?|",
                "[K][r]?|",
                "[L][airuv]|",
                "[M][cdgnot]|",
                "[N][abdehiop]?|",
                "[O][gs]?|",
                "[P][abdmortu]?|",
                "[R][abefghnu]|",
                "[S][bcegimnr]?|",
                "[T][abcehilms]|",
                "[U]|[V]|[W]|[X][e]|[Y][b]?|[Z][nr]"),
                collapse = ""
            ),
        ")",
        "(?<Number>[0-9]*)"
    )
    rx <- gregexpr(pattern = element_pattern, text = x, perl = TRUE)

    mapply(function(xx, rr) {
        n <- length(rr)
        start <- attr(rr, "capture.start")
        end <- start + attr(rr, "capture.length") - 1L
        sbstr <- substring(xx, start, end)
        ## set elements without a number in the formula to one
        sbstr[!nchar(sbstr)] <- 1L
        sl <- seq_len(n)
        nm <- sbstr[sl]
        setNames(as.integer(sbstr[n + sl]), nm)
    }, xx = x, rr = rx, SIMPLIFY = FALSE, USE.NAMES = TRUE)
}

x <- c("[13C3]C3[2H2]H10O6", "C6H12O6")
countElements(x)
#$`[13C3]C3[2H2]H10O6`
#13C   C  2H   H   O
#  3   3   2  10   6
#
#$C6H12O6
# C  H  O
# 6 12  6

jorainer commented 2 years ago

Great @sgibb ! Only thing is we need somehow (maybe in a second cleanup step?) remove isotopes that don't make sense (e.g. "[45C3]C3"). Sort of dropping counts for "45C" and throwing a warning that this element/isotope is not supported.

sgibb commented 2 years ago

Could we check against the entries in /inst/isotopes/isotope_definition.txt ?

jorainer commented 2 years ago

Sure. That would be ideal.

jorainer commented 2 years ago

Are you then doing a PR @sgibb or should @andreavicini work on that?

sgibb commented 2 years ago

I will do.

rformassspectrometry / MetaboCoreUtils

heavy isotope recognition #53