Closed cbroeckl closed 2 years ago
Hi Corey,
that should be doable, but we have to think about how the regular expression that parses the formula could be tweaked to also find the heavy isotopes.
@jorainer Any idea?
I think we can split the formula into two parts: the first part would be all the heavy isotopes, e.g. with a regex finding all [...], and the remaining part. The remaining part can be treated as we do so far and the heavy isotope part needs some tweaking...
PS: To keep things simple for the moment, I would in a first version only support isotopes often used in metabolomics/lipidomics. That would be 13C, 15N, 2H, 18O, 34S. Anything else you would need @cbroeckl ?
@michaelwitting - I think simple is fine. The isotopes you list would be a great place to start.
I am a bit surprised that heavy-labelled formulas are not used. Am I crazy to ask for this? PubChem and OpenBabel, for example, do use heavy labels, e.g. https://pubchem.ncbi.nlm.nih.gov/compound/10241810#section=3D-Conformer . If you download the SDF and read it into R:
> caf13c <- ChemmineR::read.SDFset('C:/Users/cbroeckl/Downloads/Conformer3D_CID_10241810.sdf')
> ChemmineR::exactMassOB(caf13c)
CMP1
197.0904
> ChemmineR::MF(caf13c)
CMP1
"C8H10N4O2"
so the molecular weight is calculated in recognition that there are 3 13C atoms, but the formula is reported as if there were only natural abundance of 13C. Therefore if the formula becomes isolated from the structure, recalculation from the formula would give a different accurate mass result. I am also not clear that tools like Rdisop or EnviPat can use stable labelled formula to calculate isotope distributions.
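To make the discrepancy concrete, here is a minimal sketch of the arithmetic for the caffeine example. The monoisotopic masses are rounded literature values hard-coded for illustration, and `massFromCounts` is a hypothetical helper, not a function from any package mentioned in this thread.

```r
## Monoisotopic masses in Da (rounded literature values; hard-coded
## for this sketch only)
mono <- c("C" = 12.0000000, "13C" = 13.0033548, "H" = 1.0078250,
          "N" = 14.0030740, "O" = 15.9949146)

## Hypothetical helper: sum of (mass * count) over a named count vector
massFromCounts <- function(counts) sum(mono[names(counts)] * counts)

## Caffeine as reported by MF(): all carbons treated as 12C
natural <- massFromCounts(c(C = 8, H = 10, N = 4, O = 2))

## Caffeine with three 13C labels, i.e. [13C3]C5H10N4O2
labelled <- massFromCounts(c("13C" = 3, C = 5, H = 10, N = 4, O = 2))

round(c(natural = natural, labelled = labelled), 4)
```

The labelled value agrees with the 197.0904 reported by `exactMassOB`, while recalculating from the plain formula gives a mass about 3.01 Da lower.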
I don't want to break all the R chemoinformatic tools out there, but this would seem (to me at least) worth starting to use. If we are building compound libraries for processing these data, it would be useful to ensure the formulas for those compounds can be turned into ions in a manner that reflects the fact that they have unnatural isotope distributions.
It would surely be cool to also support isotopes - let's do it stepwise. First we need to be able to correctly parse the formulas - maybe starting in countElements. @andreavicini can you please check how we could best parse and isolate isotopes in formulas?
The question is how we count these - at present we're counting elements, so maybe we need to treat isotopes as if they were their own element? I.e. countElements("[13C3]C3H12O6") would then return:
> countElements("[13C3]C3H12O6")
$`[13C3]C3H12O6`
13C C H O
3 3 12 6
Funnily enough counting works already:
> countElements("[13C3]C3H12O6")
$`[13C3]C3H12O6`
C C H O
3 3 12 6
just the element names are not correct.
I already experimented a bit. This is my current solution:
x <- c("[13C3]C3[2H2]H10O6", "C6H12O6")
# split into isotopic and non-isotopic part ------------------------------------
split_pattern <- "(?<Element>\\[.*?\\])"
## keep only the bracketed (isotopic) groups, e.g. "[13C3][2H2]"
x_iso <- vapply(
    regmatches(x, gregexpr(pattern = split_pattern, text = x, perl = TRUE)),
    paste0, character(1), collapse = "")
## drop the bracketed groups to get the non-isotopic remainder, e.g. "C3H10O6"
x_noniso <- gsub(split_pattern, "", x, perl = TRUE)
# count isotopic elements ------------------------------------------------------
## regex pattern to isolate all supported isotopes
isotope_pattern <- paste0(
    "(?<Element>",
    "13C|",
    "2H|",
    "15N|",
    "34S|",
    "18O",
    ")",
    "(?<Number>[0-9]*)"
)
## match on the bracketed part only, so that e.g. "2H" is not found in "C12H22"
rx <- gregexpr(pattern = isotope_pattern, text = x_iso, perl = TRUE)
isotope <- mapply(function(xx, rr) {
    ## formulas without isotopes have no match
    if (rr[1L] == -1L)
        return(integer())
    n <- length(rr)
    start <- attr(rr, "capture.start")
    end <- start + attr(rr, "capture.length") - 1L
    sbstr <- substring(xx, start, end)
    ## set elements without a number in the formula to one
    sbstr[!nchar(sbstr)] <- 1L
    sl <- seq_len(n)
    nm <- sbstr[sl]
    setNames(as.integer(sbstr[n + sl]), nm)
}, xx = x_iso, rr = rx, SIMPLIFY = FALSE, USE.NAMES = FALSE)
names(isotope) <- x
# count non-isotopic elements --------------------------------------------------
## regex pattern to isolate all elements
element_pattern <- paste0(
    "(?<Element>",
    "[A][cglmrstu]|",
    "[B][aehikr]?|",
    "[C][adeflmnorsu]?|",
    "[D][bsy]|",
    "[E][rsu]|",
    "[F][elmr]?|",
    "[G][ade]|",
    "[H][efgos]?|",
    "[I][nr]?|",
    "[K][r]?|",
    "[L][airuv]|",
    "[M][cdgnot]|",
    "[N][abdehiop]?|",
    "[O][gs]?|",
    "[P][abdmortu]?|",
    "[R][abefghnu]|",
    "[S][bcegimnr]?|",
    "[T][abcehilms]|",
    "[U]|[V]|[W]|[X][e]|[Y][b]?|[Z][nr]",
    ")",
    "(?<Number>[0-9]*)"
)
rx <- gregexpr(pattern = element_pattern, text = x_noniso, perl = TRUE)
nonisotope <- mapply(function(xx, rr) {
    n <- length(rr)
    start <- attr(rr, "capture.start")
    end <- start + attr(rr, "capture.length") - 1L
    sbstr <- substring(xx, start, end)
    ## set elements without a number in the formula to one
    sbstr[!nchar(sbstr)] <- 1L
    sl <- seq_len(n)
    nm <- sbstr[sl]
    setNames(as.integer(sbstr[n + sl]), nm)
}, xx = x_noniso, rr = rx, SIMPLIFY = FALSE, USE.NAMES = TRUE)
# combine isotope and non-isotope ----------------------------------------------
I think there is a more elegant way of doing it, but it works. We just need to combine the two lists at the end.
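The missing combine step could look roughly like the sketch below. It assumes both lists hold one named integer vector per formula, in the same order; the two input lists are written out by hand here so the example runs on its own, and `combined` is just an illustrative name.

```r
## Hand-written stand-ins for the two lists built above
## (one named integer vector per input formula, same order)
isotope <- list(
    "[13C3]C3[2H2]H10O6" = c("13C" = 3L, "2H" = 2L),
    "C6H12O6" = integer()
)
nonisotope <- list(
    "[13C3]C3[2H2]H10O6" = c(C = 3L, H = 10L, O = 6L),
    "C6H12O6" = c(C = 6L, H = 12L, O = 6L)
)

## Concatenate the per-formula vectors pairwise; the list names are
## taken from the first argument
combined <- Map(c, isotope, nonisotope)
combined
```

Note that this lists all isotopes before the unlabelled elements, rather than keeping the order in which they appear in the formula.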
Thanks Michael! If possible it would however be better (and faster) if all could be done in the same loop. Maybe using e.g. "\[13C" as a pattern?
That might work. Just for the final list, I think it is cleaner to have 13C instead of [13C]. The brackets are only required in the final formula.
Isn't prepending [0-9]* enough?
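As a rough illustration of keeping bracket-free names in the counts and adding the brackets only when a formula string is written out, consider the sketch below. `formatFormula` is a hypothetical helper made up for this example; it assumes that any name starting with a digit is an isotope.

```r
## Hypothetical helper: turn a named count vector back into a formula
## string, wrapping isotopes (names starting with a mass number) in brackets
formatFormula <- function(counts) {
    nms <- names(counts)
    is_iso <- grepl("^[0-9]+", nms)
    ## omit a count of 1, as is usual in chemical formulas
    n <- ifelse(counts == 1L, "", as.character(counts))
    parts <- paste0(nms, n)
    parts[is_iso] <- paste0("[", parts[is_iso], "]")
    paste0(parts, collapse = "")
}

formatFormula(c("13C" = 3L, C = 3L, "2H" = 2L, H = 10L, O = 6L))
```

For the counts above this gives back the bracketed form `[13C3]C3[2H2]H10O6`.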
countElements <- function(x) {
    ## regex pattern to isolate all elements
    element_pattern <- paste0(
        "(?<Element>",
        ## optional leading mass number, e.g. the 13 in 13C
        paste0("[0-9]*", c(
            "[A][cglmrstu]|",
            "[B][aehikr]?|",
            "[C][adeflmnorsu]?|",
            "[D][bsy]|",
            "[E][rsu]|",
            "[F][elmr]?|",
            "[G][ade]|",
            "[H][efgos]?|",
            "[I][nr]?|",
            "[K][r]?|",
            "[L][airuv]|",
            "[M][cdgnot]|",
            "[N][abdehiop]?|",
            "[O][gs]?|",
            "[P][abdmortu]?|",
            "[R][abefghnu]|",
            "[S][bcegimnr]?|",
            "[T][abcehilms]|",
            ## one entry per alternative so that each gets the prefix
            "[U]|",
            "[V]|",
            "[W]|",
            "[X][e]|",
            "[Y][b]?|",
            "[Z][nr]"),
            collapse = ""
        ),
        ")",
        "(?<Number>[0-9]*)"
    )
    rx <- gregexpr(pattern = element_pattern, text = x, perl = TRUE)
    mapply(function(xx, rr) {
        n <- length(rr)
        start <- attr(rr, "capture.start")
        end <- start + attr(rr, "capture.length") - 1L
        sbstr <- substring(xx, start, end)
        ## set elements without a number in the formula to one
        sbstr[!nchar(sbstr)] <- 1L
        sl <- seq_len(n)
        nm <- sbstr[sl]
        setNames(as.integer(sbstr[n + sl]), nm)
    }, xx = x, rr = rx, SIMPLIFY = FALSE, USE.NAMES = TRUE)
}
x <- c("[13C3]C3[2H2]H10O6", "C6H12O6")
countElements(x)
#$`[13C3]C3[2H2]H10O6`
#13C C 2H H O
# 3 3 2 10 6
#
#$C6H12O6
# C H O
# 6 12 6
Great @sgibb ! The only thing is that we somehow need to remove (maybe in a second cleanup step?) isotopes that don't make sense (e.g. "[45C3]C3"). Sort of dropping counts for "45C" and throwing a warning that this element/isotope is not supported.
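Such a cleanup step could be sketched as follows. Both the helper `dropUnsupportedIsotopes` and the hard-coded `supported` vector are assumptions made for this example (the vector simply repeats the isotopes proposed earlier in the thread); names starting with a digit are treated as isotopes.

```r
## Isotopes proposed for a first version (hard-coded for this sketch)
supported <- c("13C", "15N", "2H", "18O", "34S")

## Hypothetical cleanup helper: drop counts for isotopes that are not
## in `supported` and warn about them
dropUnsupportedIsotopes <- function(counts, supported) {
    nms <- names(counts)
    ## names starting with a digit are treated as isotopes
    is_iso <- grepl("^[0-9]", nms)
    bad <- is_iso & !(nms %in% supported)
    if (any(bad))
        warning("Dropping unsupported isotope(s): ",
                paste(unique(nms[bad]), collapse = ", "))
    counts[!bad]
}

## "45C" is dropped with a warning, "C" is kept
cleaned <- dropUnsupportedIsotopes(c("45C" = 3L, C = 3L), supported)
```

Applied to the output of `countElements`, this would be called once per formula, e.g. via `lapply`.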
Could we check against the entries in /inst/isotopes/isotope_definition.txt?
Sure. That would be ideal.
Are you then doing a PR @sgibb or should @andreavicini work on that?
I will do.
Hello all,
I would like to be able to formally use isotopes within the RforMassSpectrometry environment.
https://iupac.qmul.ac.uk/sectionH/H2.html#2.2
IUPAC has some definitions for this, and it would seem that C6H12O6 with three 13C labels would be written as [13C3]C3H12O6. Would adding [13C] as an element be feasible? Likewise [2H]; those are the most frequently encountered. I could certainly input the accurate mass on my own when needed, but I always prefer a systematic solution if possible.