rformassspectrometry / CompoundDb

Creating and using (chemical) compound databases
https://rformassspectrometry.github.io/CompoundDb/index.html

Add fields adduct and msLevel to .extract_spectra_mona_sdf() #72

Closed jmbadia closed 3 years ago

jmbadia commented 3 years ago

MoNA provides the adduct (field PRECURSOR TYPE) and the MS level (field SPECTRUM TYPE) for MS2 spectra (issue #30). I think it would be a good idea to add these fields to .extract_spectra_mona_sdf(). I'll do it if that is OK with you.

Please also note that smiles and splash don't have a dedicated field in the sdf file, but they appear regularly (in >99.88% of the entries of the negative MS/MS sdf file) in the COMMENT field, along with other variables, separated by a "__" character. It seems that the algorithm that converts the data to the sdf format deliberately puts all the available variables in the COMMENT field. Maybe it would be a good idea to parse smiles and splash from there.

Sample of the COMMENT field: "SMILES=c1c(cc(c(c1Cl)n2c(c(c(n2)C#N)S(=O)C(F)(F)F)N)Cl)C(F)(F)F cas=120068-37-3 chebi=83394 kegg=C11099 pubchem cid=3352 chemspider=3235 InChI=InChI=1S/C12H4Cl2F6N4OS/c13-5-1-4(11(15,16)17)2-6(14)8(5)24-10(22)9(7(3-21)23-24)26(25)12(18,19)20/h1-2H,22H2 __ computed SMILE ....."
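To illustrate the idea of pulling individual variables out of the COMMENT field, here is a minimal sketch in Python (not the package's R code). The helper `extract_comment_fields` is hypothetical, and it assumes that each variable appears as `key=value` with the value running up to the next whitespace. The sample above does not show a splash entry, so the `splash=` key name is an assumption as well.

```python
import re

# Hypothetical helper for illustration only; it is not part of CompoundDb.
def extract_comment_fields(comment, keys=("SMILES", "splash")):
    """Return the first `key=value` match for each key, or None if absent."""
    result = {}
    for key in keys:
        # Assumption: a value ends at the next whitespace character.
        # The lookbehind avoids matching a key that is the tail of a longer word.
        pattern = r"(?<![A-Za-z])" + re.escape(key) + r"=(\S+)"
        match = re.search(pattern, comment)
        result[key] = match.group(1) if match else None
    return result

# Shortened version of the COMMENT sample quoted above.
comment = (
    "SMILES=c1c(cc(c(c1Cl)n2c(c(c(n2)C#N)S(=O)C(F)(F)F)N)Cl)C(F)(F)F "
    "cas=120068-37-3 chebi=83394 kegg=C11099 pubchem cid=3352 chemspider=3235"
)
fields = extract_comment_fields(comment)
print(fields["SMILES"])
print(fields["splash"])  # None here, because this excerpt has no splash entry
```

Returning `None` for a missing key keeps the rare entries without smiles or splash from breaking the import.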

jorainer commented 3 years ago

Hi @jmbadia (sorry for the late reply), yes, it would be great if you could add this info and open a pull request.

thanks!

jmbadia commented 3 years ago

great !

jmbadia commented 3 years ago

Coming back to this issue... I cannot simply parse the PRECURSOR TYPE and SPECTRUM TYPE fields as adduct and msLevel. These terms are only equivalent for MSn spectra. The best option here is to keep the original field names and document that the user can rename those columns to adduct and msLevel, respectively. What do you think, @jorainer?

jorainer commented 3 years ago

Sounds good to me. I did something similar for parsing/accessing the MassBank database. The precursor m/z is not always numeric in MassBank; sometimes it is a character string or has multiple values. I'm thus storing the precursor m/z in a field called "precursor_mz_text" exactly as it is retrieved from MassBank (as a text string) and, in addition, converting it to a numeric with as.numeric and using that as precursorMz. It will be correct for those entries that have a numeric precursor m/z but will be NA for all others (in which case the user can still get the original information from the $precursor_mz_text spectra variable).
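The pattern described here (keep the raw text, add a best-effort numeric copy) can be sketched as follows. This is a Python analogue for illustration, not the actual R implementation: `to_numeric_or_nan` is a hypothetical stand-in for `as.numeric`, NaN plays the role of R's `NA`, and the three input strings are invented examples.

```python
import math

# Hypothetical stand-in for R's as.numeric(): unparseable input becomes NaN.
def to_numeric_or_nan(text):
    try:
        return float(text.strip())
    except (ValueError, AttributeError):
        return math.nan

# Invented example values: one clean number, one multi-value string, one label.
raw_values = ["225.06", "225.06/227.05", "unknown"]

records = [
    {
        "precursor_mz_text": text,               # original string, kept untouched
        "precursorMz": to_numeric_or_nan(text),  # numeric copy, NaN if not parseable
    }
    for text in raw_values
]

for record in records:
    print(record)
```

Because the original string is always kept, a user who finds a missing `precursorMz` can still inspect `precursor_mz_text` and decide how to handle that entry.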

jmbadia commented 3 years ago

Perfect. I'll replicate your MassBank solution :)

jmbadia commented 3 years ago

link to pull request #81 and issue #80