Workaround for annotateSuspects() using SIRIUS?

drewszabo commented 2 years ago

Hi, I would like to combine compound generation results from SIRIUS, MetFrag and Library (massbank) and annotateSuspects() using default genIDLevelRulesFile. I know annotateSuspects() is only optimised for GenForm/MetFrag formula and compound generation respectively but is there any workaround that I can use in the meantime?

At least one issue seems to be the length of the ion_formula list that is generated by generateCompoundsSIRIUS(). Actually, all of the fragInfo names for each list don't really match with those created by MetFrag or Library.


> formulas <- generateFormulas(fGroupsSusp, mslists, "sirius", relMzDev = 5, adduct = "[M+H]+", elements = "CHNOPSClBrF",
>                              profile = "orbitrap", calculateFeatures = TRUE, splitBatches = TRUE, absAlignMzDev = 0.0002)
> compoundsSIR <- generateCompounds(fGroupsSusp, mslists, "sirius", relMzDev = 5,
>                                      adduct = "[M+H]+", fingerIDDatabase = "pubchem", topMost = 5,
>                                      profile = "orbitrap", elements = "CHONPSFBrCl", splitBatches = TRUE)
> compoundsSIR <- addFormulaScoring(compoundsSIR, formulas, updateScore = FALSE)
> fGroupsSIR <- annotateSuspects(fGroupsSusp, formulas = formulas, compounds = compoundsSIR, MSPeakLists = mslists)

> Error in fifelse(!is.na(ion_formula), ion_formula, annPLForms$ion_formula) : 
>   Length of 'no' is 26 but must be 1 or length of 'test' (27).

After using consensus() on the compound generation results from each algorithm, a different error is generated. I can't seem to find a source of the error in this case.


> compounds <- consensus(compoundsSIR, compoundsMF, compoundsMB)
> fGroupsSusp <- annotateSuspects(fGroupsSusp, formulas = formulas, compounds = compounds, MSPeakLists = mslists)

> Error in if (otherHighest > 0) return("not the highest score") : 
>   missing value where TRUE/FALSE needed

I have verified this issue is reproducible with the patRoonData using mostly default parameters. I'm using R v4.1.3, patRoon v2.1.0, SIRIUS v4.9.15.

I appreciate your help!

rickhelmus commented 2 years ago

Hi @drewszabo ,

Thanks for the reporting the issue.

Unfortunately(?), I cannot reproduce the first error, not even with the example data from patRoonData. Some questions that came to mind:

Did you make any modifications to the idlevelrules.yml file?
Could it be that compoundsSIR and/or formulas was generated with a different (e.g. outdated) mspeaklists? Perhaps it's best to re-run the generateMSPeakLists()/generateFormulas()/generateCompounds() calls after wiping the cache (clearCache("all")).
If that doesn't work, could you share the exact script you used with the example data?

I just pushed some changes that hopefully fix the second issue.

Thanks, Rick

drewszabo commented 2 years ago

Hey @rickhelmus

Thanks for the feedback. I was able to troubleshoot this and narrow the issue down to the screenSuspects() for specific compounds. I presume this is an error in how the function is calculating the isotopic abundance from SMILES information? Perhaps an OpenBabel intergration issue? The feature in question is "M254_R293_3478" from the patRoonData. When suspect screening is used with the compound "Sulfamethoxazole", SMILES = "CC1=CC(=NO1)NS(=O)(=O)C2=CC=C(C=C2)N", the ion_formula length is triggered. However, when the InChI for the compound is used InChI = "InChI=1S/C10H11N3O3S/c1-7-6-10(12-16-7)13-17(14,15)9-4-2-8(11)3-5-9/h2-6H,11H2,1H3,(H,12,13)", the error is not triggered.

To reproduce error:

anaInfo <- patRoonData::exampleAnalysisInfo("positive")
fList <- findFeatures(anaInfo, "openms", noiseThrInt = 1000, chromSNR = 3, chromFWHM = 5, minFWHM = 1, maxFWHM = 30)
fGroups <- groupFeatures(fList, "openms", rtalign = TRUE)
fGroups <- patRoon::filter(fGroups, preAbsMinIntensity = 100, absMinIntensity = 1E5, relMinReplicateAbundance = 1,
                  maxReplicateIntRSD = 0.75, blankThreshold = 5, removeBlanks = TRUE,
                  retentionRange = NULL, mzRange = NULL)
fGroups <- fGroups[, 1:82]

#apparent error in SMILES here
suspects <- data.frame(name = c("Sulfamethoxazole"),
                       SMILES = c("CC1=CC(=NO1)NS(=O)(=O)C2=CC=C(C=C2)N"))

fGroupsSusp <- screenSuspects(fGroups, suspects, adduct = "[M+H]+")
avgMSListParams <- getDefAvgPListParams(clusterMzWindow = 0.005)
mslists <- generateMSPeakLists(fGroupsSusp, "mzr", maxMSRtWindow = 5, precursorMzWindow = 4,
                               avgFeatParams = avgMSListParams,
                               avgFGroupParams = avgMSListParams)
mslists <- patRoon::filter(mslists, relMSMSIntThr = 0.02, topMSMSPeaks = 10)
formulas <- generateFormulas(fGroupsSusp, mslists, "sirius", relMzDev = 5, adduct = "[M+H]+", elements = "CHNOPSClBrF",
                             calculateFeatures = TRUE, splitBatches = TRUE)
compoundsSIR <- generateCompounds(fGroupsSusp, mslists, "sirius", relMzDev = 5,
                                  adduct = "[M+H]+", fingerIDDatabase = "pubchem", topMost = 5,
                                  elements = "CHONPSFBrCl", splitBatches = TRUE)
compoundsSIR <- addFormulaScoring(compoundsSIR, formulas, updateScore = TRUE)
fGroupsSIR <- annotateSuspects(fGroupsSusp, formulas = formulas, compounds = compoundsSIR, MSPeakLists = mslists)

Solution using InChI instead of SMILES:

anaInfo <- patRoonData::exampleAnalysisInfo("positive")
fList <- findFeatures(anaInfo, "openms", noiseThrInt = 1000, chromSNR = 3, chromFWHM = 5, minFWHM = 1, maxFWHM = 30)
fGroups <- groupFeatures(fList, "openms", rtalign = TRUE)
fGroups <- patRoon::filter(fGroups, preAbsMinIntensity = 100, absMinIntensity = 1E5, relMinReplicateAbundance = 1,
                  maxReplicateIntRSD = 0.75, blankThreshold = 5, removeBlanks = TRUE,
                  retentionRange = NULL, mzRange = NULL)
fGroups <- fGroups[, 1:82]

#working solution using InChI
suspects <- data.frame(name = c("Sulfamethoxazole"),
                       InChI= c("InChI=1S/C10H11N3O3S/c1-7-6-10(12-16-7)13-17(14,15)9-4-2-8(11)3-5-9/h2-6H,11H2,1H3,(H,12,13)"))

fGroupsSusp <- screenSuspects(fGroups, suspects, adduct = "[M+H]+")
avgMSListParams <- getDefAvgPListParams(clusterMzWindow = 0.005)
mslists <- generateMSPeakLists(fGroupsSusp, "mzr", maxMSRtWindow = 5, precursorMzWindow = 4,
                               avgFeatParams = avgMSListParams,
                               avgFGroupParams = avgMSListParams)
mslists <- patRoon::filter(mslists, relMSMSIntThr = 0.02, topMSMSPeaks = 10)
formulas <- generateFormulas(fGroupsSusp, mslists, "sirius", relMzDev = 5, adduct = "[M+H]+", elements = "CHNOPSClBrF",
                             calculateFeatures = TRUE, splitBatches = TRUE)
compoundsSIR <- generateCompounds(fGroupsSusp, mslists, "sirius", relMzDev = 5,
                                  adduct = "[M+H]+", fingerIDDatabase = "pubchem", topMost = 5,
                                  elements = "CHONPSFBrCl", splitBatches = TRUE)
compoundsSIR <- addFormulaScoring(compoundsSIR, formulas, updateScore = TRUE)
fGroupsSIR <- annotateSuspects(fGroupsSusp, formulas = formulas, compounds = compoundsSIR, MSPeakLists = mslists)

I am trying to run screenSuspects() with the entire NORMAN Merged Suspect List (SusDat S0). I have tried to clean the SMILES, InChI and InChIKey variables to allow it to work with patRoon, however I have yet to find a good solution for this.

schymane commented 2 years ago

It's strange that the SMILES causes an error, it depicts fine in CDK.

The entire merged SusDat is an extremely big list to use as a suspect list (it's approx 100K entries) - it is perhaps more suitable as a database rather than a suspect list (too many entries have the same masses). But in general if you wish to "clean up" suspect lists to be compatible with patRoon, Bego wrote an RMarkdown document explaining how she did this for some of her suspect lists: https://gitlab.lcsb.uni.lu/begona.talavera/pd-cheminformatics-pipeline/-/blob/main/PD_chem_curation/PD_chem_curation.Rmd

If you take sets / subsets of the NORMAN lists from the NORMAN-SLE browser on the PubChem website, you can send them directly to ID Exchange using their search functionality (and perform part of the operations Bego described in that RMarkdown).

https://pubchem.ncbi.nlm.nih.gov/classification/#hid=101

rickhelmus commented 2 years ago

Hi @drewszabo ,

Excellent, many thanks for the examples and bug hunting. I just pushed two more fixes, which should hopefully fix the issues you are seeing.

Some background: The ion_formula error came from the fact that SIRIUS adds zero intensity precursor peaks to the annotation results when the precursor peak is absent in the input MS/MS peak list. As a result, the 'additional' peak in the annotation data (fragInfo) caused errors when it was related to the corresponding MS/MS peak list. Note that the SIRIUS formula code already removed these zero intensity peaks, hence, only the compound annotations messed things up.

The reason why it 'worked' with InChI data and missing SMILES in the suspect list was even less obvious: there was a small error in the code that prevented missing formula data to be calculated from InChIs, which in turn meant no formula annotations could be matched, which in turn didn't trigger the above bug ... ;-)

I personally never tried such a big suspect list. Hopefully with these fixes and the info @schymane shared (thanks!!) it will work, and I would be curious how it performs.

PS: you still need to spend some time tweaking the idlevelrules.yml file for SIRIUS scores. If you find some good rules I would be quite happy to include them in patRoon!

Thanks, Rick

drewszabo commented 2 years ago

Hey @rickhelmus

I'm trying to update the idlevelrules.yml to include SIRIUS scores. I'm having a bit of trouble identifying variables that are available in the YAML format. For example, I would like to set ID levels based on the number of fragments explained by the generateCompoundsSIRIUS results. I'd appreciate if you could head me in the right direction, here are a few examples of what I have tried.

YAML file:

2b:
    minExplainedPeaks:
        min: 3
        type: compound

2b:
    explainedPeaks:
        min: 3
        type: compound

I have tried a few variables in the class compound > group features container without any luck. I have this error returning in each case:

Error in checkLevelOK(IDLevelRules[[lvl]]) : 
  Unknown ID level type: minExplainedPeaks

And thanks @schymane ! I'll look into this markdown, very interesting. Once I cleaned up the SusDat S0 list, I have roughly 52,600 compounds that seem to have been loaded into patRoon ok with InChI (I haven't tried Rick's recent patch for SMILES yet). As you can see from above, I'd really like to start annotating suspects using the number of fragments matched with either SIRIUS or MetFrag to help with the confidence levels - despite the number of exact masses that are returned.

rickhelmus commented 2 years ago

Hi @drewszabo

At the moment, explainedPeaks is not part of the possible types... but this sounds like a nice addition to the rule system.

However, the annMSMSSim type that the default ID rules use has similar purposes, and expresses the match between measured and predicted fragments (1.0 = 100% match). Another approach might be to filter the compounds prior to suspect annotation, for instance, by removing all candidates with <2 explained peaks:

compounds <- filter(compounds, minExplainedPeaks = 2)

If the suspect compound has <2 explained peaks it is removed from the compound candidates, hence, the suspect ID level can then only be level 4/5.

drewszabo commented 2 years ago

That sounds like a reasonable solution! Ill close this issue now.

Thanks!

rickhelmus / patRoon

Workaround for annotateSuspects() using SIRIUS? #54