rickhelmus / patRoon

Workflow solutions for mass-spectrometry based non-target analysis.
https://rickhelmus.github.io/patRoon/
GNU General Public License v3.0

generateFormulas "Object 'PLID' not found" with SIRIUS and GenForm #87

Closed drewszabo closed 11 months ago

drewszabo commented 11 months ago

Hey there Rick,

I was having trouble generating formulas with SIRIUS and GenForm. I keep receiving one of two errors that seem to be related:

Error in .checkTypos(e, names_x) : 
  Object 'PLID' not found amongst mz, intensity, precursor
In addition: Warning message:
In `[.data.table`(fi, , `:=`(PLID, sapply(mz, function(x) spec[which.min(abs(x -  :
  Column 'PLID' does not exist to remove

OR

Error in .checkTypos(e, names_x) : 
  Object 'ID' not found. Perhaps you intended intensity

I have traced the error to filtering the mslists object with non-default isolatePrec rules. You can replicate the issue with the following code:

avgMSListParams <- getDefAvgPListParams(clusterMzWindow = 0.005)
precRules <- getDefIsolatePrecParams(maxIsotopes = 4)
mslists <-
  generateMSPeakLists(
    fGroups,
    "mzr",
    maxMSRtWindow = 5,
    precursorMzWindow = 4,
    avgFeatParams = avgMSListParams,
    avgFGroupParams = avgMSListParams
  )

mslists <- patRoon::filter(
  mslists,
  absMSIntThr = 1000,
  relMSMSIntThr = 0.01,
  absMSMSIntThr = 60,
  withMSMS = TRUE,
  minMSMSPeaks = 1,
  retainPrecursorMSMS = TRUE,
  isolatePrec = precRules,
  reAverage = TRUE
)

formulas <- generateFormulas(fGroups, mslists, "sirius", relMzDev = 5, adduct = "[M+H]+", elements = "CHNOPSClFBr")

Specifically, the issue only seems to appear when using the isolatePrec argument. If that is removed, the generateFormulas function works. In my testing, this does not seem to affect the generateCompounds function for SIRIUS or MetFrag.

I have used this to help clean up the .ms files exported to SIRIUS and (maybe?) improve the performance. It certainly reduces the size of the mslists object when performing NTS with 1000s of features.

Cheers,

Drew

drewszabo commented 11 months ago

Just on the point about using the isolatePrec rule. It's hard to evaluate the exact MS1 isotopic abundance if the entire spectrum is included. Here is an example of the MS1 from simazine - I should be able to see the M+2 isotope fingerprint of Cl, but it's difficult to distinguish in this spectrum.

[Image: MS1 spectrum of simazine (spec-MS-c27cc6c2f8ac1b4d)]

You see the same thing if you manually add the .ms files to SIRIUS GUI. I wonder if this makes it more difficult for the DNN algorithm that SIRIUS uses to properly evaluate the correct formula with so much noise?
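To make the chlorine M+2 fingerprint mentioned above concrete, here is a minimal base R sketch. It assumes simazine is C7H12ClN5 measured as [M+H]+ and uses standard monoisotopic masses and natural chlorine abundances. The toy spectrum, its intensities, and the simple m/z window are made up for illustration; the window is a simplification and not patRoon's actual isolatePrec algorithm.

```r
# Monoisotopic masses (u) of the elements in simazine, C7H12ClN5
mono <- c(C = 12.0, H = 1.00782503, N = 14.0030740, Cl = 34.96885268)
counts <- c(C = 7, H = 12, N = 5, Cl = 1)
protonMass <- 1.00727646 # mass of a proton, for [M+H]+

# Expected m/z of the monoisotopic [M+H]+ ion
mzM <- sum(mono[names(counts)] * counts) + protonMass
# 37Cl is heavier than 35Cl by about 1.99705 u
mzM2 <- mzM + (36.96590259 - 34.96885268)
# Natural abundances: 35Cl ~75.76 %, 37Cl ~24.24 %
ratioM2 <- 0.2424 / 0.7576

# Toy MS1 spectrum (made-up intensities) with two unrelated background peaks
spec <- data.frame(
  mz = c(150.0912, mzM, 203.0883, mzM2, 310.2011),
  intensity = c(8e4, 1e5, 9e3, 3.2e4, 6e4)
)

# Simplified "isolation": keep only peaks in a narrow window around the precursor
isolated <- spec[spec$mz >= mzM - 0.5 & spec$mz <= mzM + 4.5, ]

# Observed M+2 / M intensity ratio in the isolated spectrum
obsRatio <- isolated$intensity[which.min(abs(isolated$mz - mzM2))] /
  isolated$intensity[which.min(abs(isolated$mz - mzM))]
```

With the background peaks gone, obsRatio can be compared directly against ratioM2 (roughly one third for a single Cl), which is the comparison that is hard to make by eye in a crowded spectrum.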

rickhelmus commented 11 months ago

Hi Drew,

Many thanks for the bug report and other feedback! :-)

The bug was actually related to reAverage=TRUE. This would remove peak IDs, which is why you got the errors during formula annotation. There was also another issue: this argument was ignored when checking whether cached data was available, which led to strange situations like the one you saw, where the precursor isolation seemed to be the culprit. I just pushed some fixes. Hopefully all should be fine now.
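The failure mode described above (peak lists losing their ID column) can be checked for before calling generateFormulas. The following is a minimal base R sketch, assuming each peak list is a table with mz, intensity and precursor columns plus an ID column when intact, as the error messages suggest. The helper name hasPeakIDs and the toy tables are hypothetical and not part of patRoon.

```r
# Hypothetical helper: TRUE if a peak list table still carries the peak ID column
hasPeakIDs <- function(peakList, idCol = "ID") {
  idCol %in% names(peakList)
}

# Toy peak list with IDs (made-up values)
intact <- data.frame(
  ID = 1:3,
  mz = c(202.0854, 203.0883, 204.0825),
  intensity = c(1e5, 9e3, 3.2e4),
  precursor = c(TRUE, FALSE, FALSE)
)

# The same peak list after the ID column has been dropped,
# mimicking what the bug did to re-averaged peak lists
stripped <- intact[, c("mz", "intensity", "precursor")]

hasPeakIDs(intact)   # ID column present
hasPeakIDs(stripped) # ID column missing, formula annotation would fail
```

On an installation without the fix, setting reAverage = FALSE in the filter call looks like a plausible workaround given this diagnosis, though that is an untested assumption.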

Quite interesting that you see that the isolation of precursors may also be useful for SIRIUS! I thought that SIRIUS had its own filtering, that's why I only recommend it currently for GenForm. I am curious if you see any differences in the isoScores? If you get consistently better results it might be good to also make it default for SIRIUS.

drewszabo commented 11 months ago

Fantastic. I'll run some tests with SIRIUS with and without the precursor isolation and get back to you with the isoScore results. You might be right that SIRIUS performs its own filtering within the algorithm and does not include the other peaks in the scoring. When I manually submit the .ms files to the GUI, it does show the entire spectrum.

drewszabo commented 11 months ago

Hey Rick,

Here are some of the results from my testing. I ran SIRIUS back-to-back to try and eliminate inconsistencies with hitting their server. The biggest difference is the total run time. With fewer MS1 precursors, SIRIUS takes a fraction of the time to complete. I imagine this will scale up enormously with more features and peaks in the mslists. There are also negligible effects on the predictions. In fact, SIRIUS only correctly annotated Emamectin B1a (m/z = 886.5317) when the precursor isotopologue was isolated. Otherwise the SIRIUS results were exactly the same for the 14 features (with MSMS data), including the isoScores, which remained unchanged with and without isolatePrec. SIRIUS must be performing its own precursor filtering, but at a huge cost in compute time.

Elapsed time

| Setting | Features | Peaks | formulasSIRIUS | compoundsSIRIUS | Top1 comp annotation |
|---|---|---|---|---|---|
| With isolatePrec | 42 | 3495 | 31 sec | 39.49 sec | 11/14 |
| Without isolatePrec | 42 | 70034 | 328.65 sec | 781.65 sec | 10/14 |

It's not a definitive experiment by any means, but I will probably be using the isolatePrec rule moving forward for my own analysis. The time it saves me is a huge advantage, especially without impacting the annotation performance.
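The size of the effect can be read off the figures reported above with a few lines of base R. This sketch only re-uses the peak counts and elapsed times from this comment; the column names are made up for illustration.

```r
# Peak counts and elapsed times (seconds) as reported in this comment
timings <- data.frame(
  setting = c("with isolatePrec", "without isolatePrec"),
  peaks = c(3495, 70034),
  formulasSec = c(31, 328.65),
  compoundsSec = c(39.49, 781.65)
)

# Ratio of "without" to "with" for each numeric column
ratios <- unlist(timings[2, -1]) / unlist(timings[1, -1])
round(ratios, 1)
# peaks ~20.0, formulasSec ~10.6, compoundsSec ~19.8
```

In this single run, roughly 20 times fewer peaks coincided with formula generation about 10 times faster and compound annotation about 20 times faster.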

rickhelmus commented 11 months ago

Wow, that's awesome! Thanks for the tests! Makes me wonder if this filtering step should be done in default workflows... Something to think about ... :-)