rickhelmus / patRoon

Workflow solutions for mass-spectrometry based non-target analysis.
https://rickhelmus.github.io/patRoon/
GNU General Public License v3.0

isolatePrec argument filtering MSMS peaks #56

Closed drewszabo closed 1 year ago

drewszabo commented 1 year ago

Hey Rick,

I'm trying to reduce the complexity (and file size) of my mslists by using the isolatePrec argument in patRoon::filter(mslists, ...). However, I have found that it actually isolates the precursor in both MS and MSMS peak lists, including the averagedPeakList. I wonder if this was intended, and if there is a way to only filter the MS lists alone, leaving the complete MSMS list for further analysis. Perhaps by using getDefIsolatePrecParams()?

Perhaps you can tell me if having a large MS peak list is adding any compute time to my generateFormulasSIRIUS() or generateCompoundsSIRIUS()? It would be great to reduce my compute time too.

Cheers,

drewszabo commented 1 year ago

On the file size problem: in my project, I have >9000 features in fGroups. The mslists object ends up with almost 8 million elements after filtering and a file size of 1.8 GB. For features with higher m/z, the MS list is enormous, but I only require the precursor and isotopes for analysis. Saving and loading the mslists object takes a substantial amount of time; I suppose this is a Windows single-threaded file system thing. Anything to help the process would be great.

rickhelmus commented 1 year ago

Hi Drew,

Many thanks for bringing this up. It seems you caught a recent regression, and I just pushed out a fix so that only MS data is filtered again.
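
For reference, a minimal sketch of how the precursor isolation discussed here could be applied, assuming `mslists` is an existing MS peak lists object and patRoon is installed. The wrapper name `isolateMSPrecursors` is hypothetical; `filter()`, its `isolatePrec` argument and `getDefIsolatePrecParams()` are the patRoon functions mentioned in this thread.

```r
# Hypothetical helper: isolate the precursor and its isotopes in the MS peak lists.
# Assumes patRoon is installed and 'mslists' is an MS peak lists object.
isolateMSPrecursors <- function(mslists, params = patRoon::getDefIsolatePrecParams()) {
    # isolatePrec takes a list of parameters; the default list is only
    # evaluated when the function is actually called
    patRoon::filter(mslists, isolatePrec = params)
}

# Usage (needs real data, so not run here):
# mslistsSmall <- isolateMSPrecursors(mslists)
# format(object.size(mslistsSmall), units = "MB")  # compare with the unfiltered object
```

Comparing `object.size()` before and after is one simple way to check how much the filter actually saves for a given data set.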

For the size of peak lists: personally, I always try to prioritize the features as much as possible before going to any of the annotation steps. (You have to be a bit inventive sometimes with this, and it can be quite specific to the type of data and study you are working with.) But with almost 10k feature groups, I can imagine you end up with a large object. There is of course also the possibility to apply other filter steps: usually I apply the topMost filter and perhaps some relative minimum intensity. Did you already apply any of these? You could also think of applying the annotatedBy filter with formula annotation data, which may help a bit with subsequent compound annotation.
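
A rough sketch of what such a combination of filters could look like, assuming `mslists` is the MS peak lists object and `formulas` holds formula annotation results. The helper name `reducePeakLists` and the threshold values are hypothetical placeholders, and mapping the "topMost" and relative intensity filters to the `topMSPeaks`/`topMSMSPeaks` and `relMSIntThr`/`relMSMSIntThr` arguments is an assumption that should be checked against the patRoon reference for your installed version.

```r
# Hypothetical helper combining the peak list filters mentioned above.
# Threshold values are illustrative only and should be tuned per data set.
reducePeakLists <- function(mslists, formulas = NULL, topPeaks = 25, relInt = 0.05) {
    # keep only the most intense peaks and drop peaks below a relative intensity
    out <- patRoon::filter(mslists,
                           topMSPeaks = topPeaks, topMSMSPeaks = topPeaks,
                           relMSIntThr = relInt, relMSMSIntThr = relInt)

    # optionally keep only MS/MS peaks that are annotated by formula candidates
    if (!is.null(formulas))
        out <- patRoon::filter(out, annotatedBy = formulas)

    out
}

# Usage (needs real data, so not run here):
# mslistsSmall <- reducePeakLists(mslists, formulas = formulas)
```

Running the annotation-based step separately keeps it optional, since it needs formula results that may not exist yet at this point in the workflow.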

I am not sure how much time 'rich' MS/MS data will add to SIRIUS, but my feeling is that other steps (e.g. retrieving data from CSI) may take more time.

Thanks, Rick

drewszabo commented 1 year ago

Thanks for the fix.

And yes, I have been experimenting with different filters to reduce the number of features. I am having trouble with noisy MS peaks getting through my initial filters. I am going to try running the extracted features through the MetaClean and NeatMS ML approaches next (https://github.com/bihealth/NeatMS/). NeatMS has the advantage of being pre-trained and has three categories, compared to MetaClean's two categories.

Closing off the issue. Thanks, DS