Support for feature extraction

TheLostLambda commented 11 months ago

Hello again!

Another thing I've frequently struggled to find in the existing mass-spec toolset is support for standalone feature extraction. In my head, this would simply pull out all of the processing that Sage does before the actual ion searching and dump that processed information as a peak list with data somewhat similar to MaxQuant's allPeptides.txt.

While it seems like MaxQuant has a million and ten tricks up its sleeve, that's been part of the problem when I'm just looking for a simple list of:

Scan number
Masses (not sure if Sage calculates a mean and standard deviation of this? From the centroiding, but also across retention times?)
Charges (predicted from the deisotoping)
Abundances (Ion counts + Intensities, I think this is the job of LFQ?)
Retention Times (+ XIC start and end)
I suppose eventually I'd want to investigate something more for MS/MS (parent scan number, etc)

I'll admit I'm still learning a lot of this myself, and I found most of the information about "feature finding" from this video: https://www.youtube.com/watch?v=H_vClGghnNo

Even if some of the "fancier stuff" (DIA, etc) preprocessing is a bit out of scope for Sage, having a quick, embeddable (as a library) tool for converting mzML to deisotoped peak-lists would be outstanding!

Let me know what you think! Happy to help with this eventually too :)

lazear commented 11 months ago

Sage actually doesn't perform feature extraction a la MaxQuant. We take an orthogonal approach called DICE (direct ion current extraction - see FlashLFQ, IceR, IonQuant, DeMixQ, Skyline for other software that takes this approach). Feature extraction first tries to identify features (or peaks that look like peptides) from MS1 data, and then link those features to identified MS2 spectra.

DICE, on the other hand, just "blindly" extracts MS1 information for every identified MS2 spectrum. If you think of the MS1 feature space as a Cartesian grid or heatmap (retention time vs m/z on the axes, and intensities filling each grid cell), DICE equates to extracting MS1 intensities from strips of the grid: essentially overlaying rectangles (corresponding to peptide RT +/- some tolerance and peptide m/z +/- some tolerance, with multiple charges and isotopes considered) and then integrating the rectangles. This works pretty well in practice, and is very fast too (and also allows "feature"-specific chromatographic alignment). Sage additionally extracts decoy DICE chromatograms to perform target-decoy competition.
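To make the "rectangles on a grid" picture concrete, here is a minimal Python sketch. It is not Sage's implementation: `Ms1Scan`, `integrate_rectangles`, and the default tolerances are hypothetical names and values chosen for illustration, and "integration" is simplified to a plain sum of intensities.

```python
from dataclasses import dataclass


@dataclass
class Ms1Scan:
    """Hypothetical container for one MS1 scan."""
    rt: float     # retention time, in minutes
    peaks: list   # list of (m/z, intensity) tuples


def integrate_rectangles(scans, target_mzs, rt, ppm=10.0, rt_tol=0.5):
    """Sum MS1 intensity inside one rectangle per target m/z.

    Each rectangle spans rt +/- rt_tol on the RT axis and
    target m/z +/- ppm on the m/z axis. The tolerances are
    illustrative assumptions, not Sage's defaults.
    """
    total = 0.0
    for scan in scans:
        # Skip scans outside the RT window of the rectangle.
        if abs(scan.rt - rt) > rt_tol:
            continue
        for peak_mz, intensity in scan.peaks:
            for target in target_mzs:
                if abs(peak_mz - target) <= target * ppm * 1e-6:
                    total += intensity
    return total
```

Note that nothing here tries to decide whether a peak "looks like a peptide": the identified MS2 spectrum supplies the RT and m/z, and the MS1 data is only read inside those windows.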

As such, MS1 ions are basically not processed at all (because they don't need to be). You can of course do this yourself via the provided deisotoping code.

Hooking into the LFQ values is straightforward enough, if you want RT/abundances/etc.

A graphical overview of the algorithm is below:

lazear commented 11 months ago

If you're looking for standalone feature extraction, maybe check out OpenMS's FeatureFinder.

TheLostLambda commented 11 months ago

Thanks for the clarification (and the more traditional feature-finder link)!

I'm curious to learn more about DICE! Does the figure you shared come from a paper with some more detail?

The biggest thing I'm wondering about from your explanation is what happens to MS1 ions that aren't fragmented (no MS2 data is available). A significant number of the structures we're looking at with our (what-sometimes-feels-like-dodgy) DDA instrument end up identified (initially) via MS1 only, with targeted fragmentation done to confirm those structures down the line.

It sounds as if it's possible to deisotope those MS1 ions using some of the sage library code, but would that not happen under normal operation? Are matches output for MS1-only structures?

Finally (and hopefully I can find a place to read more and not just keep pestering you), how does DICE play with DIA? Is it a big stretch to apply it there? As a disclaimer, I've read a bit about processing DIA data, but have never actually collected or dealt with it myself!

Thanks again for all of the enlightening information and patience!

lazear commented 11 months ago

That figure is one I made ;)

If the MS1 ions are never fragmented, then they will not be identified or quantified. Sage performs match-between-runs (as shown in the figure), which allows propagation of identifications across runs. If your MS1 ion was fragmented and confidently IDed in run 1, but not in run 2 or 3, an attempt will be made to quantify all 3 runs. You could extend this to unfragmented peptides if you have accurate RT info (e.g. via prediction, spectral library, previous runs), but this is somewhat out of scope for Sage.
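The propagation idea can be sketched roughly as follows. This is an illustration only, under the assumption that retention times are already aligned across runs; `match_between_runs` and the `extract` callback are hypothetical names, not part of Sage's API.

```python
from statistics import mean


def match_between_runs(identifications, extract):
    """Quantify every identified peptide in every run.

    identifications: dict mapping run name -> {peptide: identified RT}
    extract: callable (run, peptide, rt) -> intensity, standing in for
             whatever MS1 extraction routine is used (an assumption).
    """
    # Collect the RTs at which each peptide was identified in any run.
    observed_rts = {}
    for run_ids in identifications.values():
        for peptide, rt in run_ids.items():
            observed_rts.setdefault(peptide, []).append(rt)

    table = {}
    for peptide, rts in observed_rts.items():
        # Use the mean identified RT as the extraction target everywhere,
        # including runs where the peptide was never fragmented.
        target_rt = mean(rts)
        table[peptide] = {
            run: extract(run, peptide, target_rt) for run in identifications
        }
    return table
```

A peptide identified in only one of three runs still gets three quantification attempts, which is the behaviour described above.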

Under normal operations, only MS2 ions are deisotoped (e.g. for high-res MS2). DICE handles isotopes at the MS1 level by explicitly trying to extract them, so there is no need to try and deisotope them in advance.
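As a rough illustration of what "explicitly trying to extract" isotopes means, the expected m/z of each isotope peak can be computed directly from the peptide's neutral mass, so no deisotoping pass is needed beforehand. The function name, default charges, and isotope count below are assumptions made for this sketch, not Sage's settings.

```python
PROTON = 1.007276           # proton mass in Da
ISOTOPE_SPACING = 1.003355  # approximate 13C - 12C mass difference in Da


def isotope_mz_targets(neutral_mass, charges=(2, 3), n_isotopes=3):
    """Return (charge, isotope index, m/z) for every peak to extract.

    m/z = (M + k * isotope spacing + z * proton) / z
    """
    targets = []
    for z in charges:
        for k in range(n_isotopes):
            mz = (neutral_mass + k * ISOTOPE_SPACING + z * PROTON) / z
            targets.append((z, k, mz))
    return targets
```

Each returned m/z would then become the centre of one extraction rectangle at the peptide's retention time.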

I'm not really a DIA expert myself, but my understanding is that DICE is pretty similar to how DIA quantification is performed from a spectral library (lay rectangles matching fragment ion m/z and RTs onto the grid, and integrate). AFAIK, Skyline uses a form of DICE for both MS1 and MS2 quantification.

TheLostLambda commented 6 months ago

Hi @lazear ! Not opening a new issue, because it doesn't really warrant it (you've given me a solid starting point here!), but I just wanted to congratulate you on publishing the SAGE paper and let you know that I'm now working full-time on my peptidoglycan-focused MS tool for the next four months!

I'm excited to integrate SAGE's searching capabilities into my processing pipeline and will hopefully get the chance to contribute some back to SAGE itself!

To get an idea of scope, would there be interest in adding a mode to SAGE that will report (presumably separately from the actual MS2-confirmed and scored data) MS1 ions with masses that match those found in a supplied mass library? When dealing with peptidoglycan, the molecules often have unique enough masses that MS2 isn't essential, so it's nice to still see them and do targeted fragmentation to confirm them down the line!

Whether that's in scope or out, I'm sure SAGE's library code will be an incredible place for me to start!

Look forward to hopefully collaborating a bit more in the coming months! Oh, and happy New Year!

lazear commented 6 months ago

That sounds a bit out of scope for Sage, if I understand correctly: you are looking to supply a list of masses and output all MS1 ions that match? You could certainly build a quick Rust application utilizing Sage's internals to do so, but I don't think it makes sense (at this time) to add such functionality to the main binary.

TheLostLambda commented 6 months ago

Yeah, outputting the MS1 ions (deisotoped and converted to monoisotopic masses) that seem to match, so I suppose doing a sort of very light "feature extraction" on any MS1 matches. But yeah, the more I think about that, the more I think I agree that it's best developed separately by pulling in some SAGE library code :)

Thanks again and I'll keep you updated with how things go!
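For reference, the mass-library matching discussed above could look something like the Python sketch below. It assumes the MS1 ions have already been deisotoped and assigned a charge by some upstream step; `match_mass_library`, the tuple layout, and the 10 ppm default are all hypothetical choices for illustration and do not reflect any existing Sage functionality.

```python
from bisect import bisect_left

PROTON = 1.007276  # proton mass in Da


def neutral_mass(mz, charge):
    """Convert a monoisotopic m/z and charge to a neutral mass."""
    return (mz - PROTON) * charge


def match_mass_library(ions, library, ppm=10.0):
    """Match deisotoped MS1 ions against a library of known masses.

    ions: iterable of (scan number, monoisotopic m/z, charge)
    library: dict mapping a structure name -> neutral monoisotopic mass
    Returns (scan, name, observed mass, ppm error) tuples.
    """
    entries = sorted((mass, name) for name, mass in library.items())
    masses = [mass for mass, _ in entries]
    matches = []
    for scan, mz, charge in ions:
        observed = neutral_mass(mz, charge)
        # Tolerance is taken relative to the observed mass for simplicity.
        tol = observed * ppm * 1e-6
        i = bisect_left(masses, observed - tol)
        while i < len(masses) and masses[i] <= observed + tol:
            error_ppm = (observed - masses[i]) / masses[i] * 1e6
            matches.append((scan, entries[i][1], observed, error_ppm))
            i += 1
    return matches
```

Sorting the library once and using a binary search keeps the lookup cheap even when every MS1 ion in a run is checked.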

lazear / sage

Support for feature extraction #81