lazear / sage

Proteomics search & quantification so fast that it feels like magic
https://sage-docs.vercel.app
MIT License

Request for c/z ions detection #55

Closed jamesknight21 closed 1 year ago

jamesknight21 commented 1 year ago

Dear Mike,

Absolutely FANTASTIC resource, and thanks for keeping it open!

Quick questions:

  1. Where would I define the parameter for including c/z fragment ions? Say I ran an EThCD experiment and want to search for both c/z and b/y ions.
  2. Would you be able to provide a template of the config/JSON file that lists the parameter names available to us?

Many thanks, James

lazear commented 1 year ago

Hi James,

  1. At the moment, there is only code for generating b/y fragment ions - I don't think there are any blockers preventing the addition of A/C/X/Z ions. I'm open to the idea of adding support for this, let me identify some benchmark datasets and do some testing.
  2. There is an example configuration file at the very bottom of the README: https://github.com/lazear/sage#example-configuration-file - one of my short-term goals for the project is to improve documentation, so if there is something that is not clear, please let me know!

Best, Mike

jamesknight21 commented 1 year ago

Thanks for the prompt reply!

I tried it a few minutes ago... and it is BLAZING fast! When I ran it, I assumed it crashed (with some random error)... but no... it was executed properly and finished misleadingly fast!

  1. Many thanks for being willing to invest time in implementing ABC/XYZ ions. It would be tremendously useful for mixed or multi-fragmentation acquisitions. I have this chemical modification study, and Sage seems ideal for it because of its speed to rapidly screen through many datasets where we are trying different permutations of a method! With some of our shorter gradients (10-mins etc), we can now make acquisition decisions for the subsequent run within 2-3 minutes of completion!

  2. Yup, I saw that config file; I just wanted to be sure if that was the template we should be using :)

A couple of new questions:

  1. Of course, the search speed is very impressive, and the report (results.sage.tsv) is also intuitive. However, out of curiosity, I was wondering about how you are performing MS1 quant for the peptides. As in aggregation of the intensities of various charge states etc...?

  2. All the parameters in your config.json were intuitive to understand. However, can you please expand on the isotope_errors parameter? I may have misunderstood it.

    "isotope_errors": [   // Optional[Tuple[int, int]] {default=[0,0]}: C13 isotopic envelope to consider for precursor
      -1,                 // Consider -1 C13 isotope
      3                   // Consider up to +3 C13 isotope (-1/0/1/2/3)
    ],

lazear commented 1 year ago

I'm glad you are enjoying it! We also get a lot of use out of being able to rapidly search large datasets and try out permutations.

Hold off on using MS1 quant for production use for another week or so 😄 - I released it as an "experimental" feature, but the quantitative performance is quite poor (relatively low R2, no match-between-runs). I have actually been completely rewriting the MS1 quant module from scratch (https://github.com/lazear/sage/compare/master...alt-lfq) and will be releasing it in the next week or so. The rewritten LFQ module rivals IonStar (and I believe MaxQuant) in terms of quantitative performance (and of course, it is still ludicrously fast).

The current algorithm aggregates intensities from various charge states and isotopologues, and calculates the region in a narrow RT & mass-tolerance window most likely to correspond to the precursor by a kernel density estimate of ion counts scaled by intensity.

The new algorithm still considers various charge states (2-4) and M+0 to M+3 isotopes within a narrow RT and mass-tolerance window (direct ion current extraction, a la IonStar, FlashLFQ, Skyline). I have added a hybrid retention time alignment algorithm (global regression plus local, correlation-optimized run offsets on a per-feature basis) that enables match-between-runs, and MS1 peaks are now scored by normalized spectral angle relative to the expected isotopic distribution profile. This dramatically improves quantitative performance on ground truth datasets. You are the first person to get to see some new results:

image

Compare this to the performance of the current algorithm: image

This dataset has E.coli proteins spiked in to human lysates at 1.5x, 2x, 2.5x, and 3x concentrations. The new algorithm in Sage accurately extracts precise quantitative ratios for peptides and proteins. image
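
To make the "normalized spectral angle" scoring mentioned above a bit more concrete, here is a minimal Python sketch. It is not Sage's actual code: the function name, the example intensities, and the choice of the common 1 - 2·arccos(cos θ)/π normalization are all assumptions for illustration.

```python
import math


def normalized_spectral_angle(observed, expected):
    """Score how well an observed isotope pattern matches an expected one.

    Returns a value in [0, 1] for non-negative intensities:
    1 means identical shape, 0 means no overlap at all.
    (Hypothetical helper, not taken from the Sage code base.)
    """
    dot = sum(o * e for o, e in zip(observed, expected))
    norm_o = math.sqrt(sum(o * o for o in observed))
    norm_e = math.sqrt(sum(e * e for e in expected))
    if norm_o == 0.0 or norm_e == 0.0:
        return 0.0  # an empty trace cannot match anything
    # Clamp to guard against floating-point overshoot before acos
    cosine = max(-1.0, min(1.0, dot / (norm_o * norm_e)))
    return 1.0 - 2.0 * math.acos(cosine) / math.pi


# Hypothetical M+0..M+3 intensities, purely illustrative numbers
expected_profile = [0.55, 0.30, 0.11, 0.04]
good_peak = [5.2e6, 3.0e6, 1.0e6, 0.5e6]
interfered_peak = [0.4e6, 1.0e6, 3.0e6, 5.0e6]

print(normalized_spectral_angle(good_peak, expected_profile))
print(normalized_spectral_angle(interfered_peak, expected_profile))
```

Because the score depends only on the shape of the pattern and not on its absolute intensity, a candidate MS1 peak with co-eluting interference scores lower than a clean one regardless of abundance.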

  1. The isotope_errors setting essentially shifts the precursor_tol window by whole multiples of the C13 mass offset (one neutron). In the case of 50 ppm tolerance, Sage will search at (-50, 50), (+1 C13 -50ppm, +50ppm), (-1 C13 -50ppm, +50ppm), (-2 C13 -50ppm, +50ppm), etc., applied to the annotated MS1 ion m/z. This can aid in identification of peptides where the monoisotopic peak is misannotated in the mzML file (e.g. sometimes the M+1 isotopologue is called as the monoisotopic peak). Alternatively, you could search with a (-3.5, +1.25 Da)
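
As a rough illustration of how an isotope_errors range expands into several precursor windows, here is a small Python sketch. It is not Sage's implementation: the function name is hypothetical, it works on a neutral precursor mass rather than m/z for simplicity, and the sign convention (isotope error +k means the annotated mass is k C13 offsets too heavy) is an assumption.

```python
# Approximate mass difference between 13C and 12C, in Da
C13_C12_MASS_DIFF = 1.0033548


def precursor_windows(precursor_mass, tol_ppm, isotope_errors):
    """List the (isotope_error, low, high) mass windows to search.

    precursor_mass: annotated neutral precursor mass in Da
    tol_ppm:        symmetric precursor tolerance in ppm
    isotope_errors: (min, max) tuple, e.g. (-1, 3)
    (Hypothetical helper, assumed sign convention; see lead-in.)
    """
    lowest, highest = isotope_errors
    windows = []
    for k in range(lowest, highest + 1):
        # Undo k C13 offsets, then apply the ppm tolerance around that mass
        shifted = precursor_mass - k * C13_C12_MASS_DIFF
        delta = shifted * tol_ppm * 1e-6
        windows.append((k, shifted - delta, shifted + delta))
    return windows


for k, low, high in precursor_windows(1000.0, 50.0, (-1, 3)):
    print(f"isotope error {k:+d}: {low:.4f} to {high:.4f} Da")
```

With [-1, 3] this yields five narrow windows, one per isotope error, rather than a single wide window that spans all of them.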
jamesknight21 commented 1 year ago

Dear Mike,

Thank you for sharing and keeping the development process transparent!!!

It is very exciting to see that you are able to keep it performant while improving quantification accuracy! I will continue testing it and provide you with some more requests and feedback. FYI, I have already implemented it on our HPC and it blazed through ~800 files.

Sincerely, James

jamesknight21 commented 1 year ago

Sorry, I accidentally closed it.

lazear commented 1 year ago

Hi James, I've just added support for additional fragment ion kinds - note that this hasn't been "officially" released yet - I haven't had a chance to test it out on any EThCD experiments yet. If you are interested, compiling the latest commit from the master branch would enable you to test this out.

{
    "database": {
      "enzyme": {
        "missed_cleavages": 1,
        "min_len": 7,
        "max_len": 30,
        "cleave_at": "KR",
        "restrict": "P"
      },
      "fragment_min_mz": 150.0,
      "fragment_max_mz": 1500.0,
      "peptide_min_mass": 500.0,
      "peptide_max_mass": 5000.0,
      "ion_kinds": ["x", "b", "y"]
    }
...

Currently, the implementation for scoring of fragment ions treats (a/b/c) and (x/y/z) ions as groups. Any a-/c- ions will be counted as b-ions for the purposes of scoring (matched_b, longest_b features for PSM rescoring), and likewise for x-/y-/z-. I have never analyzed any EtHCD data, and I'm not sure how other search engines handle this - does this seem like a reasonable approach, or would you prefer something else?
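
The grouping described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Rust code in Sage: the mapping table and function name are hypothetical, and only the matched_b / matched_y style counts are shown.

```python
# N-terminal kinds (a/b/c) are pooled as "b", C-terminal kinds (x/y/z) as "y"
ION_GROUP = {"a": "b", "b": "b", "c": "b", "x": "y", "y": "y", "z": "y"}


def count_matched(matched_ion_kinds):
    """Count matched fragments per scoring group.

    matched_ion_kinds: iterable of ion-kind letters, one per matched peak
    (Hypothetical helper for illustration only.)
    """
    counts = {"matched_b": 0, "matched_y": 0}
    for kind in matched_ion_kinds:
        counts["matched_" + ION_GROUP[kind]] += 1
    return counts


# An EThCD-like PSM with one b, two c, one y and one z match
print(count_matched(["b", "c", "c", "y", "z"]))
```

Pooling keeps the rescoring feature set the same size no matter which ion kinds are searched, at the cost of not distinguishing, say, c-ion evidence from b-ion evidence.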

lazear commented 1 year ago

I was able to test out an EThCD file from MSV000080008 (I used 20151216_07_HeLa_EThCD.mzML). We get excellent agreement with Comet when searching for b/c/y/z ions with ADP-ribosylation as a variable mod. I will roll out a new release build (v0.11.1) which should make it easier to test out.

image

lazear commented 1 year ago

I am going to close this as completed, please feel free to reopen the issue if needed!