lazear / sage

Proteomics search & quantification so fast that it feels like magic
https://sage-docs.vercel.app
MIT License
224 stars 44 forks source link

Support for SPS-MS3 quantification #68

Closed radusuciu closed 1 year ago

radusuciu commented 1 year ago

Hi, I recently started working on putting together a TMT pipeline and am evaluating Sage as well as Comet/OpenMS/custom approach. My question has to do with processing SPS-MS3 data. I can see you've got this in the works but wanted to open a discussion after encountering an error when attempting to process a file using IsobaricAnalyzer.

Would sage also have issues with quantifying data from these mzMLs which lack isolation window offsets for SPS masses? Here's an example:

<isolationWindow>
  <cvParam cvRef="MS" accession="MS:1000827" value="577.298034667969" name="isolation window target m/z" unitAccession="MS:1000040" unitName="m/z" unitCvRef="MS" />
  <cvParam cvRef="MS" accession="MS:1000828" value="-0.5" name="isolation window lower offset" unitAccession="MS:1000040" unitName="m/z" unitCvRef="MS" />
  <cvParam cvRef="MS" accession="MS:1000829" value="-0.5" name="isolation window upper offset" unitAccession="MS:1000040" unitName="m/z" unitCvRef="MS" />
</isolationWindow>

Thank you for your continued work on sage!

lazear commented 1 year ago

I'm pretty sure I used ThermoRawFileParser for SPS-MS3 when I wrote that code... Lemme dig into a couple files and see if I see the same isolation window defect (that does appear to be a bug/invalid behavior from TRFP, imo).

I could certainly add in a check for this behavior and override it (to some reasonable value, say -1.2 to 1.2 m/z?). I can also resurrect the SPS-MS3 code. I commented it out a while back when I switched from using integer scan numbers to including the full spectrum title (e.g. "controllerType=0 scan=3169") - this is used to lookup the proper MS2 scan.

Note that the commented out code was just for calculating SPS precursor purity - Sage will currently quantify SPS-MS3 data, but will not report SPS purity. Shouldn't be any issue with these files, since the isolation window is only needed to calculate purity. I also use TRFP for SPS-MS3 data all the time.

radusuciu commented 1 year ago

Thank you for the reply and for confirming quantification of SPS-MS3 should be fine -- lack of SPS precursor purity isn't a big deal!

lazear commented 1 year ago

Lemme know what you find! Definitely want to make TMT workflows as good as possible.

FYI, I've had good success setting fragment_tol much lower than what is frequently used in the literature (0.6 Da -> ~300 ppm)

radusuciu commented 1 year ago

Hey, so here's a question. Given the below config.json, sage (version 0.13.1, hot off the press) finishes searching in 41 seconds and outputs reasonable looking files, but according to the output there are 0 PSMs/peptides/proteins at 1% FDR (see full output below).

Am I misconfiguring something?

{
    "database": {
        "bucket_size": 16384,
        "fragment_min_mz": 150.0,
        "fragment_max_mz": 1500.0,
        "enzyme": {
            "missed_cleavages": 1,
            "cleave_at": "KR",
            "restrict": "P"
        },
        "static_mods": { 
            "^": 304.207,
            "K": 304.207,
            "C": 57.0215 
        },
        "variable_mods": { 
            "M": 15.9949
        },
        "decoy_tag": "rev_",
        "generate_decoys": true,
        "fasta": "UP000005640_9606.fasta"
    },
    "output_directory": "output_tmt18_literature",
    "quant": {
        "tmt": "Tmt18",
        "tmt_settings": {
            "level": 3,
            "sn": false
        },
        "lfq": false
    },
    "deisotope": false,
    "chimera": false,
    "max_fragment_charge": 1,
    "report_psms": 1,
    "precursor_tol": {
        "ppm": [
            -50,
            50
        ]
    },
    "fragment_tol": {
        "ppm": [
            -10,
            10
        ]
    },
    "isotope_errors": [
        -1,
        3
    ],
    "mzml_paths": [
        "thermorawfileparser/1.4.2/ea05148_02042021_T18_F1.mzML"
    ]
}

Output:

[2023-05-19T00:59:47Z INFO  sage] generated 60073946 fragments, 4277706 peptides in 8553ms
[2023-05-19T00:59:47Z INFO  sage] processing files 0 .. 1 
[2023-05-19T01:00:11Z INFO  sage]  - file IO:    15011 ms
[2023-05-19T01:00:11Z INFO  sage]  - search:      9035 ms (49964 spectra)
[2023-05-19T01:00:11Z INFO  sage_core::ml::retention_alignment] aligning file #0: y = 1.0000x + 0.0000
[2023-05-19T01:00:11Z INFO  sage_core::ml::retention_alignment] aligned retention times across 1 files
[2023-05-19T01:00:11Z INFO  sage_core::ml::retention_model] - fit retention time model, rsq = NaN
[2023-05-19T01:00:19Z INFO  sage] discovered 0 target peptide-spectrum matches at 1% FDR
[2023-05-19T01:00:19Z INFO  sage] discovered 0 target peptides at 1% FDR
[2023-05-19T01:00:19Z INFO  sage] discovered 0 target proteins at 1% FDR

File downloaded from this dataset: https://www.ebi.ac.uk/pride/archive/projects/PXD024275 Full URL: https://ftp.pride.ebi.ac.uk/pride/data/archive/2021/04/PXD024275/ea05148_02042021_T18_F1.raw Converted with ThermoRawFileParser 1.4.2

Thanks!

lazear commented 1 year ago

We used a 50-ppm precursor ion tolerance and 0.9 Da product ion tolerance for MS2 scans collected in the ion trap and 0.02 Da product ion tolerance for MS2 scans collected in the Orbitrap.

10ppm fragment tolerance is probably just a tad too narrow here 😉 Perhaps start with 0.9 Da and then move it down? I like to make a histogram of MS2 fragment tolerance vs decoys and see if it can be generally tighter (sns.displot(data=df, x='fragment_ppm', hue='label')). I usually find that (on our instruments at least) even 0.6 Da is wider than necessary for searching MS3 datasets, and can produce sub-optimal results.

lazear commented 1 year ago

Using 10ppm, I also get 0 PSMs for that file.

With 0.9 Da:

[2023-05-19T01:43:24Z INFO  sage] discovered 5980 target peptide-spectrum matches at 1% FDR
[2023-05-19T01:43:24Z INFO  sage] discovered 5235 target peptides at 1% FDR
[2023-05-19T01:43:24Z INFO  sage] discovered 3123 target proteins at 1% FDR

With 300 ppm:

[2023-05-19T01:45:30Z INFO  sage] discovered 6920 target peptide-spectrum matches at 1% FDR
[2023-05-19T01:45:30Z INFO  sage] discovered 6040 target peptides at 1% FDR
[2023-05-19T01:45:30Z INFO  sage] discovered 3467 target proteins at 1% FDR

You might get difference results, given that we are using slightly different fasta files.

radusuciu commented 1 year ago

Makes sense, thank you for pointing that out!