lazear / sage

Proteomics search & quantification so fast that it feels like magic
https://sage-docs.vercel.app
MIT License

DIA from timsconvert not working #150

Closed animesh closed 2 months ago

animesh commented 2 months ago

I am trying to process a couple of timsTOF Pro raw files from a MaxQuant analysis. The data was converted with timsconvert and then searched with sage from ghcr.io/lazear/sage:v0.14.7

docker run --rm -it -v Z:/HeLa:/data ghcr.io/lazear/sage:v0.14.7
root@5885dd6c4980:/app# ./sage /data/sage.json -f /data/human_crap.fasta -o /data /data/26june24_hel200_100spd_OT_1ulirt_S2-F2_1_6368.mzML /data/26june24_hel200_100spd_OT_1ulirt_S2-G2_1_6369.mzML
[2024-07-30T11:59:42Z INFO  sage] generated 135193262 fragments, 5604466 peptides in 39861ms
[2024-07-30T11:59:42Z INFO  sage] processing files 0 .. 2 
[2024-07-30T12:04:10Z INFO  sage] - file IO:   267812 ms
[2024-07-30T12:04:41Z INFO  sage] - search:     30939 ms (1458 spectra/s)
[2024-07-30T12:04:41Z INFO  sage_core::ml::retention_alignment] aligning file #0: y = 1.0000x + 0.0000
[2024-07-30T12:04:41Z INFO  sage_core::ml::retention_alignment] aligning file #1: y = 1.0000x + 0.0000
[2024-07-30T12:04:41Z INFO  sage_core::ml::retention_alignment] aligned retention times across 2 files
[2024-07-30T12:04:41Z INFO  sage_core::ml::retention_model] - fit retention time model, rsq = NaN
[2024-07-30T12:04:41Z INFO  sage_core::ml::mobility_model] - fit mobility model, rsq = NaN, mse = NaN
[2024-07-30T12:05:18Z INFO  sage_core::lfq] tracing MS1 features
[2024-07-30T12:05:21Z INFO  sage_core::lfq] integrating MS1 features
[2024-07-30T12:05:21Z INFO  sage] discovered 302 target MS1 peaks at 5% FDR
[2024-07-30T12:05:21Z INFO  sage] discovered 4087 target peptide-spectrum matches at 1% FDR
[2024-07-30T12:05:21Z INFO  sage] discovered 407 target peptides at 1% FDR
[2024-07-30T12:05:21Z INFO  sage] discovered 340 target proteins at 1% FDR
{
  "version": "0.14.6",
  "database": {
    "bucket_size": 8192,
    "enzyme": {
      "missed_cleavages": 2,
      "min_len": 7,
      "max_len": 50,
      "cleave_at": "KR",
      "restrict": "P",
      "c_terminal": null,
      "semi_enzymatic": null
    },
    "fragment_min_mz": 150.0,
    "fragment_max_mz": 2000.0,
    "peptide_min_mass": 500.0,
    "peptide_max_mass": 5000.0,
    "ion_kinds": [
      "b",
      "y"
    ],
    "min_ion_index": 2,
    "static_mods": {},
    "variable_mods": {},
    "max_variable_mods": 3,
    "decoy_tag": "rev_",
    "generate_decoys": true,
    "fasta": "/data/human_crap.fasta"
  },
  "quant": {
    "tmt": null,
    "tmt_settings": {
      "level": 3,
      "sn": false
    },
    "lfq": true,
    "lfq_settings": {
      "peak_scoring": "Hybrid",
      "integration": "Sum",
      "spectral_angle": 0.6,
      "ppm_tolerance": 5.0,
      "combine_charge_states": true
    }
  },
  "precursor_tol": {
    "ppm": [
      -20.0,
      20.0
    ]
  },
  "fragment_tol": {
    "ppm": [
      -20.0,
      20.0
    ]
  },
  "precursor_charge": [
    2,
    4
  ],
  "isotope_errors": [
    0,
    2
  ],
  "deisotope": true,
  "chimera": true,
  "wide_window": true,
  "min_peaks": 15,
  "max_peaks": 150,
  "max_fragment_charge": 1,
  "min_matched_peaks": 4,
  "report_psms": 5,
  "predict_rt": true,
  "mzml_paths": [
    "/data/26june24_hel200_100spd_OT_1ulirt_S2-F2_1_6368.mzML",
    "/data/26june24_hel200_100spd_OT_1ulirt_S2-G2_1_6369.mzML"
  ],
  "output_paths": [
    "/data/results.sage.tsv",
    "/data/lfq.tsv",
    "/data/results.json"
  ]
}
[2024-07-30T12:05:21Z INFO  sage] finished in 378s
[2024-07-30T12:05:21Z INFO  sage] cite: "Sage: An Open-Source Tool for Fast Proteomics Searching and Quantification at Scale" https://doi.org/10.1021/acs.jproteome.3c00486
root@5885dd6c4980:/app# 

But the results (lfq - Copy.txt) are nowhere close to what I expected. The parameter file I am using (sage.json) incorporates the suggestions from https://sage-docs.vercel.app/docs/configuration/tolerance#wide-window-mode. Apart from PTMs, is there something I am missing that needs to be included for the analysis, specifically for DIA?

animesh commented 2 months ago

Just to update: the yield got a bit lower (lfq - Copy (2).txt) with the latest compiled image.

docker run --rm -it -v /mnt/z/HeLa:/data animesh1977/sage /data/sage.json -f /data/human_crap.fasta -o /data /data/26june24_hel200_100spd_OT_1ulirt_S2-F2_1_6368.mzML /data/26june24_hel200_100spd_OT_1ulirt_S2-G2_1_6369.mzML
[sudo] password for ash022: 
[2024-07-30T12:15:53Z INFO  sage] generated 187444336 fragments, 5604466 peptides in 41559ms
[2024-07-30T12:15:53Z INFO  sage] processing files 0 .. 2 
[2024-07-30T12:20:20Z INFO  sage] - file IO:   267555 ms
[2024-07-30T12:20:55Z INFO  sage] - search:     34571 ms (1305 spectra/s)
[2024-07-30T12:20:55Z INFO  sage_core::ml::retention_alignment] aligning file #0: y = 1.0000x + 0.0000
[2024-07-30T12:20:55Z INFO  sage_core::ml::retention_alignment] aligning file #1: y = 1.0000x + 0.0000
[2024-07-30T12:20:55Z INFO  sage_core::ml::retention_alignment] aligned retention times across 2 files
[2024-07-30T12:20:55Z INFO  sage_core::ml::retention_model] - fit retention time model, rsq = NaN
[2024-07-30T12:20:55Z INFO  sage_core::ml::mobility_model] - fit mobility model, rsq = NaN, mse = NaN
[2024-07-30T12:21:31Z INFO  sage_core::lfq] tracing MS1 features
[2024-07-30T12:21:34Z INFO  sage_core::lfq] integrating MS1 features
[2024-07-30T12:21:34Z INFO  sage] discovered 274 target MS1 peaks at 5% FDR
[2024-07-30T12:21:34Z INFO  sage] discovered 3960 target peptide-spectrum matches at 1% FDR
[2024-07-30T12:21:34Z INFO  sage] discovered 371 target peptides at 1% FDR
[2024-07-30T12:21:34Z INFO  sage] discovered 312 target proteins at 1% FDR
{
  "version": "0.15.0-alpha",
  "database": {
    "bucket_size": 8192,
    "enzyme": {
      "missed_cleavages": 2,
      "min_len": 7,
      "max_len": 50,
      "cleave_at": "KR",
      "restrict": "P",
      "c_terminal": null,
      "semi_enzymatic": null
    },
    "peptide_min_mass": 500.0,
    "peptide_max_mass": 5000.0,
    "ion_kinds": [
      "b",
      "y"
    ],
    "min_ion_index": 2,
    "static_mods": {},
    "variable_mods": {},
    "max_variable_mods": 3,
    "decoy_tag": "rev_",
    "generate_decoys": true,
    "fasta": "/data/human_crap.fasta"
  },
  "quant": {
    "tmt": null,
    "tmt_settings": {
      "level": 3,
      "sn": false
    },
    "lfq": true,
    "lfq_settings": {
      "peak_scoring": "Hybrid",
      "integration": "Sum",
      "spectral_angle": 0.6,
      "ppm_tolerance": 5.0,
      "combine_charge_states": true
    }
  },
  "precursor_tol": {
    "ppm": [
      -20.0,
      20.0
    ]
  },
  "fragment_tol": {
    "ppm": [
      -20.0,
      20.0
    ]
  },
  "precursor_charge": [
    2,
    4
  ],
  "override_precursor_charge": false,
  "isotope_errors": [
    0,
    2
  ],
  "deisotope": true,
  "chimera": true,
  "wide_window": true,
  "min_peaks": 15,
  "max_peaks": 150,
  "max_fragment_charge": 1,
  "min_matched_peaks": 4,
  "report_psms": 5,
  "predict_rt": true,
  "mzml_paths": [
    "/data/26june24_hel200_100spd_OT_1ulirt_S2-F2_1_6368.mzML",
    "/data/26june24_hel200_100spd_OT_1ulirt_S2-G2_1_6369.mzML"
  ],
  "output_paths": [
    "/data/results.sage.tsv",
    "/data/lfq.tsv",
    "/data/results.json"
  ]
}
[2024-07-30T12:21:35Z INFO  sage] finished in 383s
[2024-07-30T12:21:35Z INFO  sage] cite: "Sage: An Open-Source Tool for Fast Proteomics Searching and Quantification at Scale" https://doi.org/10.1021/acs.jproteome.3c00486

lazear commented 2 months ago

Looks like it's working to me.

I'm afraid I don't have the bandwidth to troubleshoot your data. My general suggestions are as follows:

  1. Search with a wider precursor tolerance and no isotope errors.
  2. Search with a wider fragment tolerance.
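
As a rough sketch of what those two suggestions could look like in practice, the snippet below takes a Sage-style config (using only the `precursor_tol`, `fragment_tol`, and `isotope_errors` keys that appear in the config dump above) and returns a copy with wider tolerances and isotope errors disabled. The helper name `widen_tolerances` and the 50 ppm defaults are illustrative assumptions, not recommended values.

```python
import json
from copy import deepcopy


def widen_tolerances(config, precursor_ppm=50.0, fragment_ppm=50.0):
    """Return a copy of a Sage-style config with wider ppm tolerances.

    The 50 ppm defaults are placeholders for a first exploratory search,
    not values taken from the thread.
    """
    widened = deepcopy(config)
    # Symmetric ppm windows, same shape as in the posted config dump.
    widened["precursor_tol"] = {"ppm": [-precursor_ppm, precursor_ppm]}
    widened["fragment_tol"] = {"ppm": [-fragment_ppm, fragment_ppm]}
    # [0, 0] means only the monoisotopic peak is considered.
    widened["isotope_errors"] = [0, 0]
    return widened


# Minimal stand-in for the relevant part of the posted sage.json.
original = {
    "precursor_tol": {"ppm": [-20.0, 20.0]},
    "fragment_tol": {"ppm": [-20.0, 20.0]},
    "isotope_errors": [0, 2],
    "wide_window": True,
}

exploratory = widen_tolerances(original)
print(json.dumps(exploratory, indent=2))
```

To apply this to a real file, load it with `json.load`, pass the resulting dict through the helper, and write it back out with `json.dump` under a new name so the original parameters stay available for comparison.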

Then examine the results file to determine the optimal tolerances to use. In many cases where data "doesn't work", the instrument is out of calibration and the user has supplied tolerances that are too narrow.
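
One way to examine the results file is to look at the distribution of mass errors among confident target PSMs from the wide search. The sketch below assumes the TSV has columns named `label` (1 for targets), `spectrum_q`, `precursor_ppm`, and `fragment_ppm`; check the header of your own results.sage.tsv before relying on those names. The helper names and the 2.5th/97.5th percentile choice are illustrative assumptions.

```python
import csv
import statistics


def load_confident_psms(path, q_cutoff=0.01):
    """Read a tab-separated results file and keep confident target PSMs.

    Assumes 'label' is 1 for targets and 'spectrum_q' holds the
    spectrum-level q-value (column names are an assumption).
    """
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return [
            row
            for row in reader
            if int(row["label"]) == 1 and float(row["spectrum_q"]) <= q_cutoff
        ]


def summarize(values):
    """Return the median and an approximate central 95% range of values."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("no values to summarize")
    low = ordered[int(0.025 * (n - 1))]
    high = ordered[int(0.975 * (n - 1))]
    return {"median": statistics.median(ordered), "low": low, "high": high}


def mass_error_report(path, q_cutoff=0.01):
    """Summarize precursor and fragment ppm errors for confident PSMs."""
    psms = load_confident_psms(path, q_cutoff)
    return {
        "n_psms": len(psms),
        "precursor_ppm": summarize(float(r["precursor_ppm"]) for r in psms),
        "fragment_ppm": summarize(float(r["fragment_ppm"]) for r in psms),
    }
```

Calling `mass_error_report("results.sage.tsv")` on the output of the wide search gives a median (a hint at systematic calibration offset) and a range (a hint at how wide the final tolerance needs to be) for each error type. A median far from zero would be consistent with the calibration problem described above.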