lazear / sage

Proteomics search & quantification so fast that it feels like magic
https://sage-docs.vercel.app
MIT License

Convert mgf to mzml #76

Closed gsaxena888 closed 1 year ago

gsaxena888 commented 1 year ago

I have mgf files only (no mzML). I'm thinking of using msconvert (on Linux) to convert the mgf to mzML. But I know that some programs don't always work 100% properly when that occurs (due to some expectation of what should be in an mzML file, etc.). Is there any known or suspected issue that might arise if I take a regular, simple mgf file and try to convert it to mzML? (The reason for this need: the mgf is a pseudo-spectrum file, generated from DIA data in a manner similar to how DIA-Umpire works.)
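One cheap check before converting is to confirm that every mgf entry carries precursor fields, since a converter can only write precursor information that is present in the input. The sketch below assumes the usual mgf layout (BEGIN IONS / END IONS blocks with KEY=value header lines such as PEPMASS and CHARGE). The function name is hypothetical, and whether msconvert or Sage strictly requires both fields is an assumption, not something confirmed in this thread.

```python
from typing import Iterable, List, Tuple


def find_incomplete_mgf_entries(lines: Iterable[str]) -> List[Tuple[int, List[str]]]:
    """Return (entry_index, missing_keys) for entries lacking PEPMASS or CHARGE.

    Assumption: both keys are set per entry. An mgf file may also declare
    CHARGE once in a global header, which this sketch does not handle.
    """
    required = ("PEPMASS", "CHARGE")
    problems = []
    index = -1
    seen = None  # None while outside a BEGIN IONS / END IONS block
    for raw in lines:
        line = raw.strip()
        if line == "BEGIN IONS":
            index += 1
            seen = set()
        elif line == "END IONS":
            if seen is not None:
                missing = [key for key in required if key not in seen]
                if missing:
                    problems.append((index, missing))
            seen = None
        elif seen is not None and "=" in line:
            # Header lines look like KEY=value; peak lines contain no "=".
            seen.add(line.split("=", 1)[0].strip().upper())
    return problems


# Made-up two-entry example; the second entry has no precursor fields.
example = """BEGIN IONS
TITLE=pseudo_spectrum_1
PEPMASS=500.25
CHARGE=2+
100.0 10.0
END IONS
BEGIN IONS
TITLE=pseudo_spectrum_2
100.0 10.0
END IONS
""".splitlines()

print(find_incomplete_mgf_entries(example))
```

For a real file, the same function can be fed an open file handle, e.g. `find_incomplete_mgf_entries(open("input.mgf"))`, where the file name is a placeholder.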

lazear commented 1 year ago

Without having the mgf files, I can't say - but assuming they produce valid mzMLs, Sage should be able to handle them. I have successfully converted mzXMLs to mzMLs and searched them. If you find that they don't work, please let me know and I will push out a fix.

Also, you can directly search DIA data (no DIA-specific quant yet though) with Sage 😃 - I have successfully searched data from TTOFs, Orbitraps, and even the Astral. Be warned that performance may not be up to par with other tools at this time, since Sage is not specifically designed for searching DIA data.

Something like the following params typically works well:

{
  "chimera": true,
  "wide_window": true,
  "max_fragment_charge": 2,
  "report_psms": 5
}
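As a minimal sketch of how these parameters could be applied, the snippet below parses them with `json.loads` (which also acts as a syntax check, since it raises on malformed JSON) and overlays them on an existing config. The `with_dia_overrides` helper is hypothetical, and the base values are copied from the config dump that appears later in this thread.

```python
import json

# The suggested DIA-style parameters, parsed to confirm they are valid JSON.
DIA_OVERRIDES = json.loads("""
{
  "chimera": true,
  "wide_window": true,
  "max_fragment_charge": 2,
  "report_psms": 5
}
""")


def with_dia_overrides(base: dict) -> dict:
    """Return a copy of `base` with the DIA-style parameters applied on top."""
    merged = dict(base)
    merged.update(DIA_OVERRIDES)
    return merged


# Top-level values taken from the config dump later in this thread.
base = {
    "chimera": False,
    "wide_window": False,
    "max_fragment_charge": 1,
    "report_psms": 1,
    "min_peaks": 15,
}

merged = with_dia_overrides(base)
print(json.dumps(merged, indent=2))
```

Keys that the overrides do not mention (here `min_peaks`) pass through unchanged, so the same helper can be reused on a full config file.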

gsaxena888 commented 11 months ago

So I converted a portion of an mgf file to mzML using msconvert, but when I tried to run it through Sage, it errored out. (Note: I believe a similar mgf-to-mzML conversion that I did years ago for MSFragger worked fine; also, the error message from Sage said that there was no MS1 info, but I did notice some basic MS1 info in the mzML file.) Here are the small mgf, the converted mzML (via msconvert running on Linux), and the fasta and simple config files:

<deleted attachment; see next comment>

And the full error message was:

bash-5.1# sage config2.json 
[2023-07-18T20:58:16Z INFO  sage] generated 722 fragments in 344ms
thread '<unnamed>' panicked at 'missing precursor information for MS2 scan, please check input files!', src/spectrum.rs:220:14
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Aborted (core dumped)

Thoughts?
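One way to narrow down an error like this is to list the MS2 spectra in the mzML that carry no selected-ion m/z. The sketch below uses only the Python standard library and the PSI-MS accessions MS:1000511 (ms level) and MS:1000744 (selected ion m/z). It is an assumption that this matches what Sage checks at src/spectrum.rs:220; the function name and the stripped-down sample document are illustrative only.

```python
import io
import xml.etree.ElementTree as ET

NS = "{http://psi.hupo.org/ms/mzml}"
MS_LEVEL = "MS:1000511"         # PSI-MS accession for "ms level"
SELECTED_ION_MZ = "MS:1000744"  # PSI-MS accession for "selected ion m/z"
SELECTED_ION_PATH = (
    f"{NS}precursorList/{NS}precursor/{NS}selectedIonList/{NS}selectedIon/{NS}cvParam"
)


def ms2_scans_missing_precursor(source):
    """Return ids of MS2 spectra that have no selected-ion m/z cvParam.

    `source` is a file path or a binary file object. The ms level is read
    only from cvParams that are direct children of <spectrum>.
    """
    missing = []
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != NS + "spectrum":
            continue
        level = None
        for cv in elem.findall(NS + "cvParam"):
            if cv.get("accession") == MS_LEVEL:
                level = int(cv.get("value"))
        has_mz = any(
            cv.get("accession") == SELECTED_ION_MZ
            for cv in elem.findall(SELECTED_ION_PATH)
        )
        if level == 2 and not has_mz:
            missing.append(elem.get("id"))
        elem.clear()  # free memory once a spectrum has been inspected
    return missing


# Stripped-down, made-up fragment (not a schema-complete mzML file).
sample = b"""<?xml version="1.0" encoding="utf-8"?>
<mzML xmlns="http://psi.hupo.org/ms/mzml">
  <run id="demo">
    <spectrumList count="2">
      <spectrum id="scan=1" index="0" defaultArrayLength="0">
        <cvParam cvRef="MS" accession="MS:1000511" name="ms level" value="2"/>
        <precursorList count="1">
          <precursor>
            <selectedIonList count="1">
              <selectedIon>
                <cvParam cvRef="MS" accession="MS:1000744" name="selected ion m/z" value="500.25"/>
              </selectedIon>
            </selectedIonList>
          </precursor>
        </precursorList>
      </spectrum>
      <spectrum id="scan=2" index="1" defaultArrayLength="0">
        <cvParam cvRef="MS" accession="MS:1000511" name="ms level" value="2"/>
      </spectrum>
    </spectrumList>
  </run>
</mzML>
"""

print(ms2_scans_missing_precursor(io.BytesIO(sample)))
```

An empty list for the real input.mzML would suggest the file itself is fine and that the difference lies elsewhere, such as the Sage version or how the file is read.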

gsaxena888 commented 11 months ago

Please use this attachment and DELETE the previous one: allFiles.tar.gz @lazear

lazear commented 11 months ago

I am able to search that mzML fine. Which version of Sage are you using?

Input

#!/bin/bash
wget https://github.com/lazear/sage/releases/download/v0.13.3/sage-v0.13.3-x86_64-unknown-linux-gnu.tar.gz
wget https://github.com/lazear/sage/files/12089000/allFiles.tar.gz
tar xvf allFiles.tar.gz
tar xvf sage-v0.13.3-x86_64-unknown-linux-gnu.tar.gz

SAGE_LOG=trace ./sage-v0.13.3-x86_64-unknown-linux-gnu/sage config2.json -f fasta_with_decoy.fasta input.mzML

Output

[2023-07-18T22:52:28Z TRACE sage_core::database] modifying peptides
[2023-07-18T22:52:31Z TRACE sage_core::database] sorting and deduplicating peptides
[2023-07-18T22:52:32Z TRACE sage_core::database] generating fragments
[2023-07-18T22:52:32Z TRACE sage_core::database] finalizing index
[2023-07-18T22:52:33Z INFO  sage] generated 104076437 fragments, 6153956 peptides in 5711ms
[2023-07-18T22:52:33Z INFO  sage] processing files 0 .. 1
[2023-07-18T22:52:33Z TRACE sage]  - input.mzML: read 476 spectra
[2023-07-18T22:52:33Z INFO  sage]  - file IO:       50 ms
[2023-07-18T22:52:33Z INFO  sage]  - search:        65 ms (476 spectra)
[2023-07-18T22:52:33Z INFO  sage_core::ml::retention_alignment] aligning file #0: y = 1.0000x + 0.0000
[2023-07-18T22:52:33Z INFO  sage_core::ml::retention_alignment] aligned retention times across 1 files
[2023-07-18T22:52:33Z INFO  sage_core::ml::retention_model] - fit retention time model, rsq = NaN
[2023-07-18T22:52:33Z TRACE sage_core::ml::linear_discriminant] fitting linear discriminant model...
[2023-07-18T22:52:33Z TRACE sage_core::ml::linear_discriminant] - linear model fit with {"rank": -0.0, "charge": -0.008937116197891792, "ln1p(hyperscore)": 0.02537321405895736, "ln1p(delta_next)": 0.0008479605715541809, "ln1p(delta_best)": -0.0, "delta_mass_model": -0.2984225949733574, "isotope_error": -0.0027129223649496534, "average_ppm": -0.005395587524408603, "ln1p(-poisson)": -0.04070435342311652, "ln1p(matched_intensity_pct)": 0.0007503495236815217, "ln1p(matched_peaks)": 0.004289864991880172, "ln1p(longest_b)": 0.004771601232359431, "ln1p(longest_y)": -0.14894784710117093, "longest_y_pct": 0.8432765693783117, "ln1p(peptide_len)": -0.09478725602175102, "missed_cleavages": -0.0008780636293367089, "rt": -0.21546556691038277, "sqrt(delta_rt_model)": 0.3460821778822863}
[2023-07-18T22:52:33Z TRACE sage_core::ml::linear_discriminant] - fitting non-parametric model for posterior error probabilities
[2023-07-18T22:52:39Z INFO  sage] discovered 0 target peptide-spectrum matches at 1% FDR
[2023-07-18T22:52:39Z INFO  sage] discovered 0 target peptides at 1% FDR
[2023-07-18T22:52:39Z INFO  sage] discovered 0 target proteins at 1% FDR
[2023-07-18T22:52:39Z TRACE sage] writing outputs
{
  "version": "0.13.3",
  "database": {
    "bucket_size": 16384,
    "enzyme": {
      "missed_cleavages": 1,
      "min_len": null,
      "max_len": null,
      "cleave_at": "KR",
      "restrict": "P",
      "c_terminal": null
    },
    "fragment_min_mz": 150.0,
    "fragment_max_mz": 1500.0,
    "peptide_min_mass": 500.0,
    "peptide_max_mass": 5000.0,
    "ion_kinds": [
      "b",
      "y"
    ],
    "min_ion_index": 2,
    "static_mods": {
      "C": 57.0216
    },
    "variable_mods": {},
    "max_variable_mods": 2,
    "decoy_tag": "rev_",
    "generate_decoys": true,
    "fasta": "fasta_with_decoy.fasta"
  },
  "quant": {
    "tmt": null,
    "tmt_settings": {
      "level": 3,
      "sn": false
    },
    "lfq": false,
    "lfq_settings": {
      "peak_scoring": "Hybrid",
      "integration": "Sum",
      "spectral_angle": 0.7,
      "ppm_tolerance": 5.0
    }
  },
  "precursor_tol": {
    "ppm": [
      -50.0,
      50.0
    ]
  },
  "fragment_tol": {
    "ppm": [
      -10.0,
      10.0
    ]
  },
  "isotope_errors": [
    -1,
    3
  ],
  "deisotope": true,
  "chimera": false,
  "wide_window": false,
  "min_peaks": 15,
  "max_peaks": 150,
  "max_fragment_charge": 1,
  "min_matched_peaks": 4,
  "report_psms": 1,
  "predict_rt": true,
  "mzml_paths": [
    "input.mzML"
  ],
  "output_paths": [
    "/mnt/d/Github/sage/issue76/results.sage.tsv",
    "/mnt/d/Github/sage/issue76/results.json"
  ]
}

gsaxena888 commented 11 months ago

I was using Docker, i.e., sudo docker pull ghcr.io/lazear/sage:master

lazear commented 11 months ago

Can you share the full setup you used, e.g., volume mounts, and anything else that might help me troubleshoot?

gsaxena888 commented 11 months ago

@lazear the non-docker version seems to work fine so far!

lazear commented 11 months ago

That is still somewhat concerning... they should behave identically (unless something is up with the volume mounts, etc.). I use the Docker image via AWS Batch without issues, but admittedly haven't done much testing using Docker locally.
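Since Sage echoes its resolved configuration (including a "version" field) as JSON, one way to compare a Docker run with a native run is to diff the two dumps. The sketch below is a generic nested-dictionary diff. The `diff_configs` helper, the "0.0.0-example" version string, and the /data path are all hypothetical placeholders, not values from either run in this thread.

```python
_MISSING = "<missing>"  # marker for a key present in only one config


def diff_configs(a: dict, b: dict, prefix: str = "") -> dict:
    """Return {dotted.key: (value_in_a, value_in_b)} for every differing leaf."""
    diffs = {}
    for key in sorted(set(a) | set(b)):
        path = f"{prefix}{key}"
        left = a.get(key, _MISSING)
        right = b.get(key, _MISSING)
        if isinstance(left, dict) and isinstance(right, dict):
            # Recurse into nested sections such as "database" or "quant".
            diffs.update(diff_configs(left, right, path + "."))
        elif left != right:
            diffs[path] = (left, right)
    return diffs


# In practice these would come from json.load() on each run's results.json.
native = {
    "version": "0.13.3",
    "chimera": False,
    "database": {"bucket_size": 16384, "decoy_tag": "rev_"},
}
docker = {
    "version": "0.0.0-example",  # placeholder, not a real observed version
    "chimera": False,
    "database": {"bucket_size": 16384, "decoy_tag": "rev_"},
    "mzml_paths": ["/data/input.mzML"],  # placeholder container path
}

for path, (left, right) in diff_configs(native, docker).items():
    print(f"{path}: native={left!r} docker={right!r}")
```

A differing "version" would point at the image tag, while differing paths would point at the volume mounts.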