sirius-ms / sirius

SIRIUS is a software for discovering a landscape of de-novo identification of metabolites using tandem mass spectrometry. This repository contains the code of the SIRIUS Software (GUI and CLI)
GNU Affero General Public License v3.0

Sirius only partially imports MS2 spectra from mzML/mzXML files #58

Closed expoexplore closed 2 months ago

expoexplore commented 2 years ago

Hi SIRIUS team,

I tried importing mzML/mzXML files containing MS1/MS2 information into Sirius (4.9.3), but only a small part of the precursors with their respective MS2 information show up in Sirius (some precursors with MS2 are completely missing; for others, only a few of the available MS2 scans show up). I converted the .raw files (centroided) to mzML/mzXML files using MSConvert (3.0.22045). When the files are converted to .mgf files and imported, all precursors with their respective MS2 information show up in Sirius. When using other software (e.g. Thermo XCalibur) I can also see all the precursors with their MS2 spectra. Since I would also like to use the isotopic pattern information from the MS1 scans, I need the mzML/mzXML files. I already tried different files and older versions of MSConvert without success. Is there some filtering step during the import of the data files that I am unaware of, or could there be another problem?

Thank you so much for your help!

eeko-kon commented 2 years ago

Do the missing compounds have a charge other than +1 or -1? SIRIUS can only process singly charged compounds.

Are you certain that the precursors are present in the mzML file? Can you check with an app such as TOPPView?

expoexplore commented 2 years ago

Thank you very much for your answer!

In TOPPView I can see all the precursors and their MS2 spectra in the mzML file. When I convert the file from mzML to mgf I can also see all the precursors, so they should be present in the mzML file. I opened the mzML file with a text editor: all the charge states seem to be 1, and I can also find all the MS2 scans there.
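The manual check in a text editor can also be scripted. The following is a minimal sketch, not part of SIRIUS or MSConvert: it assumes an mzML 1.1 file in which the "ms level", "selected ion m/z" and "charge state" cvParams are written directly on the spectrum and selectedIon elements (files that move them into referenceable parameter groups would need extra handling). The function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

NS = "{http://psi.hupo.org/ms/mzml}"  # mzML 1.1 XML namespace


def cv_value(element, accession):
    """Return the value of the first direct-child cvParam with this accession, or None."""
    for param in element.findall(NS + "cvParam"):
        if param.get("accession") == accession:
            return param.get("value")
    return None


def list_ms2_precursors(root):
    """Return (spectrum id, precursor m/z, charge) for every selected ion of every MS2 spectrum."""
    rows = []
    for spectrum in root.iter(NS + "spectrum"):
        if cv_value(spectrum, "MS:1000511") != "2":  # MS:1000511 = ms level
            continue
        for ion in spectrum.iter(NS + "selectedIon"):
            mz = cv_value(ion, "MS:1000744")      # selected ion m/z
            charge = cv_value(ion, "MS:1000041")  # charge state (may be absent)
            rows.append((
                spectrum.get("id"),
                float(mz) if mz is not None else None,
                int(charge) if charge is not None else None,
            ))
    return rows


def charge_histogram(rows):
    """Count how many precursors were recorded with each charge state."""
    return Counter(charge for _, _, charge in rows)


# Usage (the file name is a placeholder):
# root = ET.parse("Example.mzML").getroot()
# rows = list_ms2_precursors(root)
# print(len(rows), "MS2 precursors;", dict(charge_histogram(rows)))
```

If the count printed here matches what TOPPView shows, the precursors are present in the file and the loss happens at import.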

I added a small example file to show what the problem looks like:

Example File.zip

There should be 16 precursors with one MS2 spectrum each in the file (I can see them all in TOPPView and also in the mzML/mgf file), but after importing the mzML into Sirius, I only get two precursors:

m/z 134.05, RT 1.76 min
m/z 257.08, RT 1.63 min

The others don't show up in Sirius. I checked in the files if those two are different from the others in some way but I could not draw any conclusion from that. If I import the mgf file I converted from the mzML file, all precursors and their MS2 spectra show up in Sirius.

eeko-kon commented 2 years ago

From a very quick look at your mzML file: for 10 out of 16 of your fragmented precursors, the following parameter carries no usable value (it is set to 0):

<userParam name="[Thermo Trailer Extra]Monoisotopic M/Z:" value="0" type="xsd:float"/>

In the correct format, it should contain the m/z of your precursor, e.g.:

<userParam name="[Thermo Trailer Extra]Monoisotopic M/Z:" value="284.08850085825009" type="xsd:float"/>

So SIRIUS probably expects that information. It doesn't explain why you only get 2 precursors in SIRIUS, but it could be a general issue with the file conversion. Since your data are from a Thermo instrument, I would suggest that you use the ThermoRawFileParser for conversion (in case you are using ProteoWizard and it gets a bit confusing).

https://github.com/compomics/ThermoRawFileParser

mono ThermoRawFileParser.exe -i={input} -b={output}

Then give it another try.
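To see exactly which MS2 spectra carry a zero "Monoisotopic M/Z" value, the userParam quoted above can be collected per spectrum. This is a rough sketch under two assumptions: the converter writes the parameter under the exact name shown above, and the "ms level" cvParam sits directly on each spectrum element. The function names are hypothetical, and nothing here reflects how SIRIUS itself reads the file.

```python
import xml.etree.ElementTree as ET

NS = "{http://psi.hupo.org/ms/mzml}"  # mzML 1.1 XML namespace
TRAILER_NAME = "[Thermo Trailer Extra]Monoisotopic M/Z:"  # name as quoted in this thread


def is_ms2(spectrum):
    """True if the spectrum has a direct-child 'ms level' cvParam (MS:1000511) equal to 2."""
    for param in spectrum.findall(NS + "cvParam"):
        if param.get("accession") == "MS:1000511":
            return param.get("value") == "2"
    return False


def monoisotopic_mz_report(root):
    """Map each MS2 spectrum id to its trailer monoisotopic m/z (None if the userParam is absent)."""
    report = {}
    for spectrum in root.iter(NS + "spectrum"):
        if not is_ms2(spectrum):
            continue
        value = None
        for param in spectrum.iter(NS + "userParam"):
            if param.get("name") == TRAILER_NAME:
                value = float(param.get("value"))
                break
        report[spectrum.get("id")] = value
    return report


def unset_monoisotopic(report):
    """Return the ids of MS2 spectra whose monoisotopic m/z is absent or 0."""
    return [sid for sid, value in report.items() if value is None or value == 0.0]


# Usage (the file name is a placeholder):
# report = monoisotopic_mz_report(ET.parse("Example.mzML").getroot())
# print(unset_monoisotopic(report))
```

Comparing this list with the precursors that actually appear after import shows whether the zero values line up with the missing compounds.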

expoexplore commented 2 years ago

Thank you very much for your time and effort!

I looked into the file for the missing "Monoisotopic M/Z" values and checked them for each precursor. Some of the missing precursors have a correct value, while one of the precursors that does show up in Sirius (m/z 257.0777) has a missing "Monoisotopic M/Z" value.

I used the GUI of ThermoRawFileParser (https://github.com/compomics/ThermoRawFileParserGUI, v1.7.0) to convert the raw files to mzML files, but in Sirius only the same two precursors show up again. I could see all precursors in TOPPView, and I also had a look into the files, where the same issue with the missing values for monoisotopic masses was present.

Could it be that the file conversion is correct, but there is some problem with the raw file at the MS1 level? Extracting only the MS2 level by converting into mgf format works without problems.

The Thermo raw file is uploaded here, should anyone be interested:

https://github.com/expoexplore/Raw_file

I also tried some of my older Thermo raw files from the same machine (QExactive HF) and the same problem is occurring.

Is there any way to find out if the problem already starts in the raw file?

eeko-kon commented 2 years ago

Hey, I'm not sure how to extract those files to a raw format. Can you upload the regular (uncompressed) ones to Zenodo or Google Drive and share the link, please?

expoexplore commented 2 years ago

Sorry about that. I have now uploaded the file to Google Drive:

https://drive.google.com/drive/folders/1vA4H9ujbk5tlGmP_sduN5zkr53FKOOvp?usp=sharing

I hope it works now.

eeko-kon commented 2 years ago

It works, thanks a lot. Indeed, there was no issue with your conversion, I am getting the exact same results.

What I can assume is that SIRIUS considers most of your precursors (with fragmentation) as noise. That can be due to very low intensity or a small number of fragments. The fragmentation patterns I see with the highest intensity are from the precursors 134.047 and 257.077.

However, I used OpenMS (which can be used for data preprocessing), lowered the intensity threshold to 1.0, and managed to also get some predictions for the following precursors (see files):

formulas_Example.csv
structures_Example.csv

I hope that helps. Perhaps the SIRIUS people can confirm or reject my assumptions?
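One way to probe the low-intensity hypothesis is to list, for each MS2 spectrum, its peak count and base peak intensity as recorded in the mzML metadata. The sketch below is only an illustration: it assumes the converter wrote a "base peak intensity" cvParam (MS:1000505) on each spectrum, and the thresholds passed to the hypothetical flag_weak helper are arbitrary placeholders, not SIRIUS's or OpenMS's actual noise criteria.

```python
import xml.etree.ElementTree as ET

NS = "{http://psi.hupo.org/ms/mzml}"  # mzML 1.1 XML namespace


def ms2_intensity_summary(root):
    """Return one dict per MS2 spectrum with its id, peak count and base peak intensity."""
    rows = []
    for spectrum in root.iter(NS + "spectrum"):
        # Collect the spectrum-level cvParams by accession.
        params = {p.get("accession"): p.get("value")
                  for p in spectrum.findall(NS + "cvParam")}
        if params.get("MS:1000511") != "2":  # MS:1000511 = ms level
            continue
        base_peak = params.get("MS:1000505")  # base peak intensity, may be absent
        rows.append({
            "id": spectrum.get("id"),
            "n_peaks": int(spectrum.get("defaultArrayLength", 0)),
            "base_peak_intensity": float(base_peak) if base_peak is not None else None,
        })
    return rows


def flag_weak(rows, min_base_peak, min_peaks):
    """Return ids of spectra below either placeholder threshold (not SIRIUS's real rule)."""
    weak = []
    for row in rows:
        intensity = row["base_peak_intensity"]
        too_faint = intensity is not None and intensity < min_base_peak
        too_few = row["n_peaks"] < min_peaks
        if too_faint or too_few:
            weak.append(row["id"])
    return weak


# Usage (file name and thresholds are placeholders):
# rows = ms2_intensity_summary(ET.parse("Example.mzML").getroot())
# print(flag_weak(rows, min_base_peak=1000.0, min_peaks=5))
```

If the two precursors that do import (134.047 and 257.077) stand out clearly from the rest in this summary, that would be consistent with the noise explanation, although it would not confirm it.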

expoexplore commented 2 years ago

That sounds like a plausible explanation, since the precursors do indeed have a very low intensity.

Thank you very much for the files and the effort!