vdemichev / DiaNN

DIA-NN - a universal automated software suite for DIA proteomics data analysis.

Slightly better results with .mzML than .raw format from the same file #659

Open rolivella opened 1 year ago

rolivella commented 1 year ago

Hi again!

We tested DIA-NN 1.8 (from the command line on Linux) with the same file and observed different results depending on whether we used the .mzML or the .raw file. For instance, this is the number of precursors identified and the sequence overlap for 001.raw compared to 001.mzML:

[Image: thumbnail_image002 (precursors identified and sequence overlap, 001.raw vs. 001.mzML)]

Do you know the reason for this variance?

We also observed that we systematically get around 15% more identifications when using .mzML.
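A minimal sketch of how such a precursor-count and overlap comparison could be reproduced from two DIA-NN main reports. It assumes tab-separated reports with `Precursor.Id` and `Q.Value` columns and a 1% q-value cutoff; the function names and the tiny in-memory reports are hypothetical and only there to make the example runnable.

```python
import csv
import io


def precursor_ids(handle, q_max=0.01):
    """Collect Precursor.Id values with Q.Value <= q_max from an open report.

    Assumes a tab-separated report with 'Precursor.Id' and 'Q.Value' columns.
    """
    ids = set()
    for row in csv.DictReader(handle, delimiter="\t"):
        if float(row["Q.Value"]) <= q_max:
            ids.add(row["Precursor.Id"])
    return ids


def compare(raw_ids, mzml_ids):
    """Summarise how two precursor sets overlap."""
    shared = raw_ids & mzml_ids
    union = raw_ids | mzml_ids
    return {
        "raw_only": len(raw_ids - mzml_ids),
        "mzml_only": len(mzml_ids - raw_ids),
        "shared": len(shared),
        "jaccard": len(shared) / len(union) if union else 1.0,
    }


# Hypothetical miniature reports standing in for the real report files.
raw_report = io.StringIO(
    "Precursor.Id\tQ.Value\nPEPTIDEK2\t0.001\nSAMPLER2\t0.005\nLOWCONFK3\t0.2\n"
)
mzml_report = io.StringIO(
    "Precursor.Id\tQ.Value\nPEPTIDEK2\t0.002\nANOTHERK2\t0.004\n"
)

summary = compare(precursor_ids(raw_report), precursor_ids(mzml_report))
print(summary)
```

For real data, the same functions can be called on file handles opened with `open(path, newline="")` for the .raw-based and the .mzML-based report.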

The files were converted with ThermoRawFileParser (https://github.com/compomics/ThermoRawFileParser) using this command line:

ThermoRawFileParser.sh -i=001.raw -f=2 -o ./

If you want, I can share the original files by private message.

Thanks!

vdemichev commented 1 year ago

Could it be that the mzML contains profile data, while DIA-NN reads centroided data from the .raw?

rolivella commented 1 year ago

The conversion tool automatically writes centroided mzML, because I did not specify this option:

-p, --noPeakPicking[=VALUE]
                             Don't use the peak picking provided by the native
                               Thermo library. By default peak picking is
                               enabled. Optional argument allows disabling peak
                               peaking only for selected MS levels and should
                               be a comma-separated list of integers (1,2,3)
                               and/or intervals (1-3), open-end intervals (1-)
                               are allowed

So by default, peak picking is enabled.
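One way to double-check whether a converted file is centroided is to count the PSI-MS controlled-vocabulary terms in the mzML itself: `MS:1000127` is "centroid spectrum" and `MS:1000128` is "profile spectrum". The sketch below is a rough line-based scan and not a full mzML parser; the function name and the sample fragment are hypothetical.

```python
import io
import re
from collections import Counter

CENTROID = "MS:1000127"  # PSI-MS term: centroid spectrum
PROFILE = "MS:1000128"   # PSI-MS term: profile spectrum
ACCESSION = re.compile(r'accession="(MS:100012[78])"')


def count_spectrum_types(lines):
    """Count centroid/profile cvParam occurrences in an iterable of mzML lines.

    The file-level <fileContent> element may carry the same terms, so the
    totals can be slightly higher than the number of spectra.
    """
    counts = Counter()
    for line in lines:
        for accession in ACCESSION.findall(line):
            counts["centroid" if accession == CENTROID else "profile"] += 1
    return counts


# Hypothetical fragment standing in for a real mzML file.
sample = io.StringIO(
    '<cvParam cvRef="MS" accession="MS:1000127" name="centroid spectrum" value=""/>\n'
    '<cvParam cvRef="MS" accession="MS:1000511" name="ms level" value="2"/>\n'
    '<cvParam cvRef="MS" accession="MS:1000128" name="profile spectrum" value=""/>\n'
)
print(dict(count_spectrum_types(sample)))
```

On a real file, passing an open text handle (for example `open("001.mzML")`) streams the file line by line without loading it fully into memory.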

vdemichev commented 1 year ago

Maybe it uses a different peak-picking algorithm than the one used by the Windows Thermo .dll?

Anyway, please also try with 1.8.1: version 1.8 had some bugs on Linux that were fixed in 1.8.1.

edunivers commented 1 year ago

Does DIA-NN expect centroided .raw files (MS1 and MS2)? Or does it not matter whether Thermo .raw files are centroid or profile?

vdemichev commented 1 year ago

When .raw files are profile, the Thermo .dll centroids them, so what DIA-NN gets is always centroided spectra.

ffullomicscouts commented 1 year ago

Hi,

I did some similar tests with .raw, .raw.dia and .mzML on Windows 10, Linux and Linux+Wine, since everything should run in a Docker container. There were also some differences between the different conversions.

Here are the result tables:

For the Linux+Wine setup, we first need to install Wine (done in the Dockerfile), then install DIA-NN and MSFileReader from Thermo (version 3.0 SP2).
NOTE: The .raw to .dia conversion step under Linux+Wine only works with --threads 1.
R1-3 are three different raw files. All Linux tests used the native Linux version of DIA-NN. The mzML files were created with ThermoRawFileParser.

R1
Mode              Precursors.Identified  Proteins.Identified  Total.Quantity
win thermo raw    65337                  6555                 6.17434e+11
linux dia         65337                  6555                 6.17434e+11
win dia           65337                  6555                 6.17434e+11
linux noise mzML  65074                  6583                 6.16577e+11
win noise mzML    65074                  6620                 6.16576e+11
linux mzML        65074                  6583                 6.16577e+11
win mzML          65074                  6620                 6.16576e+11

R2
Mode              Precursors.Identified  Proteins.Identified  Total.Quantity
win thermo raw    65312                  6628                 6.28592e+11
linux dia         65312                  6628                 6.28592e+11
win dia           65312                  6628                 6.28592e+11
linux noise mzML  65048                  6565                 6.26993e+11
win noise mzML    65052                  6568                 6.26995e+11
linux mzML        65048                  6565                 6.26993e+11
win mzML          65052                  6568                 6.26995e+11

R3
Mode              Precursors.Identified  Proteins.Identified  Total.Quantity
win thermo raw    65105                  6579                 6.41853e+11
linux dia         65105                  6579                 6.41853e+11
win dia           65105                  6579                 6.41853e+11
linux noise mzML  65122                  6569                 6.40738e+11
win noise mzML    65130                  6557                 6.4062e+11
linux mzML        65122                  6569                 6.40738e+11
win mzML          65130                  6557                 6.4062e+11
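
To put the size of these differences in perspective, the relative change in Precursors.Identified between the .raw and .mzML runs can be computed directly from the numbers in the tables above. This is only a small arithmetic sketch using the Windows rows; the helper name `relative_change` is hypothetical.

```python
# Precursors.Identified copied from the tables above (Windows runs only).
precursors = {
    "R1": {"win thermo raw": 65337, "win mzML": 65074},
    "R2": {"win thermo raw": 65312, "win mzML": 65052},
    "R3": {"win thermo raw": 65105, "win mzML": 65130},
}


def relative_change(reference, value):
    """Percent change of value relative to reference."""
    return 100.0 * (value - reference) / reference


changes = {
    run: relative_change(modes["win thermo raw"], modes["win mzML"])
    for run, modes in precursors.items()
}

for run, change in changes.items():
    print(f"{run}: mzML vs raw precursors {change:+.2f}%")
```

All three runs stay within half a percent in either direction, which is far smaller than the roughly 15% difference reported at the top of this thread.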