Open rolivella opened 1 year ago
The mzML contains profile data, while DIA-NN reads centroided from .raw?
The conversion tool automatically converts to centroided mzML because I did not specify this option:
-p, --noPeakPicking[=VALUE]
Don't use the peak picking provided by the native
Thermo library. By default peak picking is
enabled. Optional argument allows disabling peak
peaking only for selected MS levels and should
be a comma-separated list of integers (1,2,3)
and/or intervals (1-3), open-end intervals (1-)
are allowed
So by default peak picking is enabled.
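To make the VALUE syntax quoted above concrete, here is a minimal sketch of how such a level list could be expanded. It is not ThermoRawFileParser's own parser; the function name `parse_ms_levels` and the `max_level` cap used to close open-ended intervals are assumptions made for illustration.

```python
def parse_ms_levels(spec, max_level=10):
    """Expand a spec such as "1,3-4" or "2-" into a sorted list of MS levels.

    max_level is a hypothetical cap used only to close open-ended
    intervals like "2-"; the real tool may handle them differently.
    """
    levels = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, _, end = part.partition("-")
            first = int(start)
            last = int(end) if end else max_level  # "1-" means "1 and above"
            levels.update(range(first, last + 1))
        else:
            levels.add(int(part))
    return sorted(levels)
```

With this reading, `-p=1` would disable peak picking for MS1 only, while `-p=1-` would disable it for every level.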
Maybe a different algorithm than used by the Windows Thermo .dll?
Anyway, please also try with 1.8.1; 1.8 had some bugs on Linux that were fixed in 1.8.1.
Is DIA-NN expecting centroided .raw files (MS1 and MS2), or does it not care whether Thermo .raw files are centroid or profile?
When .raw files are profile, the Thermo .dll centroids them, so what DIA-NN gets is always centroided spectra.
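One way to check what a converted mzML actually contains is to count spectra by MS level and representation. The sketch below relies on the PSI-MS controlled vocabulary terms MS:1000127 (centroid spectrum), MS:1000128 (profile spectrum) and MS:1000511 (ms level). It assumes these cvParam elements are direct children of each spectrum element; files that put them in referenceable parameter groups would be reported as "unknown". The function name is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

CENTROID = "MS:1000127"  # PSI-MS term "centroid spectrum"
PROFILE = "MS:1000128"   # PSI-MS term "profile spectrum"
MS_LEVEL = "MS:1000511"  # PSI-MS term "ms level"


def local_name(tag):
    """Drop the XML namespace prefix, e.g. '{ns}spectrum' -> 'spectrum'."""
    return tag.rsplit("}", 1)[-1]


def count_spectrum_types(source):
    """Count spectra per (MS level, representation) in an mzML file.

    source may be a file path or a binary file-like object.
    """
    counts = Counter()
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "spectrum":
            continue
        level, kind = None, "unknown"
        for child in elem:  # only direct children are inspected
            if local_name(child.tag) != "cvParam":
                continue
            accession = child.get("accession")
            if accession == MS_LEVEL:
                level = int(child.get("value"))
            elif accession == CENTROID:
                kind = "centroid"
            elif accession == PROFILE:
                kind = "profile"
        counts[(level, kind)] += 1
        elem.clear()  # release peak data to keep memory use low
    return counts
```

Calling `count_spectrum_types("001.mzML")` returns a counter keyed by tuples such as `(1, "profile")` or `(2, "centroid")`, which shows directly whether MS1 and MS2 were peak-picked during conversion.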
Hi,
I did some similar tests with .raw, .raw.dia and .mzML, running on Windows 10, Linux and Linux+Wine, since everything should be run in a Docker container.
There were also some differences between the different conversions.
Here are the result tables:
For the Linux+Wine setup we first need to install Wine (done in the Dockerfile), then install DIA-NN and MSFileReader from Thermo (version 3.0 SP2).
NOTE: The .raw to .dia conversion step under Linux+Wine only works with --threads 1.
R1-R3 are three different raw files.
All Linux tests used the native Linux version of DIA-NN.
The mzML files were created with ThermoRawFileParser.
R1

| Mode | Precursors.Identified | Proteins.Identified | Total.Quantity |
|---|---|---|---|
| win thermo raw | 65337 | 6555 | 6.17434e+11 |
| linux dia | 65337 | 6555 | 6.17434e+11 |
| win dia | 65337 | 6555 | 6.17434e+11 |
| linux noise mzML | 65074 | 6583 | 6.16577e+11 |
| win noise mzML | 65074 | 6620 | 6.16576e+11 |
| linux mzML | 65074 | 6583 | 6.16577e+11 |
| win mzML | 65074 | 6620 | 6.16576e+11 |

R2

| Mode | Precursors.Identified | Proteins.Identified | Total.Quantity |
|---|---|---|---|
| win thermo raw | 65312 | 6628 | 6.28592e+11 |
| linux dia | 65312 | 6628 | 6.28592e+11 |
| win dia | 65312 | 6628 | 6.28592e+11 |
| linux noise mzML | 65048 | 6565 | 6.26993e+11 |
| win noise mzML | 65052 | 6568 | 6.26995e+11 |
| linux mzML | 65048 | 6565 | 6.26993e+11 |
| win mzML | 65052 | 6568 | 6.26995e+11 |

R3

| Mode | Precursors.Identified | Proteins.Identified | Total.Quantity |
|---|---|---|---|
| win thermo raw | 65105 | 6579 | 6.41853e+11 |
| linux dia | 65105 | 6579 | 6.41853e+11 |
| win dia | 65105 | 6579 | 6.41853e+11 |
| linux noise mzML | 65122 | 6569 | 6.40738e+11 |
| win noise mzML | 65130 | 6557 | 6.4062e+11 |
| linux mzML | 65122 | 6569 | 6.40738e+11 |
| win mzML | 65130 | 6557 | 6.4062e+11 |

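To put the table values in proportion, the following sketch computes the relative change from "win thermo raw" to "linux mzML" for each file. The numbers are copied from the tables above; the helper names are only for illustration.

```python
# (Precursors.Identified, Proteins.Identified) copied from the tables above.
results = {
    "R1": {"raw": (65337, 6555), "mzML": (65074, 6583)},
    "R2": {"raw": (65312, 6628), "mzML": (65048, 6565)},
    "R3": {"raw": (65105, 6579), "mzML": (65122, 6569)},
}


def percent_change(reference, value):
    """Relative change of value with respect to reference, in percent."""
    return 100.0 * (value - reference) / reference


def summarize(table):
    """Return {run: (precursor change %, protein change %)} for raw -> mzML."""
    summary = {}
    for run, modes in table.items():
        raw_prec, raw_prot = modes["raw"]
        mz_prec, mz_prot = modes["mzML"]
        summary[run] = (
            percent_change(raw_prec, mz_prec),
            percent_change(raw_prot, mz_prot),
        )
    return summary


for run, (d_prec, d_prot) in summarize(results).items():
    print(f"{run}: precursors {d_prec:+.2f}%, proteins {d_prot:+.2f}%")
```

For these three files every difference stays below 1% in absolute value, far smaller than the roughly 15% reported in the original post of this thread.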
Hi again!
We tested DIA-NN 1.8 (from the command line on Linux) with the same file and observed different results depending on whether we used .mzML or .raw files. For instance, this is the number of precursors identified and the sequence overlap for 001.raw compared to 001.mzML:
Do you know the reason for this variance?
We also observed that we systematically get around 15% more identifications when using .mzML.
The files were converted with https://github.com/compomics/ThermoRawFileParser using this command line:
ThermoRawFileParser.sh -i=001.raw -f=2 -o ./
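Since the peak-picking default matters for this comparison, a small wrapper can make the choice explicit when scripting conversions. This is a sketch that only assembles the argument list and runs nothing. It uses the flags that appear in this thread (`-i`, `-f`, `-o`, and `--noPeakPicking[=VALUE]`); the function name and its parameters are hypothetical.

```python
def build_convert_command(raw_path, out_dir="./", no_peak_picking=None):
    """Return the ThermoRawFileParser argument list; nothing is executed.

    no_peak_picking:
      None  -> keep the default (peak picking enabled, centroided output)
      True  -> disable peak picking for all MS levels
      str   -> disable it only for the given levels, e.g. "1" or "1-"
    """
    cmd = ["ThermoRawFileParser.sh", f"-i={raw_path}", "-f=2", "-o", out_dir]
    if no_peak_picking is True:
        cmd.append("--noPeakPicking")
    elif no_peak_picking:
        cmd.append(f"--noPeakPicking={no_peak_picking}")
    return cmd


print(" ".join(build_convert_command("001.raw")))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the script is on the PATH.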
If you want I can share the original files by private message.
Thanks!