vdemichev / DiaNN

DIA-NN - a universal automated software suite for DIA proteomics data analysis.

DiaNN produces different results on same data as .wiff or .mzML input format #777

Open calizilla opened 1 year ago

calizilla commented 1 year ago

Hi,

I have .wiff data and converted this to .mzML with Proteowizard so that I could run DiaNN on Linux. The results of analysing the .mzML files on Linux (v 1.8.1) were quite different compared to the same samples processed from .wiff with DiaNN GUI (v 1.8).

These same .mzML files were then analysed with DiaNN GUI v 1.8, with the same result: much higher absolute values, and lower unique genes per sample, compared to when run as wiff input.

The same parameter settings were applied.

In the attached spreadsheet, you will find the unique genes matrices output for 5 samples run on v 1.8 GUI as wiff or mzML input.

The number of unique genes identified was lower using mzML (3027-3087 per sample for wiff, versus 2747 - 2769 for mzML).

The absolute values were higher for mzML - overall approx 38-fold, but there was a range.

The spreadsheet includes a table sorted by the standard deviation of the mzML/wiff value ratio across the 5 samples.

For our purposes we don’t think it matters that the ratio differs between proteins (it ranges from <1 to over 6000, with 75% within 32-42), because we are not using it to compare the relative abundance of different proteins. However, if the ratio is not consistent between samples for the same protein, we will get different answers for the effects of experimental parameters on that protein, depending on whether we use wiff or mzML files.
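The consistency check described above can be sketched in a few lines of Python. This is only an illustration: the function name `ratio_consistency`, the dictionary layout, and the toy numbers are assumptions, not the contents of the attached spreadsheet. For each gene it computes the mzML/wiff ratio per sample and then the coefficient of variation (CV) of that ratio across samples, so a gene with a constant ratio gets a CV of zero.

```python
from statistics import mean, stdev


def ratio_consistency(wiff, mzml):
    """Per-gene mzML/wiff ratios and their coefficient of variation (CV).

    wiff, mzml: dict mapping gene -> list of quantities, one per sample,
    in the same sample order. None or 0 marks a missing value.
    """
    out = {}
    for gene in wiff.keys() & mzml.keys():
        ratios = [m / w for w, m in zip(wiff[gene], mzml[gene])
                  if w and m]  # skip samples with a missing or zero value
        if len(ratios) < 2:
            continue  # a CV needs at least two samples
        mu = mean(ratios)
        out[gene] = {"mean_ratio": mu, "cv": stdev(ratios) / mu}
    return out


# Hypothetical toy values, NOT taken from the attached spreadsheet.
wiff = {"GENE_A": [100.0, 200.0, 300.0], "GENE_B": [10.0, 10.0, 10.0]}
mzml = {"GENE_A": [3800.0, 7600.0, 11400.0], "GENE_B": [380.0, 100.0, 900.0]}

result = ratio_consistency(wiff, mzml)
# List the least consistent genes first.
for gene in sorted(result, key=lambda g: result[g]["cv"], reverse=True):
    print(gene, round(result[gene]["mean_ratio"], 1), round(result[gene]["cv"], 2))
```

Genes with a high CV are the ones where the choice of input format would change conclusions about sample-to-sample effects, regardless of the overall scale difference.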

I attempted to run the Windows version of DiaNN on Linux through wine, hoping this would remove the need to convert wiff to mzML in the first place (a great saving in compute cost and disk space), but was unsuccessful in getting this running.

Comparison spreadsheet: mzML-wiff-5-sample-comparison.xlsx

Run logs: mzML_Test2_mzML_results_report.log.txt mzML_Test2_Wiff_Results_report.log.txt

The conversion from wiff to mzML was done on Linux using the below command:

singularity run --env WINEDEBUG=-all \
        -B /scratch/:/scratch \
        pwiz.sif wine msconvert \
        ${wiff} \
        --32 \
        --filter "peakPicking vendor msLevel=1-" \
        -o ${outdir} \
        --outfile ${sampleID}.mzML

I also ran the conversion using the GUI MSConvert with the same parameters and the output files were identical (save file paths of course).

Many thanks Cali

vdemichev commented 1 year ago

Hi Cali,

mzML and .wiff results should be comparable but not identical. In your case the difference between the analyses is quite significant. One thing contributing to that is that mass accs and scan window are not fixed. Can you please also try generating mzML using MSConvert GUI and specifically the settings recommended in the docs?

Also, on Linux please use exclusively 1.8.1, not 1.8.

Wine 6.8 or later is required.

Best, Vadim

calizilla commented 1 year ago

Hi Vadim,

Thank you for your fast response.

Can you please indicate what my msconvert command line should look like? As far as I was able to discern, I thought I had applied the same settings recommended in the docs (which appeared to be only '32 bit' and 'peakPicking vendor msLevel=1-' from the GUI screenshot shown on the docs).

Yes, I am using 1.8.1 on Linux (after encountering the multithreading bug :blush:)

We are using docker://biocontainers/diann:v1.8.1_cv1 on Linux. I tried various containers for wine, all > 6.8, but encountered various errors, including 'no exec filesystem'. I am attempting to run on NCI Gadi.

Regarding the scan windows, I am aware that leaving fixed or automatic makes a difference, and have accounted for different user options in our Linux workflow under development. For the 5-sample GUI runs, I would not expect this to detract from the 'apples to apples' comparison, because exactly the same command was run. From the log: "DIA-NN will optimise the mass accuracy automatically using the first run in the experiment. This is useful primarily for quick initial analyses, when it is not yet known which mass accuracy setting works best for a particular acquisition scheme." And then for the first sample (same in both runs), mass accuracy is estimated at 21.6483 (from mzML input) and 21.6518 (from wiff input). Happy to re-run with these values set to exactly the same, but given that they were fixed to such highly similar values in the 2 runs, I don't expect this will resolve the vast difference in results. Please correct me if I am wrong :smile:

Kind regards Cali

vdemichev commented 1 year ago

Hi Cali,

I just use the GUI; I would suggest trying it and seeing how the results look. I am not sure about the command line options.

Best, Vadim

calizilla commented 1 year ago

Hi Vadim,

Here are the settings I used on the GUI. My version has a clickable option to show the command line - I used this command line when running on Linux.

msconvert-GUI-params

I cannot find the 'titleMaker' filter shown on the MSConvertGUI screenshot in the DiaNN README on the MSConvert GUI, but I can see it on the MSConvert CLI guide. I added this to the CLI run:

singularity run --env WINEDEBUG=-all \
        -B /scratch/:/scratch \
        ${pwiz} wine msconvert \
        ${wiff} \
        --32 \
        --filter "peakPicking vendor msLevel=1-" \
        --filter "titleMaker <RunId>.<ScanNumber>.<ScanNumber>.<ChargeState> File:<SourcePath>, NativeID:<Id>" \
        -o ./ \
        --outfile ${sampleID}.mzML

I converted the same sample from wiff to mzML with the above command and also using the PC MSConvert GUI (without 'titleMaker' filter as I could not find it in the GUI).

The output files are identical except for the inclusion of "spectrum title" lines in the mzML file produced with the 'titleMaker' filter on CLI.

Is the inclusion of the "spectrum title" lines critical for DiaNN? This is a head of the grep output:

          <cvParam cvRef="MS" accession="MS:1000796" name="spectrum title" value="061221_WD_EXP1_1_1-Sample015.1.1. File:061221_WD_EXP1_1_1.wiff, NativeID:sample=1 period=1 cycle=1 experiment=2"/>
          <cvParam cvRef="MS" accession="MS:1000796" name="spectrum title" value="061221_WD_EXP1_1_1-Sample015.2.2. File:061221_WD_EXP1_1_1.wiff, NativeID:sample=1 period=1 cycle=1 experiment=7"/>
          <cvParam cvRef="MS" accession="MS:1000796" name="spectrum title" value="061221_WD_EXP1_1_1-Sample015.3.3. File:061221_WD_EXP1_1_1.wiff, NativeID:sample=1 period=1 cycle=1 experiment=8"/>
          <cvParam cvRef="MS" accession="MS:1000796" name="spectrum title" value="061221_WD_EXP1_1_1-Sample015.4.4. File:061221_WD_EXP1_1_1.wiff, NativeID:sample=1 period=1 cycle=1 experiment=9"/>
          <cvParam cvRef="MS" accession="MS:1000796" name="spectrum title" value="061221_WD_EXP1_1_1-Sample015.5.5. File:061221_WD_EXP1_1_1.wiff, NativeID:sample=1 period=1 cycle=1 experiment=10"/>
          <cvParam cvRef="MS" accession="MS:1000796" name="spectrum title" value="061221_WD_EXP1_1_1-Sample015.6.6. File:061221_WD_EXP1_1_1.wiff, NativeID:sample=1 period=1 cycle=1 experiment=11"/>
          <cvParam cvRef="MS" accession="MS:1000796" name="spectrum title" value="061221_WD_EXP1_1_1-Sample015.7.7. File:061221_WD_EXP1_1_1.wiff, NativeID:sample=1 period=1 cycle=1 experiment=12"/>
          <cvParam cvRef="MS" accession="MS:1000796" name="spectrum title" value="061221_WD_EXP1_1_1-Sample015.8.8. File:061221_WD_EXP1_1_1.wiff, NativeID:sample=1 period=1 cycle=1 experiment=13"/>
          <cvParam cvRef="MS" accession="MS:1000796" name="spectrum title" value="061221_WD_EXP1_1_1-Sample015.9.9. File:061221_WD_EXP1_1_1.wiff, NativeID:sample=1 period=1 cycle=1 experiment=14"/>
          <cvParam cvRef="MS" accession="MS:1000796" name="spectrum title" value="061221_WD_EXP1_1_1-Sample015.10.10. File:061221_WD_EXP1_1_1.wiff, NativeID:sample=1 period=1 cycle=1 experiment=15"/>
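The "identical except for spectrum title lines" comparison can be reproduced with a small Python sketch along the following lines. The function names are hypothetical, and the extra index-related markers are an assumption that only applies if the files are indexed mzML, where byte offsets and the file checksum also shift once title lines are added. Differences in recorded file paths are not handled and would still be reported.

```python
# Lines that are expected to differ between the two conversions.
# The index-related markers are an assumption: they only matter for
# indexed mzML, where byte offsets and the checksum change as well.
SKIP_MARKERS = (
    'name="spectrum title"',
    "<offset ",
    "<indexListOffset>",
    "<fileChecksum>",
)


def comparable_lines(path):
    """Return the stripped lines of a file, minus the lines we expect to differ."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle
                if not any(marker in line for marker in SKIP_MARKERS)]


def same_apart_from_titles(path_a, path_b):
    """True if both files match line by line once the skipped lines are dropped."""
    return comparable_lines(path_a) == comparable_lines(path_b)


# Usage with hypothetical paths:
# same_apart_from_titles("gui/sample.mzML", "cli_titlemaker/sample.mzML")
```

Reading whole files into memory keeps the sketch short; for very large mzML files a streaming comparison would be more appropriate.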

Kind regards, Cali

calizilla commented 1 year ago

Hi Vadim,

We have repeated the 5-sample comparison of wiff vs mzML input, setting the same fixed values for scan window, mass accuracy and MS1 mass accuracy by adding the below parameters:

 --window 8 --mass-acc 26.6518 --mass-acc-ms1 20 

The results are very similar to those attached to the original issue message: unique genes per sample are fewer for mzML input, values are much higher for mzML input, and the ratio of mzML to wiff values is not consistent for many genes.

5-sample-comparison-fixed-params.xlsx

We also ran the 5-sample comparison using the new mzML files created with the above command including 'titleMaker', and the output was identical.

Now that we have confirmed that the conversion to mzML has exactly followed the recommended parameters in the DiaNN user guide, and have run the analyses with fixed scan window and mass accuracy, are you able to investigate further? We can share the 5 samples with you as wiff and mzML.

Kind regards, Cali

calizilla commented 1 year ago

Hi Vadim,

After using Proteowizard SeeMS to compare the wiff and converted mzML, it's evident that the 'data points' values in the mzML are much lower than for the wiff - please see the screenshot and data for one sample here.
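On the mzML side, the per-spectrum number of data points can also be read without a viewer, because every `spectrum` element in mzML carries a `defaultArrayLength` attribute. The sketch below uses only the Python standard library; the function names are hypothetical, and it cannot read .wiff files, so it is only useful for comparing mzML conversions against each other (for example, with and without vendor peak picking) or against the numbers SeeMS shows for the wiff.

```python
import xml.etree.ElementTree as ET


def data_points_per_spectrum(mzml_path):
    """Return {spectrum id: defaultArrayLength} for every spectrum in an mzML file."""
    counts = {}
    for _event, elem in ET.iterparse(mzml_path, events=("end",)):
        # Drop the XML namespace so the check works with or without one.
        if elem.tag.rsplit("}", 1)[-1] == "spectrum":
            counts[elem.get("id")] = int(elem.get("defaultArrayLength", 0))
            elem.clear()  # free memory; mzML files can be large
    return counts


def summarise(counts):
    """Number of spectra, total data points, and mean data points per spectrum."""
    total = sum(counts.values())
    return {
        "spectra": len(counts),
        "total_points": total,
        "mean_points": total / len(counts) if counts else 0.0,
    }


# Usage with a hypothetical path:
# print(summarise(data_points_per_spectrum("sample.mzML")))
```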

Matt Chambers has queried whether Dia-NN does peak picking differently when run directly on .wiff input. Could you please clarify if this is the case? And what parameters we can apply - either to file conversion or Dia-NN - to ensure the output of mzML-based Dia-NN analysis produces consistent results with wiff-based analysis?

Many thanks Cali