vdemichev / DiaNN

DIA-NN - a universal automated software suite for DIA proteomics data analysis.

cf. maxquant latest versions #682

Open animesh opened 1 year ago

animesh commented 1 year ago

I am trying to compare MaxQuant v2.4 (parameters in mqpar..xml.txt) with DIA-NN v1.8.1:

diann.exe --f "C:\Users\animeshs\230421_plasma\New\DIA\2320424_plasma_pCA_5dia_S1-B5_1_4359.d
" --f "C:\Users\animeshs\230421_plasma\New\DIA\2320424_plasma_pCA_6dia_S1-B6_1_4360.d
" --f "C:\Users\animeshs\230421_plasma\New\DIA\2320424_plasma_pCA_7dia_S1-B7_1_4361.d
" --f "C:\Users\animeshs\230421_plasma\New\DIA\2320424_plasma_pCA_8dia_S1-B8_1_4362.d
" --lib "C:\Users\animeshs\DIA-NN\report-lib.predicted.speclib" --threads 6 --verbose 1 --out "C:\Users\animeshs\230421_plasma\New\DIA\DIANN\report.tsv" --qvalue 0.01 --matrices  --out-lib "C:\Users\animeshs\230421_plasma\New\DIA\DIANN\report-lib.tsv" --gen-spec-lib --prosit --fasta "C:\Users\animeshs\MaxQuant_v2.3.1.0\MaxQuant_v2.3.1.0\uniprot-human-iso-feb23.fasta" --met-excision --cut K*,R* --missed-cleavages 2 --min-pep-len 7 --max-pep-len 30 --min-pr-mz 300 --max-pr-mz 1800 --min-pr-charge 1 --max-pr-charge 4 --unimod4 --var-mods 1 --var-mod UniMod:35,15.994915,M --var-mod UniMod:1,42.010565,*n --monitor-mod UniMod:1 --reanalyse --relaxed-prot-inf --smart-profiling --pg-level 0 --peak-center --no-ifs-removal 
DIA-NN 1.8.1 (Data-Independent Acquisition by Neural Networks)
Compiled on Apr 14 2022 15:31:19
Current date and time: Wed Apr 26 11:16:09 2023
CPU: GenuineIntel Intel(R) Xeon(R) CPU E5-2643 v3 @ 3.40GHz
SIMD instructions: AVX AVX2 FMA SSE4.1 SSE4.2 
Logical CPU cores: 24
Thread number set to 6
Output will be filtered at 0.01 FDR
Precursor/protein x samples expression level matrices will be saved along with the main report
A spectral library will be generated
N-terminal methionine excision enabled
In silico digest will involve cuts at K*,R*
Maximum number of missed cleavages set to 2
Min peptide length set to 7
Max peptide length set to 30
Min precursor m/z set to 300
Max precursor m/z set to 1800
Min precursor charge set to 1
Max precursor charge set to 4
Cysteine carbamidomethylation enabled as a fixed modification
Maximum number of variable modifications set to 1
Modification UniMod:35 with mass delta 15.9949 at M will be considered as variable
Modification UniMod:1 with mass delta 42.0106 at *n will be considered as variable
A spectral library will be created from the DIA runs and used to reanalyse them; .quant files will only be saved to disk during the first step
Highly heuristic protein grouping will be used, to reduce the number of protein groups obtained; this mode is recommended for benchmarking protein ID numbers; use with caution for anything else
When generating a spectral library, in silico predicted spectra will be retained if deemed more reliable than experimental ones
Implicit protein grouping: isoform IDs; this determines which peptides are considered 'proteotypic' and thus affects protein FDR calculation
Fixed-width center of each elution peak will be used for quantification
Interference removal from fragment elution curves disabled
DIA-NN will optimise the mass accuracy automatically using the first run in the experiment. This is useful primarily for quick initial analyses, when it is not yet known which mass accuracy setting works best for a particular acquisition scheme.
The following variable modifications will be scored: UniMod:1 
Unless the spectral library specified was created by this version of DIA-NN, it's strongly recommended to specify a FASTA database and use the 'Reannotate' function to allow DIA-NN to identify peptides which can originate from the N/C terminus of the protein: otherwise site localisation might not work properly for modifications of the protein N-terminus or for modifications which do not allow enzymatic cleavage after the modified residue

4 files will be processed
[0:00] Loading spectral library C:\Users\animeshs\DIA-NN\report-lib.predicted.speclib
[0:28] Library annotated with sequence database(s): C:\Users\animeshs\MaxQuant_v2.3.1.0\MaxQuant_v2.3.1.0\uniprot-human-iso-feb23.fasta
[0:33] Spectral library loaded: 103467 protein isoforms, 180207 protein groups and 11578326 precursors in 3601316 elution groups.
[0:33] Loading protein annotations from FASTA C:\Users\animeshs\MaxQuant_v2.3.1.0\MaxQuant_v2.3.1.0\uniprot-human-iso-feb23.fasta
[0:37] Annotating library proteins with information from the FASTA database
[0:37] Gene names missing for some isoforms
[0:37] Library contains 81485 proteins, and 20518 genes
[0:48] Initialising library

[0:59] First pass: generating a spectral library from DIA data
[0:59] File #1/4
[0:59] Loading run C:\Users\animeshs\230421_plasma\New\DIA\2320424_plasma_pCA_5dia_S1-B5_1_4359.d
For most diaPASEF datasets it is better to manually fix both the MS1 and MS2 mass accuracies to values in the range 10-15 ppm.
[2:28] 10726618 library precursors are potentially detectable
[2:29] Processing...
[362:40] RT window set to 1.67373
[362:40] Ion mobility window set to 0.0503942
[362:40] Peak width: 7.47788
[362:40] Scan window radius set to 16
[362:43] Recommended MS1 mass accuracy setting: 14.1222 ppm
[1069:33] Optimised mass accuracy: 8.80726 ppm
[1261:38] Removing low confidence identifications
[1261:39] Searching PTM decoys
[1263:44] Removing interfering precursors
[1263:48] Training neural networks: 1264 targets, 943 decoys
[1263:50] Number of IDs at 0.01 FDR: 647
[1263:50] Calculating protein q-values
[1263:51] Number of protein isoforms identified at 1% FDR: 60 (precursor-level), 0 (protein-level) (inference performed using proteotypic peptides only)
[1263:52] Quantification
[1263:54] Quantification information saved to C:\Users\animeshs\230421_plasma\New\DIA\2320424_plasma_pCA_5dia_S1-B5_1_4359.d.quant.

[1263:59] File #2/4
[1263:59] Loading run C:\Users\animeshs\230421_plasma\New\DIA\2320424_plasma_pCA_6dia_S1-B6_1_4360.d
[1265:20] 10726618 library precursors are potentially detectable
[1265:22] Processing...
[1691:06] RT window set to 1.5699
[1691:06] Ion mobility window set to 0.0524042
[1691:10] Recommended MS1 mass accuracy setting: 14.1592 ppm
[1899:33] Removing low confidence identifications
[1899:34] Searching PTM decoys
[1901:38] Removing interfering precursors
[1901:41] Training neural networks: 2706 targets, 1512 decoys
[1901:43] Number of IDs at 0.01 FDR: 748
[1901:43] Calculating protein q-values
[1901:45] Number of protein isoforms identified at 1% FDR: 73 (precursor-level), 0 (protein-level) (inference performed using proteotypic peptides only)
[1901:45] Quantification
[1901:48] Quantification information saved to C:\Users\animeshs\230421_plasma\New\DIA\2320424_plasma_pCA_6dia_S1-B6_1_4360.d.quant.

[1901:52] File #3/4
[1901:52] Loading run C:\Users\animeshs\230421_plasma\New\DIA\2320424_plasma_pCA_7dia_S1-B7_1_4361.d
[1903:18] 10726618 library precursors are potentially detectable
[1903:19] Processing...
[2259:51] RT window set to 1.6635
[2259:51] Ion mobility window set to 0.051186
[2259:53] Recommended MS1 mass accuracy setting: 14.5929 ppm
[2473:18] Removing low confidence identifications
[2473:19] Searching PTM decoys
[2475:33] Removing interfering precursors
[2475:36] Training neural networks: 2286 targets, 1316 decoys
[2475:39] Number of IDs at 0.01 FDR: 771
[2475:39] Calculating protein q-values
[2475:40] Number of protein isoforms identified at 1% FDR: 81 (precursor-level), 0 (protein-level) (inference performed using proteotypic peptides only)
[2475:40] Quantification
[2475:43] Quantification information saved to C:\Users\animeshs\230421_plasma\New\DIA\2320424_plasma_pCA_7dia_S1-B7_1_4361.d.quant.

[2475:46] File #4/4
[2475:46] Loading run C:\Users\animeshs\230421_plasma\New\DIA\2320424_plasma_pCA_8dia_S1-B8_1_4362.d
[2477:17] 10726618 library precursors are potentially detectable
[2477:18] Processing...
[2885:59] RT window set to 1.56287
[2885:59] Ion mobility window set to 0.0513684
[2886:02] Recommended MS1 mass accuracy setting: 14.0996 ppm
[3085:58] Removing low confidence identifications
[3085:59] Searching PTM decoys
[3088:10] Removing interfering precursors
[3088:13] Training neural networks: 2136 targets, 1262 decoys
[3088:16] Number of IDs at 0.01 FDR: 722
[3088:16] Calculating protein q-values
[3088:17] Number of protein isoforms identified at 1% FDR: 67 (precursor-level), 0 (protein-level) (inference performed using proteotypic peptides only)
[3088:17] Quantification
[3088:20] Quantification information saved to C:\Users\animeshs\230421_plasma\New\DIA\2320424_plasma_pCA_8dia_S1-B8_1_4362.d.quant.

[3088:23] Cross-run analysis
[3088:23] Reading quantification information: 4 files
[3088:23] Quantifying peptides
[3088:24] Assembling protein groups
[3088:38] Quantifying proteins
[3088:39] Calculating q-values for protein and gene groups
[3088:39] Calculating global q-values for protein and gene groups
[3088:40] Writing report
[3088:40] Report saved to C:\Users\animeshs\230421_plasma\New\DIA\DIANN\report-first-pass.tsv.
[3088:40] Saving precursor levels matrix
[3088:40] Precursor levels matrix (1% precursor and protein group FDR) saved to C:\Users\animeshs\230421_plasma\New\DIA\DIANN\report-first-pass.pr_matrix.tsv.
[3088:40] Saving protein group levels matrix
[3088:40] Protein group levels matrix (1% precursor FDR and protein group FDR) saved to C:\Users\animeshs\230421_plasma\New\DIA\DIANN\report-first-pass.pg_matrix.tsv.
[3088:40] Saving gene group levels matrix
[3088:40] Gene groups levels matrix (1% precursor FDR and protein group FDR) saved to C:\Users\animeshs\230421_plasma\New\DIA\DIANN\report-first-pass.gg_matrix.tsv.
[3088:40] Saving unique genes levels matrix
[3088:40] Unique genes levels matrix (1% precursor FDR and protein group FDR) saved to C:\Users\animeshs\230421_plasma\New\DIA\DIANN\report-first-pass.unique_genes_matrix.tsv.
[3088:40] Stats report saved to C:\Users\animeshs\230421_plasma\New\DIA\DIANN\report-first-pass.stats.tsv
[3088:40] Generating spectral library:
[3088:40] 1018 precursors passing the FDR threshold are to be extracted
[3088:40] Loading run C:\Users\animeshs\230421_plasma\New\DIA\2320424_plasma_pCA_5dia_S1-B5_1_4359.d
[3090:09] 10726618 library precursors are potentially detectable
[3090:11] 289 spectra added to the library
[3090:12] Loading run C:\Users\animeshs\230421_plasma\New\DIA\2320424_plasma_pCA_6dia_S1-B6_1_4360.d
[3091:42] 10726618 library precursors are potentially detectable
[3091:44] 122 spectra added to the library
[3091:45] Loading run C:\Users\animeshs\230421_plasma\New\DIA\2320424_plasma_pCA_7dia_S1-B7_1_4361.d
[3093:20] 10726618 library precursors are potentially detectable
[3093:22] 70 spectra added to the library
[3093:23] Loading run C:\Users\animeshs\230421_plasma\New\DIA\2320424_plasma_pCA_8dia_S1-B8_1_4362.d
[3094:53] 10726618 library precursors are potentially detectable
[3094:55] 47 spectra added to the library
[3094:57] Saving spectral library to C:\Users\animeshs\230421_plasma\New\DIA\DIANN\report-lib.tsv
[3094:57] 1018 precursors saved
[3094:57] Loading the generated library and saving it in the .speclib format
[3094:57] Loading spectral library C:\Users\animeshs\230421_plasma\New\DIA\DIANN\report-lib.tsv
[3094:57] Spectral library loaded: 1261 protein isoforms, 451 protein groups and 1018 precursors in 868 elution groups.
[3094:57] Loading protein annotations from FASTA C:\Users\animeshs\MaxQuant_v2.3.1.0\MaxQuant_v2.3.1.0\uniprot-human-iso-feb23.fasta
[3095:02] Gene names missing for some isoforms
[3095:02] Library contains 952 proteins, and 378 genes
[3095:02] Saving the library to C:\Users\animeshs\230421_plasma\New\DIA\DIANN\report-lib.tsv.speclib

[3095:13] Second pass: using the newly created spectral library to reanalyse the data
[3095:13] File #1/4
[3095:13] Loading run C:\Users\animeshs\230421_plasma\New\DIA\2320424_plasma_pCA_5dia_S1-B5_1_4359.d
[3096:38] 1018 library precursors are potentially detectable
[3096:38] Processing...
[3096:43] RT window set to 0.461055
[3096:43] Ion mobility window set to 0.0151745
[3096:43] Recommended MS1 mass accuracy setting: 12.0146 ppm
[3096:43] Removing low confidence identifications
[3096:43] Searching PTM decoys
[3096:43] Removing interfering precursors
[3096:43] Training neural networks: 997 targets, 738 decoys
[3096:44] Number of IDs at 0.01 FDR: 911
[3096:44] Calculating protein q-values
[3096:44] Number of protein isoforms identified at 1% FDR: 84 (precursor-level), 0 (protein-level) (inference performed using proteotypic peptides only)
[3096:44] Quantification

[3096:47] File #2/4
[3096:47] Loading run C:\Users\animeshs\230421_plasma\New\DIA\2320424_plasma_pCA_6dia_S1-B6_1_4360.d
[3098:11] 1018 library precursors are potentially detectable
[3098:11] Processing...
[3098:15] RT window set to 0.45371
[3098:15] Ion mobility window set to 0.0145698
[3098:15] Recommended MS1 mass accuracy setting: 13.118 ppm
[3098:15] Removing low confidence identifications
[3098:15] Searching PTM decoys
[3098:15] Removing interfering precursors
[3098:15] Training neural networks: 1003 targets, 743 decoys
[3098:17] Number of IDs at 0.01 FDR: 933
[3098:17] Calculating protein q-values
[3098:17] Number of protein isoforms identified at 1% FDR: 86 (precursor-level), 0 (protein-level) (inference performed using proteotypic peptides only)
[3098:17] Quantification

[3098:19] File #3/4
[3098:19] Loading run C:\Users\animeshs\230421_plasma\New\DIA\2320424_plasma_pCA_7dia_S1-B7_1_4361.d
[3099:47] 1018 library precursors are potentially detectable
[3099:47] Processing...
[3099:52] RT window set to 0.453016
[3099:52] Ion mobility window set to 0.0153359
[3099:52] Recommended MS1 mass accuracy setting: 12.7581 ppm
[3099:52] Removing low confidence identifications
[3099:52] Searching PTM decoys
[3099:52] Removing interfering precursors
[3099:52] Training neural networks: 1011 targets, 756 decoys
[3099:53] Number of IDs at 0.01 FDR: 963
[3099:53] Calculating protein q-values
[3099:54] Number of protein isoforms identified at 1% FDR: 88 (precursor-level), 0 (protein-level) (inference performed using proteotypic peptides only)
[3099:54] Quantification

[3099:56] File #4/4
[3099:56] Loading run C:\Users\animeshs\230421_plasma\New\DIA\2320424_plasma_pCA_8dia_S1-B8_1_4362.d
[3101:21] 1018 library precursors are potentially detectable
[3101:21] Processing...
[3101:25] RT window set to 0.452874
[3101:25] Ion mobility window set to 0.0147734
[3101:25] Recommended MS1 mass accuracy setting: 13.3857 ppm
[3101:26] Removing low confidence identifications
[3101:26] Searching PTM decoys
[3101:26] Removing interfering precursors
[3101:26] Training neural networks: 1004 targets, 740 decoys
[3101:27] Number of IDs at 0.01 FDR: 953
[3101:27] Calculating protein q-values
[3101:27] Number of protein isoforms identified at 1% FDR: 89 (precursor-level), 0 (protein-level) (inference performed using proteotypic peptides only)
[3101:27] Quantification

[3101:30] Cross-run analysis
[3101:30] Reading quantification information: 4 files
[3101:30] Quantifying peptides
[3101:31] Quantifying proteins
[3101:32] Calculating q-values for protein and gene groups
[3101:33] Calculating global q-values for protein and gene groups
[3101:33] Writing report
[3101:33] Report saved to C:\Users\animeshs\230421_plasma\New\DIA\DIANN\report.tsv.
[3101:33] Saving precursor levels matrix
[3101:33] Precursor levels matrix (1% precursor and protein group FDR) saved to C:\Users\animeshs\230421_plasma\New\DIA\DIANN\report.pr_matrix.tsv.
[3101:33] Saving protein group levels matrix
[3101:33] Protein group levels matrix (1% precursor FDR and protein group FDR) saved to C:\Users\animeshs\230421_plasma\New\DIA\DIANN\report.pg_matrix.tsv.
[3101:33] Saving gene group levels matrix
[3101:33] Gene groups levels matrix (1% precursor FDR and protein group FDR) saved to C:\Users\animeshs\230421_plasma\New\DIA\DIANN\report.gg_matrix.tsv.
[3101:33] Saving unique genes levels matrix
[3101:33] Unique genes levels matrix (1% precursor FDR and protein group FDR) saved to C:\Users\animeshs\230421_plasma\New\DIA\DIANN\report.unique_genes_matrix.tsv.
[3101:33] Stats report saved to C:\Users\animeshs\230421_plasma\New\DIA\DIANN\report.stats.tsv
[3101:33] Log saved to C:\Users\animeshs\230421_plasma\New\DIA\DIANN\report.log.txt
Finished

DIA-NN exited
DIA-NN-plotter.exe "C:\Users\animeshs\230421_plasma\New\DIA\DIANN\report.stats.tsv" "C:\Users\animeshs\230421_plasma\New\DIA\DIANN\report.tsv" "C:\Users\animeshs\230421_plasma\New\DIA\DIANN\report.pdf"
PDF report will be generated in the background
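
The per-file precursor counts reported in the log above can be pulled out programmatically to compare the first pass with the second pass. The following is a minimal Python sketch, assuming the log keeps the exact line wording shown above; the function name `ids_per_pass` is made up for illustration and is not part of DIA-NN.

```python
import re

# Matches log lines such as "[1263:50] Number of IDs at 0.01 FDR: 647".
ID_LINE = re.compile(r"^\[\d+:\d+\] Number of IDs at 0\.01 FDR: (\d+)")
# Matches the lines that announce the start of each pass.
PASS_LINE = re.compile(r"^\[\d+:\d+\] (First|Second) pass:")


def ids_per_pass(log_text):
    """Return {pass name: [precursor ID counts in file order]}."""
    counts = {}
    current = None
    for line in log_text.splitlines():
        m = PASS_LINE.match(line)
        if m:
            current = m.group(1).lower()
            counts[current] = []
            continue
        m = ID_LINE.match(line)
        if m and current is not None:
            counts[current].append(int(m.group(1)))
    return counts


# Lines copied from the log above (other lines omitted for brevity).
sample = """\
[0:59] First pass: generating a spectral library from DIA data
[1263:50] Number of IDs at 0.01 FDR: 647
[1901:43] Number of IDs at 0.01 FDR: 748
[2475:39] Number of IDs at 0.01 FDR: 771
[3088:16] Number of IDs at 0.01 FDR: 722
[3095:13] Second pass: using the newly created spectral library to reanalyse the data
[3096:44] Number of IDs at 0.01 FDR: 911
[3098:17] Number of IDs at 0.01 FDR: 933
[3099:53] Number of IDs at 0.01 FDR: 963
[3101:27] Number of IDs at 0.01 FDR: 953
"""
result = ids_per_pass(sample)
print(result)
```

In practice the text would be read from the saved log file (report.log.txt in this run) rather than from an inline string.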

I see that about 185 protein groups are quantified by MaxQuant (proteinGroups.txt), with the following scatter plot and R^2 (in blue): [image]

while DIA-NN quantifies 251 (report.pg_matrix.tsv.txt), with R^2: [image]

So it looks like DIA-NN gives more quantifications, but they are more dispersed, especially at the lower end of the distribution?

When I compare the IDs (probably not a fair comparison, since I used the defaults for inference):

N: Count    T: Datasets
138         Dataset 1
72          Dataset 2
113         Dataset 1;Dataset 2

(comparePGs.txt), with the following scatter plot for the common IDs: [image]

So I am not sure what these unique IDs are or what causes the discrepancy. Essentially, what should I trust? Proteins quantified in both with similar values? Or is there a better way to compare?
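
The comparison described above (IDs found by only one tool, IDs found by both, and the agreement of quantities for the shared IDs) is essentially set arithmetic plus a correlation. Below is a minimal Python sketch under stated assumptions: each tool's output has already been reduced to a dictionary mapping a protein group ID to one intensity, IDs are matched by exact string, and all names and values are toy placeholders rather than data from this thread. Exact string matching is itself a possible source of "unique" IDs, since the same protein can end up in differently composed groups in each tool.

```python
import math


def r_squared(xs, ys):
    """Squared Pearson correlation; NaN if it cannot be computed."""
    n = len(xs)
    if n < 2:
        return float("nan")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy * sxy / (sxx * syy)


def compare_groups(a, b):
    """Return (n only in a, n only in b, n shared, R^2 of log2 values for shared IDs)."""
    shared = sorted(set(a) & set(b))
    only_a = set(a) - set(b)
    only_b = set(b) - set(a)
    xs = [math.log2(a[k]) for k in shared]
    ys = [math.log2(b[k]) for k in shared]
    return len(only_a), len(only_b), len(shared), r_squared(xs, ys)


# Toy intensities (hypothetical IDs and values).
tool_a = {"PG1": 1024.0, "PG2": 2048.0, "PG3": 4096.0, "PG4": 512.0}
tool_b = {"PG1": 2048.0, "PG2": 4096.0, "PG3": 8192.0, "PG5": 100.0}
n_only_a, n_only_b, n_shared, r2 = compare_groups(tool_a, tool_b)
print(n_only_a, n_only_b, n_shared, r2)
```

Computing R^2 on log-transformed values keeps a few very abundant proteins from dominating the fit, which matters for plasma samples with a wide dynamic range.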

vdemichev commented 1 year ago

Cannot really comment on MaxQuant. In general, if software A detects more peptides/proteins than software B, those extra IDs will likely be low-abundant, which means their quantities will be noisy.
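
One way to check the point about low-abundant IDs being noisy on one's own data is to compare the coefficient of variation (CV) across replicate runs for the lower-abundance and higher-abundance halves of the quantified proteins. The Python sketch below is only an illustration: it assumes quantities are already collected per protein across runs, and the IDs and numbers are invented toy values.

```python
import statistics


def cv(values):
    """Coefficient of variation: sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)


def median_cv_by_abundance(quant):
    """quant maps protein ID -> list of quantities across runs.

    Proteins are ranked by mean quantity and split into two halves.
    Returns (median CV of the lower half, median CV of the upper half).
    """
    ranked = sorted(quant, key=lambda k: statistics.mean(quant[k]))
    half = len(ranked) // 2
    low = [cv(quant[k]) for k in ranked[:half]]
    high = [cv(quant[k]) for k in ranked[half:]]
    return statistics.median(low), statistics.median(high)


# Toy quantities across four runs (hypothetical).
toy = {
    "A": [100.0, 150.0, 50.0, 100.0],
    "B": [200.0, 260.0, 140.0, 200.0],
    "C": [10000.0, 10100.0, 9900.0, 10000.0],
    "D": [50000.0, 50500.0, 49500.0, 50000.0],
}
low_cv, high_cv = median_cv_by_abundance(toy)
print(low_cv, high_cv)
```

If the extra IDs reported by one tool fall mostly in the lower half and that half shows a clearly higher median CV, the wider scatter at the low end is expected rather than a sign of a processing error.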

animesh commented 1 year ago

I see that I am not using the same parameters either, for example:

  1. Deamidation is missing in DIA-NN. How can I supply that as an argument to DIA-NN?
  2. Also, quantification in MaxQuant is done with at least one unique peptide; I am not sure how to do that in DIA-NN?
  3. Protein inference is, I believe, at the UniProt level in MaxQuant, and I think that corresponds to "Isoform IDs" in DIA-NN. Is that correct? Just looking for the fairest way to compare...

vdemichev commented 1 year ago

Hi Ani,

  1. It's really not a good idea to do library-free analysis of DIA data with deamidation searched. The fact that some software tools kind of support this does not mean the FDR of their results is reasonable; it can easily be something like 80% false discoveries at a supposed 1% FDR. DIA-NN does support this with PTM scoring (--monitor-mod), but be prepared for a reduction in ID numbers.

What I would suggest is to first do the analysis following the guidelines in the DIA-NN docs, most importantly keeping things default (e.g. M(ox) not enabled) unless explicitly advised otherwise therein. Once you have those results, you can experiment with adding extra stuff and see if this is beneficial.

  2. Use the 'iq' or 'diann' R package and filter the data frame for Proteotypic == 1. Or use Genes.MaxLFQ.Unique pre-computed by DIA-NN. But for most scenarios using PG.MaxLFQ or Genes.MaxLFQ is fine.

  3. Protein inference in DIA-NN is always at the sequence ID level, but the 'Protein inference' setting provides a hint to DIA-NN about which sequence IDs can be naturally grouped together, so keeping it at the default 'Genes' makes sense.

Best, Vadim
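
The Proteotypic == 1 filter mentioned in the reply above can also be applied without R. Below is a minimal Python sketch using only the standard library. It assumes the main report is a tab-separated file with columns named Protein.Group, Precursor.Id and Proteotypic; the function name, the toy rows and the `min_precursors` parameter are hypothetical. Note that a proteotypic precursor in DIA-NN is not necessarily the same notion as a "unique peptide" in MaxQuant, so this only approximates that criterion.

```python
import csv
import io
from collections import defaultdict


def groups_with_proteotypic(tsv_handle, min_precursors=1):
    """Return {protein group: set of proteotypic precursor IDs} for groups
    supported by at least `min_precursors` distinct proteotypic precursors."""
    support = defaultdict(set)
    for row in csv.DictReader(tsv_handle, delimiter="\t"):
        if row["Proteotypic"] == "1":
            support[row["Protein.Group"]].add(row["Precursor.Id"])
    return {pg: ids for pg, ids in support.items() if len(ids) >= min_precursors}


# Toy report with made-up group and precursor IDs.
toy_report = io.StringIO(
    "Run\tProtein.Group\tPrecursor.Id\tProteotypic\n"
    "run1\tPG_A\tPEPTIDEK2\t1\n"
    "run1\tPG_A\tANOTHERK2\t1\n"
    "run1\tPG_B;PG_C\tSHAREDK2\t0\n"
    "run2\tPG_A\tPEPTIDEK2\t1\n"
)
kept = groups_with_proteotypic(toy_report)
print(sorted(kept))
```

For a real report, the handle would come from `open(path, newline="")`, and the resulting set of groups could be used to subset the protein group matrix before comparing against MaxQuant.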

animesh commented 1 year ago

Thanks for the recommendation @vdemichev 👍🏽 BTW, is there a way to create msms.txt and evidence.txt from a DIA-NN library-free search/FASTA file, which I can use as input to MaxDIA?