vdemichev / DiaNN

DIA-NN - a universal automated software suite for DIA proteomics data analysis.

WARNING: 458825 precursors were wrongly annotated in the library as proteotypic #1014

Open tobiasko opened 4 months ago

tobiasko commented 4 months ago

I used the following commands to run DIA-NN and got the above warning:

nice -19 /usr/diann/1.8.2_beta_8/linux/diann-1.8.1.8 --threads 32 --mass-acc 15 --mass-acc-ms1 15 --matrices --pg-level 1 --report-lib-info --f /scratch/cpanse/PXD028735/dia/LFQ_Orbitrap_AIF_Condition_A_Sample_Beta_01.mzML --f /scratch/cpanse/PXD028735/dia/LFQ_Orbitrap_AIF_Condition_A_Sample_Gamma_02.mzML --f /scratch/cpanse/PXD028735/dia/LFQ_Orbitrap_AIF_Condition_B_Sample_Alpha_01.mzML --f /scratch/cpanse/PXD028735/dia/LFQ_Orbitrap_AIF_Condition_A_Sample_Gamma_01.mzML --f /scratch/cpanse/PXD028735/dia/LFQ_Orbitrap_AIF_Condition_B_Sample_Beta_02.mzML --f /scratch/cpanse/PXD028735/dia/LFQ_Orbitrap_AIF_Condition_A_Sample_Alpha_03.mzML --f /scratch/cpanse/PXD028735/dia/LFQ_Orbitrap_AIF_Condition_B_Sample_Beta_03.mzML --f /scratch/cpanse/PXD028735/dia/LFQ_Orbitrap_AIF_Condition_B_Sample_Beta_01.mzML --f /scratch/cpanse/PXD028735/dia/LFQ_Orbitrap_AIF_Condition_B_Sample_Alpha_03.mzML --f /scratch/cpanse/PXD028735/dia/LFQ_Orbitrap_AIF_Condition_A_Sample_Beta_03.mzML --f /scratch/cpanse/PXD028735/dia/LFQ_Orbitrap_AIF_Condition_B_Sample_Gamma_01.mzML --f /scratch/cpanse/PXD028735/dia/LFQ_Orbitrap_AIF_Condition_A_Sample_Alpha_02.mzML --f /scratch/cpanse/PXD028735/dia/LFQ_Orbitrap_AIF_Condition_A_Sample_Beta_02.mzML --f /scratch/cpanse/PXD028735/dia/LFQ_Orbitrap_AIF_Condition_A_Sample_Alpha_01.mzML --f /scratch/cpanse/PXD028735/dia/LFQ_Orbitrap_AIF_Condition_B_Sample_Alpha_02.mzML --f /scratch/cpanse/PXD028735/dia/LFQ_Orbitrap_AIF_Condition_A_Sample_Gamma_03.mzML --f /scratch/cpanse/PXD028735/dia/LFQ_Orbitrap_AIF_Condition_B_Sample_Gamma_02.mzML --f /scratch/cpanse/PXD028735/dia/LFQ_Orbitrap_AIF_Condition_B_Sample_Gamma_03.mzML --f /scratch/cpanse/PXD028735/dia/LFQ_Orbitrap_AIF_Ecoli_01.mzML --f /scratch/cpanse/PXD028735/dia/LFQ_Orbitrap_AIF_Ecoli_02.mzML --f /scratch/cpanse/PXD028735/dia/LFQ_Orbitrap_AIF_Ecoli_03.mzML --f /scratch/cpanse/PXD028735/dia/LFQ_Orbitrap_AIF_Human_01.mzML --f /scratch/cpanse/PXD028735/dia/LFQ_Orbitrap_AIF_Human_02.mzML --f 
/scratch/cpanse/PXD028735/dia/LFQ_Orbitrap_AIF_Human_03.mzML --f /scratch/cpanse/PXD028735/dia/LFQ_Orbitrap_AIF_QC_02.mzML --f /scratch/cpanse/PXD028735/dia/LFQ_Orbitrap_AIF_QC_03.mzML --f /scratch/cpanse/PXD028735/dia/LFQ_Orbitrap_AIF_QC_04.mzML --f /scratch/cpanse/PXD028735/dia/LFQ_Orbitrap_AIF_QC_06.mzML --f /scratch/cpanse/PXD028735/dia/LFQ_Orbitrap_AIF_QC_05.mzML --f /scratch/cpanse/PXD028735/dia/LFQ_Orbitrap_AIF_QC_07.mzML --f /scratch/cpanse/PXD028735/dia/LFQ_Orbitrap_AIF_QC_08.mzML --f /scratch/cpanse/PXD028735/dia/LFQ_Orbitrap_AIF_QC_09.mzML --f /scratch/cpanse/PXD028735/dia/LFQ_Orbitrap_AIF_Yeast_01.mzML --f /scratch/cpanse/PXD028735/dia/LFQ_Orbitrap_AIF_Yeast_02.mzML --f /scratch/cpanse/PXD028735/dia/LFQ_Orbitrap_AIF_Yeast_03.mzML --lib /scratch/cpanse/PXD028735/dia/diann.predicted.speclib --vis 20,LGGNEQVTR,YILAGVENSK,GTFIIDPGGVIR,GTFIIDPAAVIR,GAGSSEPVTGLDAK,TPVISGGPYEYR,VEATFGVDESNAK,TPVITGAPYEYR,DGLDAASYYAPVR,ADVTPADFSEWSK,LFLQFGAQGSPFLK --temp /scratch/cpanse/PXD028735/dia/temp-2024-05-14_17-37-42/ --out /scratch/cpanse/PXD028735/dia/out-2024-05-14_17-37-42/diann-output.tsv
DIA-NN 1.8.2 beta 8 (Data-Independent Acquisition by Neural Networks)
Compiled on Dec  1 2022 14:47:06
Current date and time: Tue May 14 17:37:42 2024
Logical CPU cores: 128
Thread number set to 32
Precursor/protein x samples expression level matrices will be saved along with the main report
Implicit protein grouping: protein names; this determines which peptides are considered 'proteotypic' and thus affects protein FDR calculation
XICs for precursors corresponding to 11 peptides will be saved
Mass accuracy will be fixed to 1.5e-05 (MS2) and 1.5e-05 (MS1)

35 files will be processed
[0:00] Loading spectral library /scratch/cpanse/PXD028735/dia/diann.predicted.speclib
[0:01] Library annotated with sequence database(s): /scratch/cpanse/PXD028735/fasta/uniprotkb_proteome_UP000000625_UP000002311_UP000005640_iRTkit_2023_07_04.fasta
[0:01] Gene names missing for some isoforms
[0:01] Library contains 92153 proteins, and 30205 genes
[0:02] Spectral library loaded: 92153 protein isoforms, 131350 protein groups and 1884033 precursors in 1008008 elution groups.
[0:02] Initialising library

[0:04] File #1/35
[0:04] Loading run /scratch/cpanse/PXD028735/dia/LFQ_Orbitrap_AIF_Condition_A_Sample_Beta_01.mzML
[1:02] 1303231 library precursors are potentially detectable
[1:02] Processing...
[1:27] RT window set to 8.07056
[1:27] Peak width: 7.436
[1:27] Scan window radius set to 16
[1:27] Recommended MS1 mass accuracy setting: 8.75961 ppm
[2:50] Removing low confidence identifications
[2:50] Removing interfering precursors
[2:55] Training neural networks: 107200 targets, 77468 decoys
[3:01] Number of IDs at 0.01 FDR: 58959
[3:02] Calculating protein q-values
WARNING: 458825 precursors were wrongly annotated in the library as proteotypic
[3:02] Number of proteins identified at 1% FDR: 24051 (precursor-level), 22838 (protein-level) (inference performed using proteotypic peptides only)
[3:02] Quantification
[3:03] Quantification information saved to /scratch/cpanse/PXD028735/dia/temp-2024-05-14_17-37-42/_scratch_cpanse_PXD028735_dia_LFQ_Orbitrap_AIF_Condition_A_Sample_Beta_01_mzML.quant.

How can this happen, given that the speclib was generated by DIA-NN itself, starting from a UniProt FASTA DB?

vdemichev commented 4 months ago

Hi Tobias,

This likely indicates that the .predicted.speclib was generated with protein inference set to 'Genes', while here it is set to protein names, hence the discrepancy in the proteotypicity definition.

Best, Vadim
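To illustrate why the grouping level changes which peptides count as proteotypic, here is a minimal sketch. It is not DIA-NN's implementation; the accessions, protein names, gene name and the `is_proteotypic` helper are all hypothetical. It assumes a peptide is proteotypic when every protein it maps to falls into a single group at the chosen level (0 = isoform IDs, 1 = protein names, 2 = genes).

```python
# Toy annotation table: two distinct protein entries that share one gene.
# All identifiers are made up for illustration.
PROTEINS = {
    "ACC1": {"isoform": "ACC1", "name": "PROTA_HUMAN", "gene": "GENEX"},
    "ACC2": {"isoform": "ACC2", "name": "PROTB_HUMAN", "gene": "GENEX"},
}

# Assumed mapping of --pg-level values to the annotation field used for grouping.
LEVEL_TO_FIELD = {0: "isoform", 1: "name", 2: "gene"}


def is_proteotypic(protein_ids, proteins, pg_level):
    """Return True if all mapped proteins collapse into one group at pg_level."""
    field = LEVEL_TO_FIELD[pg_level]
    groups = {proteins[pid][field] for pid in protein_ids}
    return len(groups) == 1


# A peptide shared by both entries of the same gene.
peptide_proteins = ["ACC1", "ACC2"]

gene_level = is_proteotypic(peptide_proteins, PROTEINS, pg_level=2)
name_level = is_proteotypic(peptide_proteins, PROTEINS, pg_level=1)
print("proteotypic at gene level:", gene_level)
print("proteotypic at protein-name level:", name_level)
```

In this toy case the peptide is unique to one gene but shared between two protein names, so a library annotated at gene level would flag it as proteotypic while an analysis run with protein names would not.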

tobiasko commented 4 months ago

Ahhhhhhh! So the --pg-level [N] default is 2 (gene), and this parameter also affects library prediction from a UniProt FASTA file (GN=XXXX)?

BTW: What happens if a speclib from an external source (e.g. PROSIT) does not contain the proteotypicity information, or even lacks the protein entry the parent was derived from? Example in .msp format:

Name: MLGNMNVFMAVLGIILFSGFLAAYFSHK/2
MW: 1546.302144302
Comment: Parent=1546.30214430 Collision_energy=30 Mods=0 ModString=MLGNMNVFMAVLGIILFSGFLAAYFSHK///2 iRT=166.03
Num peaks: 40
147.11280823    0.1296  "y1/0.0ppm"
284.17172241    0.3208  "y2/0.0ppm"
245.13182068    0.2179  "b2/0.0ppm"
371.20373535    0.3588  "y3/0.0ppm"
302.15328979    0.0904  "b3/0.0ppm"
518.27215576    0.5050  "y4/0.0ppm"
416.19622803    0.2998  "b4/0.0ppm"
681.33551025    0.4583  "y5/0.0ppm"
547.23669434    0.3665  "b5/0.0ppm"
752.37261963    0.5639  "y6/0.0ppm"
661.27960205    0.3683  "b6/0.0ppm"
823.40972900    0.5475  "y7/0.0ppm"
760.34802246    0.7807  "b7/0.0ppm"
936.49377441    0.3506  "y8/0.0ppm"
907.41644287    0.1586  "b8/0.0ppm"
1083.56213379   0.1840  "y9/0.0ppm"
1038.45690918   0.2239  "b9/0.0ppm"
1140.58361816   0.6261  "y10/0.0ppm"
1109.49401855   0.2785  "b10/0.0ppm"
1227.61572266   0.9031  "y11/0.0ppm"
1208.56250000   0.2053  "b11/0.0ppm"
1374.68408203   1.0000  "y12/0.0ppm"
1321.64648438   0.1896  "b12/0.0ppm"
1487.76818848   0.9154  "y13/0.0ppm"
1378.66796875   0.1396  "b13/0.0ppm"
1600.85217285   0.7807  "y14/0.0ppm"
1491.75207520   0.1016  "b14/0.0ppm"
1713.93627930   0.2664  "y15/0.0ppm"
1604.83618164   0.0607  "b15/0.0ppm"
1770.95776367   0.9069  "y16/0.0ppm"
1717.92016602   0.0450  "b16/0.0ppm"
1884.04187012   0.5717  "y17/0.0ppm"
1864.98864746   0.0218  "b17/0.0ppm"
1983.11022949   0.2061  "y18/0.0ppm"
2054.14746094   0.1089  "y19/0.0ppm"
2185.18774414   0.1035  "y20/0.0ppm"
2332.25634766   0.1948  "y21/0.0ppm"
1166.63171387   0.0697  "y21^2/0.0ppm"
1273.18737793   0.0303  "y23^2/0.0ppm"
1424.23986816   0.0108  "y26^2/0.0ppm"

Should one always add --reannotate --pg-level [N] if one wants to compare stats on an aggregated level like protein group/gene?
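As a quick way to check whether an .msp entry like the one above carries any protein annotation, the header can be parsed with a few lines of Python. This is only a sketch: the `parse_msp_header` and `has_protein_annotation` helpers are hypothetical, and the assumption that protein information would appear as a `Protein=` token in the Comment line is a convention borrowed from other .msp libraries, not something confirmed for PROSIT output.

```python
# Header lines copied from the entry above; peak lines are omitted.
ENTRY = """Name: MLGNMNVFMAVLGIILFSGFLAAYFSHK/2
MW: 1546.302144302
Comment: Parent=1546.30214430 Collision_energy=30 Mods=0 ModString=MLGNMNVFMAVLGIILFSGFLAAYFSHK///2 iRT=166.03
Num peaks: 40
"""


def parse_msp_header(entry):
    """Collect 'Key: value' header fields and 'key=value' tokens from Comment."""
    fields = {}
    for line in entry.splitlines():
        if ":" not in line:
            continue  # not a header line
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "Comment":
            # Comment holds whitespace-separated key=value tokens.
            for token in value.split():
                if "=" in token:
                    sub_key, _, sub_value = token.partition("=")
                    fields[sub_key] = sub_value
        else:
            fields[key] = value
        if key == "Num peaks":
            break  # the peak list follows; stop reading the header
    return fields


def has_protein_annotation(fields):
    """Assumed convention: protein info would be stored under a 'Protein' key."""
    return "Protein" in fields


fields = parse_msp_header(ENTRY)
print(sorted(fields))
print("protein annotation present:", has_protein_annotation(fields))
```

For this entry the parsed keys are the name, mass, collision energy, modification and iRT fields only, so under the stated assumption there is nothing from which a protein or proteotypicity flag could be derived.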

vdemichev commented 4 months ago

Should one always add --reannotate --pg-level [N] if one wants to compare stats on an aggregated level like protein group/gene?

Makes sense indeed if the library missed protein info.
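For reference, a run that reannotates an external library could be assembled roughly as follows. This is a hedged sketch, not a recommended command line: the `build_diann_command` helper and all file paths are hypothetical, `--lib`, `--f`, `--out`, `--threads` and `--pg-level` are taken from the command earlier in this thread, and `--fasta` is assumed to be the option that supplies the sequence database `--reannotate` works from. Digest-related settings may also matter for reannotation and are left out here.

```python
import shlex


def build_diann_command(lib, fasta, runs, out, pg_level=1, threads=32):
    """Assemble a DIA-NN argument list that reannotates the library from a FASTA."""
    cmd = [
        "diann",
        "--lib", lib,
        "--fasta", fasta,      # assumed: database used for reannotation
        "--reannotate",
        "--pg-level", str(pg_level),
        "--threads", str(threads),
        "--out", out,
    ]
    for run in runs:
        cmd += ["--f", run]    # one --f per raw file, as in the original command
    return cmd


# Placeholder paths, for illustration only.
cmd = build_diann_command(
    lib="external_library.msp",
    fasta="proteome.fasta",
    runs=["run_01.mzML", "run_02.mzML"],
    out="diann-output.tsv",
)
print(shlex.join(cmd))
```

Keeping the same `--pg-level` value for library annotation and for the analysis run is the point of the exchange above; the helper simply makes that value explicit in one place.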