identified peptides missing from the spectral library

zhakhverdyan-food commented 2 years ago

Hi Vadim, thank you so much for your lab's hard work. DIA-NN is an intuitive and seamless tool that is a pleasure to use, even for a novice like myself!

I have a bit of a mystery on my hands. I am conducting a library-free search of one DIA sample. One question we had was what fraction of the universe of potential precursors in the spectral library (generated from a digest of the human proteome FASTA file) was identified. To my surprise, ~80% of the precursor sequences in the report.tsv file were missing from the predicted.speclib file, although they were present in the report-lib.tsv file.

Do you have any suggestions as to how this could happen? The size of the spectral library was 1.3GB, and 3.7GB after .tsv conversion. Does that size sound reasonable for the human proteome? I am attaching the DIA-NN log output:

Compiled on Jun 28 2021 14:55:31
Current date and time: Wed May 25 13:48:42 2022
CPU: GenuineIntel Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
SIMD instructions: AVX AVX2 FMA SSE4.1 SSE4.2 
Logical CPU cores: 4
diann.exe --f C:\control\test.mzML  --lib  --threads 4 --verbose 1 --out C:\DIA-NN\1.8\report.tsv --qvalue 0.01 --out-lib C:\DIA-NN\1.8\report-lib.tsv --gen-spec-lib --predictor --fasta C:\control\UP000005640_9606.fasta --fasta-search --min-fr-mz 200 --max-fr-mz 1800 --cut K*,R* --missed-cleavages 1 --min-pep-len 7 --max-pep-len 35 --min-pr-mz 300 --max-pr-mz 1800 --min-pr-charge 2 --max-pr-charge 6 --unimod4 --var-mods 1 --var-mod UniMod:35,15.994915,M --reanalyse --smart-profiling 

Thread number set to 4
Output will be filtered at 0.01 FDR
A spectral library will be generated
Deep learning will be used to generate a new in silico spectral library from peptides provided
Library-free search enabled
Min fragment m/z set to 200
Max fragment m/z set to 1800
In silico digest will involve cuts at K*,R*
Maximum number of missed cleavages set to 1
Min peptide length set to 7
Max peptide length set to 35
Min precursor m/z set to 300
Max precursor m/z set to 1800
Min precursor charge set to 2
Max precursor charge set to 6
Cysteine carbamidomethylation enabled as a fixed modification
Maximum number of variable modifications set to 1
Modification UniMod:35 with mass delta 15.9949 at M will be considered as variable
A spectral library will be created from the DIA runs and used to reanalyse them; .quant files will only be saved to disk during the first step
When generating a spectral library, in silico predicted spectra will be retained if deemed more reliable than experimental ones
DIA-NN will optimise the mass accuracy automatically using the first run in the experiment. This is useful primarily for quick initial analyses, when it is not yet known which mass accuracy setting works best for a particular acquisition scheme.
WARNING: MBR turned off, two or more raw files are required

1 files will be processed
[0:00] Loading FASTA C:\control\UP000005640_9606.fasta
[0:06] Processing FASTA
[0:22] Assembling elution groups
[0:34] 4839748 precursors generated
[0:34] Gene names missing for some isoforms
[0:34] Library contains 20576 proteins, and 20325 genes
[0:35] [0:50] [65:29] [75:20] [76:03] [76:07] Saving the library to C:\DIA-NN\1.8\report-lib.predicted.speclib
[76:17] Initialising library

[76:21] File #1/1
[76:21] Loading run C:\control\test.mzML
[76:39] 4236102 library precursors are potentially detectable
[76:39] Processing...
[81:35] RT window set to 0.583141
[81:35] Peak width: 2.88
[81:35] Scan window radius set to 6
[81:36] Recommended MS1 mass accuracy setting: 6.09284 ppm
[89:35] Optimised mass accuracy: 17.8871 ppm
[92:32] Removing low confidence identifications
[92:32] Removing interfering precursors
[92:33] Training neural networks: 7476 targets, 5223 decoys
[92:35] Number of IDs at 0.01 FDR: 3845
[92:35] Calculating protein q-values
[92:36] Number of genes identified at 1% FDR: 272 (precursor-level), 169 (protein-level) (inference performed using proteotypic peptides only)
[92:36] Quantification
[92:37] Quantification information saved to C:\control\test.mzML.quant.

[92:37] Cross-run analysis
[92:37] Reading quantification information: 1 files
[92:37] Quantifying peptides
[92:37] Assembling protein groups
[92:41] Quantifying proteins
[92:42] Calculating q-values for protein and gene groups
[92:42] Calculating global q-values for protein and gene groups
[92:43] Writing report
[92:43] Report saved to C:\DIA-NN\1.8\report.tsv.
[92:43] Stats report saved to C:\DIA-NN\1.8\report.stats.tsv
[92:43] Generating spectral library:
[92:43] Reading quantification information: 1 files
[92:43] Assembling protein groups
[92:47] 3845 precursors passing the FDR threshold are to be extracted
[92:47] Loading run C:\control\test.mzML
[93:05] 4236102 library precursors are potentially detectable
[93:06] 3348 spectra added to the library
[93:06] Saving spectral library to C:\DIA-NN\1.8\report-lib.tsv
[93:07] 3845 precursors saved
[93:07] Loading the generated library and saving it in the .speclib format
[93:07] Loading spectral library C:\DIA-NN\1.8\report-lib.tsv
[93:07] Spectral library loaded: 345 protein isoforms, 325 protein groups and 3845 precursors in 2835 elution groups.
[93:07] Loading protein annotations from FASTA C:\control\UP000005640_9606.fasta
[93:08] Gene names missing for some isoforms
[93:08] Library contains 345 proteins, and 342 genes
[93:08] Saving the library to C:\DIA-NN\1.8\report-lib.tsv.speclib

vdemichev commented 2 years ago

Looks like some problem with the .speclib to .tsv conversion... I would repeat it, and also make sure there is enough space on the disk.
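One way to spot a conversion that was cut short is to look for rows in the .tsv whose field count differs from the header. The sketch below is only an illustration under that assumption: `find_short_rows` is a hypothetical helper, not part of DIA-NN, and it will not catch a file that happens to end exactly on a line boundary.

```python
import csv


def find_short_rows(tsv_path, max_report=5):
    """Return (line_number, field_count) for rows that do not match the header width.

    A write interrupted by a full disk often leaves a partial final row,
    which shows up here as a row with too few fields.
    """
    bad = []
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        # QUOTE_NONE: treat quote characters as ordinary text.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        header = next(reader)
        for row in reader:
            if len(row) != len(header):
                bad.append((reader.line_num, len(row)))
                if len(bad) >= max_report:
                    break  # a few examples are enough to flag the file
    return bad
```

An empty list means every row has the same number of fields as the header; it does not prove the file is complete.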

vdemichev commented 2 years ago

To convert .speclib to .tsv, specify the .speclib in the Spectral library field and the desired .tsv file in the Output library field, without (it's important) specifying any raw data, and press Run.

zhakhverdyan-food commented 2 years ago

Thank you for the prompt reply! I will free up some space and try again.

zhakhverdyan-food commented 2 years ago

Hi Vadim, yes indeed, space was the issue! After .tsv conversion the file expanded ~13-fold. Thank you so much and all the best!
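For anyone hitting the same problem, a rough pre-check of free disk space before converting might look like the sketch below. It assumes the ~13-fold expansion seen here carries over to other libraries, which may not hold; `has_room_for_tsv` and `EXPANSION_FACTOR` are hypothetical names used for illustration only.

```python
import os
import shutil

# Assumption: the ~13-fold growth observed in this thread; other libraries may differ.
EXPANSION_FACTOR = 13


def has_room_for_tsv(speclib_path, output_dir, expansion_factor=EXPANSION_FACTOR):
    """Estimate whether output_dir has enough free space for the converted .tsv.

    Returns (ok, bytes_needed, bytes_free).
    """
    needed = os.path.getsize(speclib_path) * expansion_factor
    free = shutil.disk_usage(output_dir).free
    return free >= needed, needed, free
```

With the 1.3GB library above, this estimate works out to roughly 17GB of free space needed for the .tsv.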

zhakhverdyan-food commented 2 years ago

100% of peptides in the report file matched with .tsv spectral library peptides.
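A check of this kind can be sketched in a few lines of Python. The column names used in the usage note (`Modified.Sequence` / `Precursor.Charge` for the report, `ModifiedPeptide` / `PrecursorCharge` for the library .tsv) are assumptions and should be verified against the actual file headers; the function names are hypothetical. The sketch also assumes both files write modifications in the same notation. The library is streamed row by row rather than loaded whole, since the converted .tsv can be very large.

```python
import csv


def iter_keys(tsv_path, seq_col, charge_col):
    """Yield (modified sequence, charge) for every row of a tab-separated file."""
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            yield row[seq_col].strip(), row[charge_col].strip()


def fraction_in_library(report_path, library_path, report_cols, library_cols):
    """Return (fraction of report precursors found in the library, missing precursors).

    report_cols and library_cols are (sequence_column, charge_column) pairs.
    """
    # The report is small, so its precursors fit comfortably in memory.
    wanted = set(iter_keys(report_path, *report_cols))
    found = set()
    # Stream the library; it has one row per fragment, so keys repeat.
    for key in iter_keys(library_path, *library_cols):
        if key in wanted:
            found.add(key)
    fraction = len(found) / len(wanted) if wanted else 0.0
    return fraction, wanted - found
```

For example, `fraction_in_library("report.tsv", "predicted.tsv", ("Modified.Sequence", "Precursor.Charge"), ("ModifiedPeptide", "PrecursorCharge"))` returns the matched fraction and the set of report precursors not seen in the library, where `predicted.tsv` stands for the converted library file.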

vdemichev / DiaNN

identified peptides missing from the spectral library #408