Parameters for DiaNN library free phospho search

aparnasrini03 commented 1 year ago

Hello! I would like to run a library-free search on diaPASEF data from phospho-enriched samples. Could you suggest the optimal parameters for this?

So far, I have run DIA-NN v1.8.1 using Singularity on a computing cluster. The input is a single 2 GB diaPASEF file from a 30 min LC gradient of a phospho-enriched HeLa cell sample. Thank you in advance for your help!

First search: full human proteome in silico digest, considering only peptides with phosphorylation. However, 0 precursors are generated, so nothing is extracted from the data. I'm not sure whether this is due to an error in precursor generation when the --mod-only flag is provided. Is there a way to consider only phosphorylated precursors? Log output below:

DIA-NN 1.8.1 (Data-Independent Acquisition by Neural Networks)
Compiled on Apr 15 2022 08:45:18
Current date and time: Fri Feb  3 14:33:01 2023
Logical CPU cores: 80

/usr/diann/1.8.1/diann --f diapasef_phospho_file.d --lib  --threads 2 --verbose 5 --out test_report.tsv --qvalue 0.01 \
--matrices --out-lib test_diann_hela_library.tsv --gen-spec-lib --fasta UP000005640_9606.fasta --fasta-search \
--min-fr-mz 200 --max-fr-mz 1800 --met-excision --cut K*,R* --missed-cleavages 1 --min-pep-len 7 --max-pep-len 30 \
--min-pr-mz 300 --max-pr-mz 1800 --min-pr-charge 2 --max-pr-charge 4 --mass-acc 10 --mass-acc-ms1 10 --max-fr 6 \
--fixed-mod UniMod:4,57.021464,C - carboxyamidomethylation --var-mod UniMod:21,79.966331,STY - phosphorylation \
--var-mod UniMod:1,42.010565,*n - N-terminal protein acetylation --var-mod UniMod:35,15.994915,M - oxidation \
--var-mods 2 --no-isotopes --reannotate --smart-profiling --monitor-mod phosphorylation --mod-only 

Thread number set to 2
Output will be filtered at 0.01 FDR
Precursor/protein x samples expression level matrices will be saved along with the main report
A spectral library will be generated
Library-free search enabled
Min fragment m/z set to 200
Max fragment m/z set to 1800
N-terminal methionine excision enabled
In silico digest will involve cuts at K*,R*
Maximum number of missed cleavages set to 1
Min peptide length set to 7
Max peptide length set to 30
Min precursor m/z set to 300
Max precursor m/z set to 1800
Min precursor charge set to 2
Max precursor charge set to 4
Maximum number of fragments set to 6
Modification UniMod:4 with mass delta 57.0215 at C - carboxyamidomethylation will be considered as fixed
Modification UniMod:21 with mass delta 79.9663 at STY - phosphorylation will be considered as variable
Modification UniMod:1 with mass delta 42.0106 at *n - N-terminal protein acetylation will be considered as variable
Modification UniMod:35 with mass delta 15.9949 at M - oxidation will be considered as variable
Maximum number of variable modifications set to 2
Isotopologue chromatograms will not be used
Library precursors will be reannotated using the FASTA database
When generating a spectral library, in silico predicted spectra will be retained if deemed more reliable than experimental ones
Only peptides bearing the modifications specified with --monitor-mod will be considered
Mass accuracy will be fixed to 1e-05 (MS2) and 1e-05 (MS1)
Exclusion of fragments shared between heavy and light peptides from quantification is not supported in FASTA digest mode - disabled; to enable, generate an in silico predicted spectral library and analyse with this library
WARNING: it's strongly recommended to use deep learning spectra/RTs prediction. If impossible because some unsupported modifications need to be searched, consider using either (i) the --strip-unknown-mods command or (ii) a 'training library' specified with the --learn-lib command. A training library might facilitate about 1.5x more IDs, deep learning - about 2x-3x more IDs.

1 files will be processed
[0:00] Loading FASTA /UP000005640_9606.fasta
[0:16] Processing FASTA
[0:16] Assembling elution groups
[0:16] 0 precursors generated
[0:16] Library contains 0 proteins, and 0 genes
[0:16] Initialising library

[0:16] File #1/1
[0:16] Loading run diapasef_phospho_file.d
[0:24] Detected MS/MS range: 99.9986 - 1700
[0:24] Run loaded
[0:24] 0 library precursors are potentially detectable
[0:24] Processing
[0:24] Precursor search
[0:24] Optimising weights
Ids at 10% FDR using TC scoring: 0
Ids at 10% FDR using TC selection: 0
WARNING: too few training precursors, classifier will not be used
[0:24] Calculating q-values

...ETC

[0:25] Cross-run analysis
[0:25] Reading quantification information: 1 files
[0:25] Quantifying peptides
[0:25] Quantifying proteins
[0:25] No protein annotation, skipping protein q-value calculation
[0:25] No protein annotation, skipping global protein q-value calculation
[0:25] Writing report
[0:25] Report saved to test_report.tsv.
[0:25] Saving precursor levels matrix
[0:25] Precursor levels matrix (1% precursor and protein group FDR) saved to /test_report.pr_matrix.tsv.
[0:25] Saving protein group levels matrix
[0:25] Saving gene group levels matrix
[0:25] Saving unique genes levels matrix
[0:25] Stats report saved to test_report.stats.tsv
[0:25] Generating spectral library:
[0:25] 0 precursors passing the FDR threshold are to be extracted
[0:25] Saving spectral library to test_library.tsv
[0:25] 0 precursors saved
[0:25] Loading the generated library and saving it in the .speclib format
[0:25] Loading spectral library test_library.tsv
[0:25] Spectral library loaded: 0 protein isoforms, 0 protein groups and 0 precursors in 1 elution groups.
[0:25] Loading protein annotations from FASTA UP000005640_9606.fasta
[0:25] Library contains 0 proteins, and 0 genes
[0:25] Saving the library to test_library.tsv.speclib
[0:25] Log saved to test_report.log.txt
Finished
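
One thing that may be worth checking in the command above (an assumption, not something confirmed in this thread): the modifications are declared under names such as UniMod:21, while --monitor-mod is given the word phosphorylation. If --monitor-mod is matched against the declared name, --mod-only could filter out every peptide, which would be consistent with "0 precursors generated". A minimal sketch of such a check, using a shortened version of the command and hypothetical helper names:

```python
import shlex

def declared_mod_names(argv):
    """Collect the first comma-separated field of each --var-mod / --fixed-mod value."""
    names = set()
    for flag, value in zip(argv, argv[1:]):
        if flag in ("--var-mod", "--fixed-mod"):
            names.add(value.split(",")[0])
    return names

def unmatched_monitor_mods(command):
    """Return --monitor-mod values that do not equal any declared modification name."""
    argv = shlex.split(command)
    names = declared_mod_names(argv)
    return [value for flag, value in zip(argv, argv[1:])
            if flag == "--monitor-mod" and value not in names]

# Shortened from the command in the log above; only the relevant flags are kept.
command = (
    "diann --fasta-search "
    "--var-mod UniMod:21,79.966331,STY "
    "--var-mod UniMod:35,15.994915,M "
    "--monitor-mod phosphorylation --mod-only"
)
print(unmatched_monitor_mods(command))
```

If the list is non-empty, rerunning with the declared name (here UniMod:21) in --monitor-mod would be a cheap experiment. Whether that is the actual cause should be checked against the DIA-NN documentation.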

Second search: full human proteome in silico digest, considering all unmodified peptides plus variable phosphorylation, oxidation and N-terminal acetylation. I tried to be conservative about how many precursors are generated (max 2 variable modifications, 1 missed cleavage, charge states 2-4). However, this takes a very long time (I allowed 15 hours for the job and it timed out at about 3300/10000 batches), and very few precursors are detected even at 50% FDR. I guess these issues are due to the high number of precursors. Would adjusting the thread number improve the time?
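
To illustrate why the precursor count grows so quickly with variable modifications, here is a small counting sketch for a single peptide. The peptide SLSTSYMPEK is hypothetical, and the counting is a simplification rather than DIA-NN's actual digest logic: it only considers phosphorylation on S/T/Y and oxidation on M, and ignores N-terminal acetylation and any internal filtering.

```python
from math import comb

# Residues that can carry a variable modification in this simplified model.
VARIABLE_SITES = set("STYM")

def count_modified_forms(peptide, max_var_mods):
    """Count site combinations with 0..max_var_mods modified residues."""
    n_sites = sum(1 for aa in peptide if aa in VARIABLE_SITES)
    return sum(comb(n_sites, k) for k in range(max_var_mods + 1))

peptide = "SLSTSYMPEK"          # hypothetical tryptic peptide with 6 modifiable residues
charge_states = range(2, 5)     # charges 2-4, as in the search described above

forms_2 = count_modified_forms(peptide, 2)
forms_3 = count_modified_forms(peptide, 3)
precursors_2 = forms_2 * len(charge_states)

print(forms_2, forms_3, precursors_2)
```

Under these assumptions a single S/T/Y-rich peptide already yields dozens of precursors, which fits the 74160634 precursors reported in the log for the whole proteome.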

Log output below:

DIA-NN 1.8.1 (Data-Independent Acquisition by Neural Networks)
Compiled on Apr 15 2022 08:45:18
Current date and time: Thu Feb  2 12:13:02 2023
Logical CPU cores: 80

/usr/diann/1.8.1/diann --f diapasef_phospho_file.d --lib  --threads 2 --verbose 5 --out test_report.tsv --qvalue 0.01 \
--matrices --out-lib test_diann_hela_library.tsv --gen-spec-lib --fasta UP000005640_9606.fasta --fasta-search \
--min-fr-mz 200 --max-fr-mz 1800 --met-excision --cut K*,R* --missed-cleavages 1 --min-pep-len 7 --max-pep-len 30 \
--min-pr-mz 300 --max-pr-mz 1800 --min-pr-charge 2 --max-pr-charge 4 --mass-acc 10 --mass-acc-ms1 10 --max-fr 6 \
--fixed-mod UniMod:4,57.021464,C - carboxyamidomethylation --var-mod UniMod:21,79.966331,STY - phosphorylation \
--var-mod UniMod:1,42.010565,*n - N-terminal protein acetylation --var-mod UniMod:35,15.994915,M - oxidation \
--var-mods 2 --no-isotopes --reannotate --smart-profiling

Thread number set to 2
Output will be filtered at 0.01 FDR
Precursor/protein x samples expression level matrices will be saved along with the main report
A spectral library will be generated
Library-free search enabled
Min fragment m/z set to 200
Max fragment m/z set to 1800
N-terminal methionine excision enabled
In silico digest will involve cuts at K*,R*
Maximum number of missed cleavages set to 1
Min peptide length set to 7
Max peptide length set to 30
Min precursor m/z set to 300
Max precursor m/z set to 1800
Min precursor charge set to 2
Max precursor charge set to 4
Maximum number of fragments set to 6
Modification UniMod:4 with mass delta 57.0215 at C - carboxyamidomethylation will be considered as fixed
Modification UniMod:21 with mass delta 79.9663 at STY - phosphorylation will be considered as variable
Modification UniMod:1 with mass delta 42.0106 at *n - N-terminal protein acetylation will be considered as variable
Modification UniMod:35 with mass delta 15.9949 at M - oxidation will be considered as variable
Maximum number of variable modifications set to 2
Isotopologue chromatograms will not be used
Library precursors will be reannotated using the FASTA database
When generating a spectral library, in silico predicted spectra will be retained if deemed more reliable than experimental ones
Mass accuracy will be fixed to 1e-05 (MS2) and 1e-05 (MS1)
Exclusion of fragments shared between heavy and light peptides from quantification is not supported in FASTA digest mode - disabled; to enable, generate an in silico predicted spectral library and analyse with this library
WARNING: it's strongly recommended to use deep learning spectra/RTs prediction. If impossible because some unsupported modifications need to be searched, consider using either (i) the --strip-unknown-mods command or (ii) a 'training library' specified with the --learn-lib command. A training library might facilitate about 1.5x more IDs, deep learning - about 2x-3x more IDs.

1 files will be processed
[0:00] Loading FASTA UP000005640_9606.fasta
[0:18] Processing FASTA
[3:32] Assembling elution groups
[5:48] 74160634 precursors generated
[5:48] Gene names missing for some isoforms
[5:48] Library contains 20566 proteins, and 20339 genes
[5:51] Initialising library

[7:00] File #1/1
[7:00] Loading run diapasef_phospho_file.d
[10:17] Detected MS/MS range: 99.9986 - 1700
[10:25] Run loaded
[10:52] 61361292 library precursors are potentially detectable
[10:59] Processing batch #1 out of 10000 
[10:59] Precursor search
[11:21] Optimising weights
Ids at 10% FDR using TC scoring: 0
Ids at 10% FDR using TC selection: 0
Averages: 
0.00550878 0.00607197 0 0.00434137 0 -0.00285502 

...
[686:19] Processing batches #2757-3308 out of 10000 
[686:19] Precursor search
[813:55] Optimising weights
Ids at 10% FDR using TC scoring: 1
Ids at 10% FDR using TC selection: 1
Averages: 
-0.00568116 0.00152638 0 -0.050625 0 0.0244782 
Weights: 
4.69546 0.0609991 0 0.611583 0 0.32876 
[814:02] Calculating q-values
[814:19] Number of IDs at 50%, 5%, 1%, 0.1% FDR: 4, 0, 0, 0
[814:19] Calculating q-values
[814:35] Number of IDs at 50%, 5%, 1%, 0.1% FDR: 8, 0, 0, 0
[814:35] Calibrating retention times
[814:36] 50 precursors used for iRT estimation.
[814:36] Processing batches #3309-3970 out of 10000 
[814:36] Precursor search
 error: *** JOB CANCELLED AT DUE TO TIME LIMIT ***
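
For a rough sense of how long this batch loop would need, the progress lines in the log can be extrapolated. The sketch below assumes that the bracketed timestamps are minutes:seconds (814:36 is about 13.6 hours, which fits the 15 hour limit) and that the time per batch stays constant, which the log does not guarantee. It also says nothing about any steps after the batch loop. The function names are hypothetical.

```python
import re

# Matches e.g. "[686:19] Processing batches #2757-3308 out of 10000"
BATCH_LINE = re.compile(
    r"\[(\d+):(\d+)\] Processing batch(?:es)? #(\d+)(?:-\d+)? out of (\d+)"
)

def progress_points(log_text):
    """Return (elapsed_minutes, batches_done, total_batches) for each batch line."""
    points = []
    for m in BATCH_LINE.finditer(log_text):
        minutes = int(m.group(1)) + int(m.group(2)) / 60
        done = int(m.group(3)) - 1      # batches finished before this line
        points.append((minutes, done, int(m.group(4))))
    return points

def estimate_total_hours(points):
    """Linear extrapolation from the first to the last progress point."""
    (t0, done0, _), (t1, done1, total) = points[0], points[-1]
    minutes_per_batch = (t1 - t0) / (done1 - done0)
    return (t1 + (total - done1) * minutes_per_batch) / 60

# Lines copied from the log above.
log_excerpt = """\
[10:59] Processing batch #1 out of 10000
[686:19] Processing batches #2757-3308 out of 10000
[814:36] Processing batches #3309-3970 out of 10000
"""
points = progress_points(log_excerpt)
print(f"{estimate_total_hours(points):.1f} h")
```

With these three lines the extrapolation lands on the order of 40 hours for the batch loop alone, so a 15 hour limit would not have been enough at --threads 2.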

vdemichev commented 1 year ago

Best to follow precisely the guidance in the DIA-NN docs. Specifically for phospho, what works well:

3 var mods
1 missed cleavage
precursor charge range 2 - 3 or 2 - 4
precursor mass range - that of your experiment
generation of an in silico library done in a separate pipeline step (might take days)
Library generation set to IDs, RT and IM profiling
Mass accuracies fixed as recommended for dia-PASEF, scan window fixed as recommended in the docs
Everything else kept default

Once the above works, can experiment with adding extra stuff.
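
The two-step pipeline described above could be assembled roughly as follows. This is a sketch under stated assumptions, not a verified recipe: most flags are taken from the command earlier in this thread, while --predictor, --rt-profiling and --window are assumed from the DIA-NN command-line options and should be checked against the docs for your version. All file names and numeric values are placeholders, and the script only prints the commands without running them.

```python
import shlex

def library_step(fasta, out_lib, min_mz, max_mz, max_charge, threads):
    """Step 1: in silico library generation from a FASTA (may take days)."""
    return [
        "diann", "--fasta", fasta, "--fasta-search", "--predictor",
        "--gen-spec-lib", "--out-lib", out_lib,
        "--cut", "K*,R*", "--missed-cleavages", "1",
        "--var-mods", "3",
        "--var-mod", "UniMod:21,79.966331,STY",
        "--min-pr-charge", "2", "--max-pr-charge", str(max_charge),
        "--min-pr-mz", str(min_mz), "--max-pr-mz", str(max_mz),
        "--threads", str(threads),
    ]

def search_step(raw_file, library, report, ms2_ppm, ms1_ppm, scan_window, threads):
    """Step 2: analyse the raw file with the library produced in step 1."""
    return [
        "diann", "--f", raw_file, "--lib", library, "--out", report,
        "--mass-acc", str(ms2_ppm), "--mass-acc-ms1", str(ms1_ppm),
        "--window", str(scan_window),   # assumed flag for the fixed scan window
        "--rt-profiling",               # assumed flag for "IDs, RT and IM profiling"
        "--threads", str(threads),
    ]

# Placeholder values: the m/z range should be that of the experiment, and the
# mass accuracies and scan window should come from the DIA-NN docs.
step1 = library_step("UP000005640_9606.fasta", "phospho_lib.tsv",
                     min_mz=400, max_mz=1200, max_charge=4, threads=16)
step2 = search_step("diapasef_phospho_file.d", "phospho_lib.predicted.speclib",
                    "report.tsv", ms2_ppm=10, ms1_ppm=10, scan_window=7, threads=16)

print(shlex.join(step1))
print(shlex.join(step2))
```

Building the argument lists in a script keeps the library step and the search step separate, so the slow library generation only has to run once and can be reused across raw files.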

hahahahhhhahaha commented 9 months ago

Do I need to check the Phospho option when I use the FASTA file to generate an in silico library, or do I only check it when analyzing my raw files?

vdemichev / DiaNN, issue #610