vdemichev / DiaNN

DIA-NN - a universal automated software suite for DIA proteomics data analysis.

DIA-NN Phospho search issues #1032

Open JHKC12 opened 2 months ago

JHKC12 commented 2 months ago

Hi there,

I'm trying to run 4x phospho-enriched and 4x global samples in DIA-NN (1.8.2 beta 39), but I am not sure whether DIA-NN is bugged or whether it is still searching and just taking a long time.

From what I have read in the forums, the workflow for phospho DIA in DIA-NN is briefly as follows:

However, I am running the first search, and the current log for the search is below:

[0:00] Loading FASTA C:\Users\mproteomics\Desktop\Analysis Data\Sequences Databases\uniprot-proteome_UP000000589_MOUSE__55,086 (Dec 2023).fasta
[0:27] Processing FASTA
[4:13] Assembling elution groups
[7:39] 50147387 precursors generated
[7:39] Gene names missing for some isoforms
[7:39] Library contains 54858 proteins, and 22143 genes
[8:56] Encoding peptides for spectra and RTs prediction
[13:04] Predicting spectra and IMs

I have been running this for about 2 days now, and I am not sure whether this is a DIA-NN bug or whether I am doing something wrong.

Is there a workflow that explains how to analyse phospho samples?

Thanks in advance.

vdemichev commented 2 months ago

What's the amount of RAM in the system? Not enough RAM could be the only reason why it's slow in this case.

JHKC12 commented 2 months ago

Hi Vadim,

We have about 120 GB of RAM. I am in the process of getting the PC upgraded to more processing power.

But is the workflow above the correct way to go with phospho DIA?

vdemichev commented 2 months ago

We describe exact recommendations for phospho here now: https://github.com/vdemichev/DiaNN?tab=readme-ov-file#ptms-and-peptidoforms

JHKC12 commented 2 months ago

Hi Vadim,

We upgraded our PC to one with more processing power and also used the latest version of DIA-NN with the settings from the recommendations provided (using 'ultrafast' as well). However, the search always seems to get stuck at the step "[16:54] Predicting spectra and IMs".

Is it common for this step to take a long time when generating a spectral library with 3 variable modifications?

vdemichev commented 2 months ago

On a Ryzen 7950X (16 cores) or 10980XE (18 cores) it takes about 3 minutes per million precursors. That is, a rough time to generate a 50-million precursor library on such a PC would be ~150 minutes - with DIA-NN 1.9. So no, it should not take much longer. Can you please share the full log generated so far using the 'Save log' button?
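The arithmetic behind that estimate can be sketched in a few lines of Python. This is only a back-of-the-envelope calculation: the rate of 3 minutes per million precursors is the figure quoted above for those CPUs with DIA-NN 1.9, and the names `MINUTES_PER_MILLION` and `estimate_prediction_minutes` are hypothetical, not part of DIA-NN.

```python
# Rough estimate only. The rate is the figure quoted in this thread for a
# 16-18 core desktop CPU with DIA-NN 1.9; other hardware will differ.
MINUTES_PER_MILLION = 3.0


def estimate_prediction_minutes(n_precursors: int,
                                minutes_per_million: float = MINUTES_PER_MILLION) -> float:
    """Scale the quoted per-million rate linearly to a library size."""
    return n_precursors / 1_000_000 * minutes_per_million


if __name__ == "__main__":
    n = 50_147_387  # precursor count reported in the log above
    print(f"{n:,} precursors: ~{estimate_prediction_minutes(n):.0f} min")
```

A run that exceeds such an estimate many times over points to a bottleneck (RAM, CPU contention) rather than to normal behaviour.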

JHKC12 commented 2 months ago

Here's the log saved so far:


Skyline not found
MSFileReader found: MSFileReader Core 31

diann.exe --lib "" --threads 32 --verbose 1 --out "C:\Users\mproteomics\Desktop\Analysis Data\Sequences Databases\SpecLib_FREE\Mouse_Phos.tsv" --qvalue 0.01 --matrices --out-lib "C:\Users\mproteomics\Desktop\Analysis Data\Sequences Databases\SpecLib_FREE\Mouse_Phos-lib.tsv" --gen-spec-lib --predictor --fasta "C:\Users\mproteomics\Desktop\Analysis Data\Sequences Databases\uniprot-proteome_UP000000589_MOUSE__55,086 (Dec 2023).fasta" --fasta-search --min-fr-mz 200 --max-fr-mz 1800 --met-excision --min-pep-len 7 --max-pep-len 30 --min-pr-mz 300 --max-pr-mz 1800 --min-pr-charge 2 --max-pr-charge 4 --cut K,R --missed-cleavages 1 --unimod4 --var-mods 3 --var-mod UniMod:21,79.966331,STY --peptidoforms --reanalyse --relaxed-prot-inf --rt-profiling

DIA-NN 1.9 (Data-Independent Acquisition by Neural Networks)
Compiled on Jun 8 2024 20:00:31
Current date and time: Mon Jun 24 13:14:29 2024
CPU: GenuineIntel Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz
SIMD instructions: AVX AVX2 AVX512CD AVX512F FMA SSE4.1 SSE4.2
Logical CPU cores: 64
Thread number set to 32
Output will be filtered at 0.01 FDR
Precursor/protein x samples expression level matrices will be saved along with the main report
A spectral library will be generated
Deep learning will be used to generate a new in silico spectral library from peptides provided
Library-free search enabled
Min fragment m/z set to 200
Max fragment m/z set to 1800
N-terminal methionine excision enabled
Min peptide length set to 7
Max peptide length set to 30
Min precursor m/z set to 300
Max precursor m/z set to 1800
Min precursor charge set to 2
Max precursor charge set to 4
In silico digest will involve cuts at K,R
Maximum number of missed cleavages set to 1
Cysteine carbamidomethylation enabled as a fixed modification
Maximum number of variable modifications set to 3
Modification UniMod:21 with mass delta 79.9663 at STY will be considered as variable
Peptidoform scoring enabled
A spectral library will be created from the DIA runs and used to reanalyse them; .quant files will only be saved to disk during the first step
Heuristic protein grouping will be used, to reduce the number of protein groups obtained; this mode is recommended for benchmarking protein ID numbers, GO/pathway and system-scale analyses
The spectral library (if generated) will retain the original spectra but will include empirically-aligned RTs
Exclusion of fragments shared between heavy and light peptides from quantification is not supported in FASTA digest mode - disabled; to enable, generate an in silico predicted spectral library and analyse with this library
The following variable modifications will be scored: UniMod:21
WARNING: MBR turned off, two or more raw files are required

0 files will be processed
[0:00] Loading FASTA C:\Users\mproteomics\Desktop\Analysis Data\Sequences Databases\uniprot-proteome_UP000000589_MOUSE__55,086 (Dec 2023).fasta
[0:34] Processing FASTA
[5:39] Assembling elution groups
[10:38] 50147387 precursors generated
[10:38] Gene names missing for some isoforms
[10:38] Library contains 54858 proteins, and 22143 genes
[11:22] Encoding peptides for spectra and RTs prediction
[16:54] Predicting spectra and IMs


vdemichev commented 2 months ago

Thanks! Still, could it be that the RAM is completely full? How much physical RAM does Task Manager show as occupied? A full RAM would explain why it is taking a very long time.

JHKC12 commented 2 months ago

It says it's only using 5-10% of memory.

vdemichev commented 2 months ago

A 50-million-precursor database should occupy tens of gigabytes, so that is strange. I would try restarting DIA-NN and, if it still takes long, check the reported RAM consumption.

JHKC12 commented 2 months ago

I restarted DIA-NN and changed a few settings:

There has been further progress, and the log now says:

0 files will be processed
[0:00] Loading FASTA C:\Users\mproteomics\Desktop\Analysis Data\Sequences Databases\uniprot-proteome_UP000000589_MOUSE__55,086 (Dec 2023).fasta
[0:21] Processing FASTA
[2:19] Assembling elution groups
[3:32] 20805637 precursors generated
[3:32] Gene names missing for some isoforms
[3:32] Library contains 54858 proteins, and 22143 genes
[3:43] Encoding peptides for spectra and RTs prediction
[4:50] Predicting spectra and IMs
[926:24] Predicting RTs

This has been running for about 15 hours now, so I am not sure whether it is still taking too long. The memory usage is now at around 30%.
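To see which step is consuming the time, the timestamps in a log like the one above can be turned into per-step durations. The following is a minimal sketch that assumes the bracketed stamps are `[minutes:seconds]` elapsed since start (consistent with `[926:24]` after roughly 15 hours); `step_durations` is a hypothetical helper, not a DIA-NN feature.

```python
import re

# Assumption: each log entry starts with an elapsed-time stamp "[minutes:seconds]".
STAMP = re.compile(r"\[(\d+):(\d{2})\]\s*(.*)")


def step_durations(log_lines):
    """Return (step label, minutes spent) pairs.

    A step's duration is the gap until the next stamped entry, so the
    final (still running) step has no end time and is left out.
    """
    events = []
    for line in log_lines:
        match = STAMP.match(line.strip())
        if match:
            elapsed = int(match.group(1)) + int(match.group(2)) / 60
            events.append((elapsed, match.group(3)))
    return [(label, nxt - start)
            for (start, label), (nxt, _) in zip(events, events[1:])]


if __name__ == "__main__":
    log = [
        "[3:43] Encoding peptides for spectra and RTs prediction",
        "[4:50] Predicting spectra and IMs",
        "[926:24] Predicting RTs",
    ]
    for label, minutes in step_durations(log):
        print(f"{minutes:8.1f} min  {label}")
```

For this log, "Predicting spectra and IMs" accounts for about 920 minutes. At the rate of 3 minutes per million precursors quoted earlier in the thread, 20.8 million precursors would be expected to take roughly 62 minutes, so this step ran roughly 15 times slower than that reference.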

vdemichev commented 2 months ago

This is indeed quite strange. Does the CPU load in Task Manager correspond to what you'd expect based on the number of threads set, and are no other high-CPU tasks running on the machine at the same time?

JHKC12 commented 1 month ago

Not that I was able to see. We managed to get through the data, but it seems to take a lot longer than what others have experienced.