vdemichev / DiaNN

DIA-NN - a universal automated software suite for DIA proteomics data analysis.

Library free search #491

Open santoshdbhosale opened 2 years ago

santoshdbhosale commented 2 years ago

Hi Vadim,

I am currently analyzing dia-PASEF data from serum samples. These are the steps I followed, along with my questions:

  1. I built the spectral library from FASTA sequences.
  2. Please see the attached screenshot of the parameters used for the actual file analysis.
  3. Do I have to enable precursor ion generation and generate the refined spectral library again?
  4. Additionally, I have 105 files to process. I tried converting the .d files to .dia (with a few files), but it did not improve the speed. Do you have any suggestions? Our data processing computer has 64 GB of RAM.

[Screenshot: Parameters]

Thank you, Santosh

vdemichev commented 2 years ago

Hi Santosh,

No, once you have obtained a predicted spectral library (.predicted.speclib), FASTA digest and deep learning spectra prediction should not be used again.

Yes, this kind of speed is normal for a library-free search of dia-PASEF data. I cannot see what kind of PC you have, but on 8 cores it would be expected.

Best, Vadim

santoshdbhosale commented 2 years ago

Hi Vadim,

Thank you for the quick response. I was wondering whether I should generate a refined library from the FASTA-predicted library (.predicted.speclib) once the dia-PASEF files from the real samples are loaded. Does this make sense? Thanks, Santosh

vdemichev commented 2 years ago

This happens automatically when MBR is enabled.

santoshdbhosale commented 2 years ago

But in the screenshot above, the option to generate a spectral library is not enabled.

vdemichev commented 2 years ago

That means it was 'unclicked'. Checking MBR enables it automatically.

santoshdbhosale commented 2 years ago

So should I restart the search?

vdemichev commented 2 years ago

Yes

santoshdbhosale commented 2 years ago

Okay. Thank you. So, do you recommend searching the .d files again with the refined spectral library (built from the .dia files)? Thanks, Santosh

vdemichev commented 2 years ago

The easiest approach is to generate an in silico predicted spectral library and then process the entire experiment with it, with MBR enabled.
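A minimal sketch of that two-step workflow, written as Python helpers that only assemble the argument lists. The flags are the ones that appear in the command line quoted later in this thread; treating `--reanalyse` as the command-line counterpart of the MBR checkbox is an assumption based on the log message "used to reanalyse them". All file names are hypothetical placeholders.

```python
def predict_library_cmd(fasta, out_lib, threads=8):
    """Step 1: in silico digest plus deep learning prediction, no raw files."""
    return [
        "diann.exe",
        "--lib", "",              # empty library, as in the quoted command
        "--fasta", fasta,
        "--fasta-search",         # in silico digest of the FASTA
        "--gen-spec-lib",
        "--predictor",            # deep learning spectra/RT prediction
        "--out-lib", out_lib,
        "--threads", str(threads),
    ]


def analyse_cmd(runs, predicted_lib, report, out_lib, threads=8):
    """Step 2: search all runs against the predicted library with MBR."""
    cmd = ["diann.exe"]
    for run in runs:
        cmd += ["--f", run]       # one --f per raw file
    cmd += [
        "--lib", predicted_lib,   # no FASTA digest or prediction here
        "--threads", str(threads),
        "--qvalue", "0.01",
        "--gen-spec-lib",         # paired with --reanalyse in the quoted command
        "--out-lib", out_lib,
        "--reanalyse",            # assumed to correspond to MBR
        "--out", report,
    ]
    return cmd


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    print(predict_library_cmd("serum.fasta", "lib/serum.tsv"))
    print(analyse_cmd(["s1.d", "s2.d"], "lib/serum.predicted.speclib",
                      "out/report.tsv", "out/lib.tsv"))
```

The lists can be passed to `subprocess.run` as they are, which avoids quoting problems with paths that contain spaces.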

GeorgeCMarinescu commented 2 years ago

Hi Vadim, I am experiencing 2 issues:

  1. I am trying to build a library from 2 DIA acquisition files (64 variable windows, acquired on a SCIEX TripleTOF 5600+), each 90 minutes long. I am running on a Xeon server with 128 GB RAM and 8 hyperthreaded cores (16 logical cores). We use the mouse (Mus musculus) FASTA proteome file. It is very slow: processing has not finished after more than 3 days.

Here is the command line:

diann.exe --f "M:\sciex\swath data\O4_5uL_7sep22.wiff " --f "M:\sciex\swath data\O2_5uL_7sep22.wiff " --lib "" --threads 16 --verbose 2 --out "M:\DIA-NN\5uL_O2_O4\report.tsv" --qvalue 0.01 --matrices --temp "M:\DIA-NN\5uL_O2_O4\tmp" --out-lib "M:\DIA-NN\5uL_O2_O4" --gen-spec-lib --predictor --fasta "M:\our fuckin data\fasta_files\uniprot_mus_musculus_filtered_reviewed.fasta" --fasta-search --min-fr-mz 100 --max-fr-mz 1600 --met-excision --cut K,R,!P --missed-cleavages 2 --min-pep-len 7 --max-pep-len 30 --min-pr-mz 400 --max-pr-mz 1250 --min-pr-charge 1 --max-pr-charge 5 --unimod4 --var-mods 2 --var-mod UniMod:35,15.994915,M --var-mod UniMod:1,42.010565,n --monitor-mod UniMod:1 --var-mod UniMod:21,79.966331,STY --monitor-mod UniMod:21 --var-mod UniMod:121,114.042927,K --monitor-mod UniMod:121 --no-cut-after-mod UniMod:121 --mass-acc 15 --mass-acc-ms1 15 --double-search --reanalyse --relaxed-prot-inf --smart-profiling --no-ifs-removal --no-norm

And here is the log output so far:

Current date and time: Thu Sep 8 22:53:14 2022
CPU: GenuineIntel Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
SIMD instructions: AVX SSE4.1 SSE4.2
Logical CPU cores: 16
Thread number set to 16
Output will be filtered at 0.01 FDR
Precursor/protein x samples expression level matrices will be saved along with the main report
A spectral library will be generated
Deep learning will be used to generate a new in silico spectral library from peptides provided
Library-free search enabled
Min fragment m/z set to 100
Max fragment m/z set to 1600
N-terminal methionine excision enabled
In silico digest will involve cuts at K,R
But excluding cuts at P
Maximum number of missed cleavages set to 2
Min peptide length set to 7
Max peptide length set to 30
Min precursor m/z set to 400
Max precursor m/z set to 1250
Min precursor charge set to 1
Max precursor charge set to 5
Cysteine carbamidomethylation enabled as a fixed modification
Maximum number of variable modifications set to 2
Modification UniMod:35 with mass delta 15.9949 at M will be considered as variable
Modification UniMod:1 with mass delta 42.0106 at n will be considered as variable
Modification UniMod:21 with mass delta 79.9663 at STY will be considered as variable
Modification UniMod:121 with mass delta 114.043 at K will be considered as variable
Neural networks will be used for peak selection
A spectral library will be created from the DIA runs and used to reanalyse them; .quant files will only be saved to disk during the first step
Highly heuristic protein grouping will be used, to reduce the number of protein groups obtained; this mode is recommended for benchmarking protein ID numbers; use with caution for anything else
When generating a spectral library, in silico predicted spectra will be retained if deemed more reliable than experimental ones
Interference removal from fragment elution curves disabled
Normalisation disabled
Mass accuracy will be fixed to 1.5e-05 (MS2) and 1.5e-05 (MS1)
Exclusion of fragments shared between heavy and light peptides from quantification is not supported in FASTA digest mode - disabled; to enable, generate an in silico predicted spectral library and analyse with this library
The following variable modifications will be scored: UniMod:1 UniMod:21 UniMod:121
WARNING: double-pass mode is incompatible with PTM scoring, turned off
DIA-NN will discard peptides obtained using in silico cuts after the following modifications: UniMod:121,

2 files will be processed
[0:00] Loading FASTA M:\our fuckin data\fasta_files\uniprot_mus_musculus_filtered_reviewed.fasta
[0:28] Processing FASTA
[5:58] Assembling elution groups
[11:06] 46170443 precursors generated
[11:06] Gene names missing for some isoforms
[11:06] Library contains 17066 proteins, and 16696 genes
[11:21] Encoding peptides for spectra and RTs prediction
[15:39] Predicting spectra and IMs

Total RAM usage of the machine is 31% (37 GB), and CPU usage stays around 13%.

Any idea what I am doing wrong?

  2. I was also trying to identify proteins from an unknown bacterium on a similar machine. I loaded a FASTA file produced by concatenating all bacterial FASTA files from UniProt (14 GB in total). DIA-NN starts, loads the FASTA, runs for half a day, and then the whole machine hangs. Should I perhaps load each small FASTA file in a separate job?

Regards, George

vdemichev commented 2 years ago

Hi George,

Yes, in silico prediction for 46 million precursors is slow. Why are you searching for both phospho and ubiquitin? Is the sample enriched for both? In any case, reducing the precursor charge range to 2-3 and restricting the precursor mass range to the range actually acquired in the runs will speed things up. NOT enabling M(ox) will also help. For phospho you'd probably want a maximum of 3 variable modifications, not 2.

Yes, a 14 GB FASTA is probably not a good idea to search against :) It is just too large to predict in silico. For huge databases, I would suggest trying FragPipe, which is integrated with DIA-NN.
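To illustrate why those settings matter, here is a rough counting sketch. It assumes each candidate site carries at most one modification and ignores the precursor m/z filter, so it is a simplified model and not how DIA-NN enumerates precursors. The function names and the example peptide are hypothetical.

```python
from math import comb


def modified_forms(n_sites, max_var_mods):
    """Number of ways to place 0..max_var_mods modifications on n_sites."""
    return sum(comb(n_sites, k) for k in range(max_var_mods + 1))


def precursor_variants(n_sites, max_var_mods, min_charge, max_charge):
    """Modified forms multiplied by the number of charge states."""
    n_charges = max_charge - min_charge + 1
    return modified_forms(n_sites, max_var_mods) * n_charges


if __name__ == "__main__":
    # Hypothetical peptide with 6 modifiable sites (M, S/T/Y, K, N-terminus),
    # up to 2 variable modifications, charges 1-5 as in the quoted command.
    print(precursor_variants(6, 2, 1, 5))   # 22 forms x 5 charges = 110
    # Same peptide with only 3 modifiable sites left and charges 2-3.
    print(precursor_variants(3, 2, 2, 3))   # 7 forms x 2 charges = 14
```

Under these assumptions a single peptide drops from 110 candidate precursors to 14, which shows how the total can reach tens of millions across a whole proteome.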

Best, Vadim

santoshdbhosale commented 2 years ago

Thank you Vadim :)

GeorgeCMarinescu commented 2 years ago

Thank you, Vadim.

I followed your advice and removed both phospho and ubiquitin, and I identified 2076 and 2049 proteins, respectively, in the 2 runs (90-minute gradient, 5 µg of peptides each, trypsin digest of total protein from mouse myocyte whole cells). Total DIA-NN runtime was 97 minutes at log level 5 (I like to see what's going on!). Next I'll try a longer gradient and then high-pH fractionation. I run SWATH on a TripleTOF 5600+ with 64 variable windows over a range of 400-1250 m/z. I will also try GFP. Is there an easy way to combine all the resulting libraries into one?

Regards, George

vdemichev commented 1 year ago

Hi George,

You can load multiple libraries into DIA-NN by passing multiple --lib options, but you need to make sure that the RT scale is the same in all of them.
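A minimal sketch of such an invocation, again as a Python helper that only builds the argument list. The library and run file names are hypothetical, and the helper does not check the RT scale, which has to be verified separately.

```python
def multi_library_cmd(runs, libraries, report, threads=8):
    """Build a DIA-NN argument list that loads several libraries at once."""
    if not libraries:
        raise ValueError("at least one library is required")
    cmd = ["diann.exe"]
    for run in runs:
        cmd += ["--f", run]        # one --f per raw file
    for lib in libraries:
        cmd += ["--lib", lib]      # one --lib per library, as described above
    cmd += ["--threads", str(threads), "--out", report]
    return cmd


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(multi_library_cmd(
        ["run1.wiff", "run2.wiff"],
        ["short_gradient.tsv", "long_gradient.tsv"],
        "report.tsv",
    ))
```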

Best, Vadim