vdemichev / DiaNN

DIA-NN - a universal automated software suite for DIA proteomics data analysis.

Suggestions for Faster Data Processing in DIA-NN #775

Closed Aaron-Amunix closed 1 year ago

Aaron-Amunix commented 1 year ago

Hi Vadim,

My lab recently acquired a TIMS-TOF HT and has since been scaling up our experimental workflows, and we have hit a major bottleneck in data processing. As our experiments grow, with a larger search space and ion mobility data, processing times have increased greatly. Our current workflow takes over an hour per file per pass, and we analyze ~30 files per experiment. At this point, processing takes as long as the LC-MS acquisition itself (~120 min gradient per sample). I've included the parameters we use for our library-free workflow below. This is not a traditional proteomics experiment: each "protein" in my FASTA file is a ~15-20 AA peptide (~500k of them), and I don't want any cleavage (--cut "").

diann.exe --f "files" --lib "" --threads 62 --verbose 1 --out "output.tsv" --qvalue 0.01 --matrices --out-lib "lib.tsv" --gen-spec-lib --predictor --fasta "fasta" --fasta-search --min-fr-mz 200 --max-fr-mz 1800 --cut K,R --missed-cleavages 5 --min-pep-len 5 --max-pep-len 20 --min-pr-mz 300 --max-pr-mz 1800 --min-pr-charge 2 --max-pr-charge 6 --double-search --no-prot-inf --reanalyse --relaxed-prot-inf --smart-profiling --peak-center --no-ifs-removal --cut ""

I have a few questions.

1.) How can I speed up processing time without compromising results?
2.) Does DIA-NN allow for a GPU based processing workflow similar to Bruker's TIMS-DIA-NN?
3.) What would the optimal setup look like for this sort of analysis? I have access to a state-of-the-art Linux-based cloud server with plenty of GPUs and CPUs, and I also have a local 32-core Threadripper processing PC where I currently run my analysis; the processing times above are from that machine.

Thanks in advance for the suggestions and I really appreciate all the work you've done on DIA-NN. I'm very excited to try the new version you are working on when the next release comes out!

-Aaron

michaelsteidel86 commented 1 year ago

Switching from double-pass mode to single-pass mode should considerably speed up your analysis without a strong negative impact. At least for "standard" proteomics samples, I've never observed a clear benefit from double-pass.

Have you generated your library "offline" and saved it for reuse, so that it doesn't need to be recreated each time you start an analysis?

Are you sure you need to cover up to 6 charges? For "standard" proteomics samples, 2-3 works equally well in most cases.

Are your peptides actually "non-tryptic"? I think the DIA-NN prediction is optimized for tryptic peptides.

Why have you specified cut K,R and 5 missed cleavages if you're not using trypsin at all?

Aaron-Amunix commented 1 year ago

Thanks for the reply Michael,

See my responses below.

Single Pass vs Double Pass: I'll give this a try and see if the data is comparable. I've compared the first pass intermediate-data results vs the second pass final results and noticed that the data was much more sparse but I've never tried single pass only with a complete dataset.

Offline Library Generation: Good idea. I'll implement this, especially as my library gets larger, but it doesn't help with the long processing time per file, which is the main bottleneck.

Charge States: I usually get charge states of +2 to +4 with my peptides, so I can alter this accordingly. Thank you for the input.

Non-Tryptic Peptides: Yes, I do have a lot of non-tryptic peptides (ending in LFQ/other c-termini) but have actually been making tryptic peptides (ending in K or R) and the data quality has been a lot better, this might be why...I'm actually starting another experiment where the c-terminus of each peptide is "EA", hopefully it'll work well enough.

Why have you specified cut KR and 5 missed cleavages: With the GUI, I don't have the ability to remove this, even with the --cut "" command. I can remove the missed cleavages and see if that helps speed things up, thanks for this. I'm pretty sure the --cut "" overrides cut KR.
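A quick way to see which options appear more than once in a long command is to tokenize it and group the values by flag. The sketch below does only that: the helper name `repeated_flags` is hypothetical, the command is an abbreviated version of the one posted above, and whether DIA-NN actually honors the last `--cut` is Aaron's expectation here, not something this code checks.

```python
import shlex
from collections import defaultdict
from typing import Dict, List, Optional


def repeated_flags(command: str) -> Dict[str, List[Optional[str]]]:
    """Map every flag that occurs more than once to its values, in order."""
    tokens = shlex.split(command)
    seen = defaultdict(list)
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok.startswith("--"):
            # Treat the next token as this flag's value unless it is a flag itself.
            has_value = i + 1 < len(tokens) and not tokens[i + 1].startswith("--")
            seen[tok].append(tokens[i + 1] if has_value else None)
            i += 2 if has_value else 1
        else:
            i += 1
    return {flag: values for flag, values in seen.items() if len(values) > 1}


# Abbreviated form of the command posted above.
cmd = 'diann.exe --f "files" --cut K,R --missed-cleavages 5 --double-search --cut ""'
print(repeated_flags(cmd))  # {'--cut': ['K,R', '']}
```

Note that `shlex.split` uses POSIX quoting rules, so Windows paths containing backslashes would need to be handled separately.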

michaelsteidel86 commented 1 year ago

Hi Aaron,

I am not referring to single- vs. double-pass (MBR) analyses. You should stick to the main reports, as the second-pass MBR run is very fast anyway and provides higher data completeness than the first-pass report.

Your option "--double-search" indicates (if I am not wrong) that you have selected the time-consuming double-pass neural network classifier. Try the default single-pass mode in the GUI instead. If you use the command-line tool, remove this option.
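For command-line use, one way to test this suggestion is to drop the option from the argument list and time both variants on a single file. This is a minimal sketch: `drop_flags` and `time_command` are hypothetical helpers, the file names are placeholders, and the timing is demonstrated on a trivial stand-in command so the example runs without DIA-NN installed.

```python
import subprocess
import sys
import time


def drop_flags(args, unwanted):
    """Return a copy of args without the given value-less flags."""
    return [a for a in args if a not in unwanted]


def time_command(args):
    """Run a command to completion and return its wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(args, check=True)
    return time.perf_counter() - start


# Abbreviated version of the command from this thread (placeholder file names).
double_pass = ["diann.exe", "--f", "one_file.d", "--fasta", "fasta",
               "--fasta-search", "--predictor", "--double-search"]
single_pass = drop_flags(double_pass, {"--double-search"})
print(single_pass)

# Stand-in command; on a real system, time double_pass and single_pass
# on the same raw file and compare both run time and identifications.
elapsed = time_command([sys.executable, "-c", "pass"])
print(f"stand-in command took {elapsed:.2f} s")
```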

michaelsteidel86 commented 1 year ago

Note: in most cases, a maximum of 1 missed cleavage is sufficient. I have never seen a benefit from >1.

vdemichev commented 1 year ago

Some comments in addition to Michael's:

  1. Create a DIA-based spectral library from a subset of your experiment (there is typically no point at all in going beyond several hundred runs randomly selected from the experiment for this), then analyse the whole experiment using this library without MBR. Use Library generation set to 'IDs, RTs and IMs profiling'.
  2. No, we don't see a benefit in this.
  3. Threadripper should work nicely.
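
The two-stage workflow in point 1 could be scripted roughly as follows. This is a sketch under stated assumptions, not a confirmed recipe: the file names, subset size, and helper names are hypothetical, and mapping the GUI's 'IDs, RTs and IMs profiling' setting to `--rt-profiling` is an assumption to check against the command line the GUI generates. Other options (mass ranges, charges, and so on) would be carried over from your own command.

```python
import random


def pick_subset(runs, n, seed=0):
    """Randomly choose up to n runs for library building (reproducible via seed)."""
    rng = random.Random(seed)
    return sorted(rng.sample(runs, min(n, len(runs))))


def diann_args(runs, extra):
    """Assemble an argument list: one --f per raw file, then the other options."""
    args = ["diann.exe"]
    for run in runs:
        args += ["--f", run]
    return args + extra


runs = [f"run_{i:02d}.d" for i in range(1, 31)]  # hypothetical file names
subset = pick_subset(runs, n=10)                 # hypothetical subset size

# Stage 1: library-free search of the subset, writing a DIA-based library.
# --rt-profiling is ASSUMED to match 'IDs, RTs and IMs profiling' in the GUI.
stage1 = diann_args(subset, [
    "--lib", "", "--fasta", "peptides.fasta", "--fasta-search", "--predictor",
    "--gen-spec-lib", "--out-lib", "empirical_lib.tsv", "--rt-profiling",
    "--out", "subset_report.tsv",
])

# Stage 2: whole experiment against that library, without MBR (no --reanalyse).
stage2 = diann_args(runs, [
    "--lib", "empirical_lib.tsv", "--qvalue", "0.01", "--matrices",
    "--out", "report.tsv",
])

print(len(subset), "runs for the library,", len(runs), "runs in the final search")
```

Each list can be passed to `subprocess.run` once the paths and options match your setup.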

Aaron-Amunix commented 1 year ago

Thanks, Vadim, for the suggestions. We are actually analyzing the exact same peptides in multiple experiments (more of a high-throughput screen). To speed things up, I can also try to create a sample-specific DIA assay library and use it for each experiment containing that peptide library (like we used to do routinely for old-school DIA-MS). That will save a ton of time but kind of defeats the purpose of library-free DIA-MS. I'll try both approaches and see which performs best across experiments.

Really appreciate the comments from both of you. I'll close the issue and will reach out if I have more questions/concerns.