Slow run time and issue with running run_DESeq_stats.R

Anto007 commented 4 years ago

Hi, I'm a new user of samsa2 and just so as to test, I used just two metatranscriptome libraries (approx. 16 Mio x 2 paired-end reads in each of the two libraries) but it it took 2 full days to reach until SUBSYS_AGGREG checkpoint on my High-performance PC (245G available total memory; 16TB available space and no other major computational jobs running at the time of running samsa2 pipeline). I wonder how I might be able to scale up the analysis speed when I'll need to input a much higher number of metatranscriptomic libraries? I think the DIAMOND step is taking the most time (understandably) but I'm curious to know if this rather slow speed of analysis is just encountered by only me and not you or the other users?

Also, I get the below error while run_DESeq_stats.R is called from the master_script_preserving_unmerged.sh.

_[1] "USAGE: $ run_DESeq_stats.R -I working_directory/ -O save.filename" Working directory is /data/samsa2/step_5_output/RefSeq_results/org_results Error in colnames<-(*tmp*, value = "SRR6493775") : attempt to set 'colnames' on an object with less than two dimensions Calls: colnames<- -> colnames<- Execution halted 'Rscript /data/samsa2/R_scripts/run_DESeq_stats.R -I /data/samsa2/step_5_output/RefSeq_results/org_results -O RefSeq_org_DESeq_results.tab -R /data/samsa2/step_2_output/rawcounts.txt' exited with non-zero status 1

This is how the first few lines of one of the files (control_SRR6493783.merged.RefSeq_annot_organism.tsv) at /data/samsa2/step_5_output/RefSeq_results/org_results look like: 14.8100107084 1206281 Bacteroides 14.0013343093 1140414 Bacteroides uniformis 6.71950210668 547306 Ruminococcus bromii 5.89554032774 480194 Bacteroides sp. 3.24016904525 263913 Faecalibacterium prausnitzii 3.06935339037 250000 Ruminococcus sp. 2.79296425627 227488 Bacteroides xylanisolvens 2.71861223975 221432 Eubacterium rectale 2.4190924585 197036 unclassified Bacteroides 2.41829442662 196971 Ruminococcus

This is the content of /data/samsa2/step_2_output/raw_counts.txt SRR6493775.cleaned.forward 13993245 SRR6493783.cleaned.forward 16517196

Any help here would be much appreciated!

transcript commented 4 years ago

Hello, there are certainly a couple of tricks to speed things up, and I suspect I know the reason for your issue with the R script.

Regarding speed: You could try using the -p option on DIAMOND to specify the number of threads to use. The program should auto-detect and use all threads by default, but you can try specifying to see if it's not properly assessing the number.

You could also look at changing the e-value parameter (-e, default is 0.001) to a more stringent value to see if that helps with a speed increase.

Finally, you could look at upping the block size, which is set at -b2.0, but could be upped to -b10 or so.

Regarding the error you got, the issue with R is that it needs to know which files are the control samples, and which ones are the experimental samples, for DESeq comparison purposes. You can fix this by adding "control" and "experimental" as prefixes to the files - although note that you'll need at least 2 samples of each category for DESeq to be able to determine the level of in-group vs. out-group change.

Happy to help with other questions as they arise. You may also want to check out the SAMSA Google Group: https://groups.google.com/u/1/g/samsa-bioinformatics-group?pli=1 . If you send an email there, I'll see it as well.

Anto007 commented 4 years ago

Many thanks for your prompt response and tips. I'll try your suggested parameters to speed up DIAMOND. I did add "control" and "experimental" prefixes to the files but I guess I should have remembered to include at least 2 libraries per category (I just had two libraries in total..darn). The way I'm now adding prefixes to the filenames is when I see an error message regarding the same at the final step of running master_script_preserving_unmerged.sh. I'm assuming this is the only way to do it for now? Perhaps, future versions of your tool can take care of this within the master script itself. Thanks also for the google groups link; I'm pleased to see a very active user forum there.

transcript / samsa2

Slow run time and issue with running run_DESeq_stats.R #53