mycobactopia-org / MTBseq-nf

MTBSeq made simple and easy using Nextflow and nf-core standard.
https://doi.org/10.5281/zenodo.5498063
MIT License

Modifications after extensive iterations on Virtual Machine #86

Closed. Mxrcon closed 4 months ago

Mxrcon commented 4 months ago

Hey there :wave:

The overall code was almost ready to use; it wasn't necessary to tweak the modules, only the module configuration and the cohort analysis workflow.

Interestingly, the parallel run wasn't well optimized, so it took the same time as the normal run.

Tasks done:

  1. Add specific process names to modules.config

  2. Modify genome name handling in the cohort workflow

  3. Generate a new workflow:

    MTBSEQ_NF:QC:FASTQC
    MTBSEQ_NF:PARALLEL_ANALYSIS:PER_SAMPLE_ANALYSIS:TBBWA
    MTBSEQ_NF:PARALLEL_ANALYSIS:PER_SAMPLE_ANALYSIS:TBREFINE
    MTBSEQ_NF:PARALLEL_ANALYSIS:PER_SAMPLE_ANALYSIS:TBPILE
    MTBSEQ_NF:PARALLEL_ANALYSIS:PER_SAMPLE_ANALYSIS:TBLIST
    MTBSEQ_NF:PARALLEL_ANALYSIS:PER_SAMPLE_ANALYSIS:TBVARIANTS
    MTBSEQ_NF:PARALLEL_ANALYSIS:PER_SAMPLE_ANALYSIS:TBSTATS
    MTBSEQ_NF:PARALLEL_ANALYSIS:PER_SAMPLE_ANALYSIS:TBSTRAINS
    MTBSEQ_NF:PARALLEL_ANALYSIS:COHORT_ANALYSIS:TBJOIN
    MTBSEQ_NF:PARALLEL_ANALYSIS:COHORT_ANALYSIS:TBAMEND
    MTBSEQ_NF:PARALLEL_ANALYSIS:COHORT_ANALYSIS:TBGROUPS
    MTBSEQ_NF:REPORT:MULTIQC

    For this, I think keeping the per_sample and cohort split is important to make clear which processes run for each sample and which are joint analyses, even though tbstats and tbstrains join all samples before the cohort analysis.

  4. Implement a baseline optimization before bigger benchmarks:

For this, I've drawn the following conclusions from a first benchmark with 10 samples:

  1. tbfull should be run with all the resources so MTBseq can use its own parallelization.
  2. We can split the processes into two types: the ones that benefit from multi-threading and the ones that don't.
     Processes that can be improved by multi-threading: TBBWA, TBFULL and TBREFINE.
     Of the single-threaded processes, some require a lot of RAM: TBLIST, TBJOIN and TBSTATS.

TBJOIN in particular is a process that scales directly with sample size; in an older benchmark with 1000 samples it reached 500 GB of RAM usage.
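
To make the split above concrete, here is a rough sketch of how it could look with `withName` selectors in modules.config. This is not the configuration from this PR: every core count and memory value is a placeholder I made up, and the 16-core machine size is assumed. If TBJOIN memory really grows linearly, the older benchmark would suggest roughly 0.5 GB per sample (500 GB for 1000 samples), but that is only an estimate.

```groovy
// Hypothetical resource sketch; all numbers are placeholders, not benchmark results.
process {

    // Steps that benefit from multi-threading: give them several cores.
    withName: 'TBBWA|TBREFINE' {
        cpus   = 8
        memory = 16.GB
    }

    // Single-threaded but memory-hungry steps: one core, more RAM.
    withName: 'TBLIST|TBSTATS' {
        cpus   = 1
        memory = 32.GB
    }

    // TBJOIN scales with the number of samples, so retry with more memory on failure.
    withName: 'TBJOIN' {
        cpus          = 1
        memory        = { 64.GB * task.attempt }
        errorStrategy = 'retry'
        maxRetries    = 2
    }

    // TBFULL relies on MTBseq's own parallelization, so it gets the whole machine
    // (assumed to have 16 cores here).
    withName: 'TBFULL' {
        cpus = 16
    }
}
```

The retry on TBJOIN is one possible way to handle a process whose memory depends on cohort size without hard-coding the worst case; a fixed value sized for the largest expected cohort would also work.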

If you have any comments or suggestions about any code implemented by this PR, I'm available to discuss it. The information used for the benchmark is in the "dev-iteration-5-parallel-benchmark-358" Tower run.

Kindly, Davi

abhi18av commented 4 months ago

Thanks @Mxrcon, please do let me know when it is ready to be reviewed and I'll take a deeper look :)