Modifications after extensive iterations on a virtual machine

Hey there :wave:

The overall code was almost ready to use. The modules themselves did not need any tweaks, only the module configuration and the cohort analysis workflow did.

Interestingly, the parallel run wasn't well optimized, so it took the same time as the normal run.

Tasks done:

Add specific process names to modules.config
Modify genome name handling in the cohort workflow

Generate a new workflow:

MTBSEQ_NF:QC:FASTQC
MTBSEQ_NF:PARALLEL_ANALYSIS:PER_SAMPLE_ANALYSIS:TBBWA
MTBSEQ_NF:PARALLEL_ANALYSIS:PER_SAMPLE_ANALYSIS:TBREFINE
MTBSEQ_NF:PARALLEL_ANALYSIS:PER_SAMPLE_ANALYSIS:TBPILE
MTBSEQ_NF:PARALLEL_ANALYSIS:PER_SAMPLE_ANALYSIS:TBLIST
MTBSEQ_NF:PARALLEL_ANALYSIS:PER_SAMPLE_ANALYSIS:TBVARIANTS
MTBSEQ_NF:PARALLEL_ANALYSIS:PER_SAMPLE_ANALYSIS:TBSTATS
MTBSEQ_NF:PARALLEL_ANALYSIS:PER_SAMPLE_ANALYSIS:TBSTRAINS
MTBSEQ_NF:PARALLEL_ANALYSIS:COHORT_ANALYSIS:TBJOIN
MTBSEQ_NF:PARALLEL_ANALYSIS:COHORT_ANALYSIS:TBAMEND
MTBSEQ_NF:PARALLEL_ANALYSIS:COHORT_ANALYSIS:TBGROUPS
MTBSEQ_NF:REPORT:MULTIQC

For this, I think that keeping the PER_SAMPLE_ANALYSIS and COHORT_ANALYSIS scopes is important to show which processes run for each sample and which are joint analyses, even though TBSTATS and TBSTRAINS join all samples before the cohort analysis.
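To illustrate how the process names listed above can be targeted from modules.config, here is a minimal sketch. The selectors use the names from the list, but `params.outdir` and the output paths are assumptions made for the example and may not match what this PR actually sets.

```groovy
// Sketch only: params.outdir and the publishDir paths are hypothetical.
process {

    // Fully qualified name: matches only TBBWA inside the per-sample scope
    withName: 'MTBSEQ_NF:PARALLEL_ANALYSIS:PER_SAMPLE_ANALYSIS:TBBWA' {
        publishDir = [ path: { "${params.outdir}/per_sample/tbbwa" }, mode: 'copy' ]
    }

    // Regular expression: matches every process under COHORT_ANALYSIS
    // (TBJOIN, TBAMEND and TBGROUPS in the list above)
    withName: '.*:COHORT_ANALYSIS:.*' {
        publishDir = [ path: { "${params.outdir}/cohort" }, mode: 'copy' ]
    }
}
```

Keeping the scope in the name means one pattern can configure all per-sample or all cohort processes at once.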

Implement a base optimization before bigger benchmarks:

For this, I took the following information from a first benchmark with 10 samples:

TBFULL should be run with all the resources so MTBseq can use its own parallelization.
We can split the processes into two types: the ones that benefit from multithreading and the ones that don't.
Processes that can be improved by multithreading: TBBWA, TBFULL and TBREFINE.
Among the single-threaded processes, some require a lot of RAM: TBLIST, TBJOIN and TBSTATS.

TBJOIN is a process that scales directly with sample size: in an older benchmark with 1000 samples it reached 500 GB of RAM usage.
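The grouping above could be expressed in the configuration roughly as follows. This is a sketch under assumptions, not the values used in this PR: all CPU and memory numbers are placeholders, `params.max_cpus` and `params.max_memory` are hypothetical parameters assumed to be defined elsewhere, and the retry on TBJOIN is only one possible way to handle its growth with cohort size.

```groovy
// Sketch only: every number below is a placeholder, not a benchmarked value.
process {

    // TBFULL gets everything so MTBseq can use its own parallelization.
    // params.max_cpus / params.max_memory are hypothetical parameters.
    withName: 'TBFULL' {
        cpus   = params.max_cpus
        memory = params.max_memory
    }

    // Steps that benefit from multithreading
    withName: 'TBBWA|TBREFINE' {
        cpus   = 8
        memory = 16.GB
    }

    // Single-threaded steps that need a lot of RAM
    withName: 'TBLIST|TBSTATS' {
        cpus   = 1
        memory = 32.GB
    }

    // TBJOIN memory grows with the number of samples, so ask for more
    // on each retry instead of fixing one value up front.
    withName: 'TBJOIN' {
        cpus          = 1
        memory        = { 64.GB * task.attempt }
        errorStrategy = 'retry'
        maxRetries    = 2
    }
}
```

The `withName` selectors here match on the simple process name, so they apply regardless of which workflow scope the process is called from.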

If you have any comments or suggestions about the code implemented in this PR, I'm available to discuss them. The information used for the benchmark is in the "dev-iteration-5-parallel-benchmark-358" Tower run.

Kindly, Davi

mycobactopia-org / MTBseq-nf

Modifications after extensive iterations on a virtual machine #86