nf-core / ampliseq

Amplicon sequencing analysis workflow using DADA2 and QIIME2
https://nf-co.re/ampliseq
MIT License

Parallelization processes in DADA2_DENOISE and RM_CHIMERAS #523

Closed sgaleraalq closed 1 year ago

sgaleraalq commented 1 year ago

Description of feature

Hi guys!

I was wondering whether there is any command or option you can add to the .config file to parallelize the DADA2_DENOISE and RM_CHIMERAS steps, i.e. send every sample to a different node and merge the results after they have finished. I have seen that some processes, like FILTNTRIM, do that on a cluster, and if it could be implemented in those pipeline steps it would speed up the running time of the workflow significantly.

Thank you very much!

d4straub commented 1 year ago

Hi there, thanks for the suggestion.

I do not think it is currently possible to parallelize those steps via the config. You can, however, use the "run" column in the samplesheet to split samples into batches and process them in parallel (see https://nf-co.re/ampliseq/2.4.1/usage#samplesheet-input). But that will also split the samples earlier, when the error model is calculated, and that is not recommended if the samples are from the same sequencing run; see here (scroll down to the figure) for how that works. On the other hand, if you do not split into too-small batches, the results should be almost or truly identical. If you use non-default settings for --sample_inference, do not split artificially, because there I would expect trouble (those other settings require information from all samples). You can, though, allocate more CPUs to processes with configs; resource adjustments are described here. That should also help considerably.
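For illustration, splitting samples into batches via the "run" column might look like the sketch below. Sample IDs, file paths, and batch names are made up; check the linked usage docs for the exact column names expected by your ampliseq version.

```tsv
sampleID	forwardReads	reverseReads	run
S01	S01_R1.fastq.gz	S01_R2.fastq.gz	batch1
S02	S02_R1.fastq.gz	S02_R2.fastq.gz	batch1
S03	S03_R1.fastq.gz	S03_R2.fastq.gz	batch2
S04	S04_R1.fastq.gz	S04_R2.fastq.gz	batch2
```

Samples sharing a "run" value are denoised together, so each batch can proceed independently on the cluster.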

About adding such a possibility: that should work depending on the denoising settings. The default setting, --sample_inference "independent", handles samples independently (at least the description says so), so processing in parallel should work. But the other settings (pseudo/pooled), which I highly recommend, won't work when splitting, as far as I know.

If you have any idea or information how that might work, let me know, I am happy to learn!

sgaleraalq commented 1 year ago

Thank you very much for your fast answer! I believe that --sample_inference "independent" will be the fastest of them all, but it still takes some time to run. I have tried with the maximum number of CPUs available, but since my dataset is quite large (~300 samples) it takes a lot of time.

Another thing that would be useful could be a --skip option for the DADA2 steps, so that if parallelization is not available, one can do the denoising in a separate workflow and then feed the result into the nf-core pipeline. But that is just a suggestion :)

d4straub commented 1 year ago

--sample_inference "independent" will be the fastest

That is correct, other methods will increase runtime by a lot.

my dataset is quite large (~300 samples) it takes a lot of time

Depending on your data and CPU resources, it might take a night/day or so, but it should be manageable. Taxonomic classification might also be time-consuming, depending on the number of ASVs. But those can be filtered by length, prevalence, or abundance if needed.
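As a sketch, such ASV filtering can be enabled through pipeline parameters in a params file; the names below are taken from the ampliseq parameter docs, but double-check them against your release, and the values are arbitrary examples:

```groovy
// Example only: filter ASVs by length and by prevalence/abundance
params {
    min_len_asv   = 245   // drop ASVs shorter than 245 bp
    max_len_asv   = 255   // drop ASVs longer than 255 bp
    min_frequency = 10    // drop ASVs with fewer than 10 reads in total
    min_samples   = 2     // drop ASVs present in fewer than 2 samples
}
```

Fewer ASVs going into taxonomic classification directly reduces that step's runtime.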

Another thing that would be useful could be a --skip option for the DADA2 steps, so that if parallelization is not available, one can do the denoising in a separate workflow and then feed the result into the nf-core pipeline.

I am not very convinced, because the DADA2 part is rather central to the pipeline, I think.

sgaleraalq commented 1 year ago

Then I must be doing something incorrectly, because it takes at least 3 days for me to finish the DENOISING step.

I am not very convinced, because the DADA2 part is rather central to the pipeline, I think.

I was just suggesting it as an optional parameter, just in case you don't want to use it! But I see your point that it is pretty important.

d4straub commented 1 year ago

3 days is indeed a little much, but if you have unusually high sequencing depth and very diverse samples (many different taxa per sample, but also between samples) it might take long. Let me know whether it finishes if you give it time. You might also check from time to time whether the job is indeed running and not stalling because of some sort of error.

d4straub commented 1 year ago

@sgaleraalq any news?

sgaleraalq commented 1 year ago

It is still taking very long for my samples. I use these parameters, which I think should be enough for 300 samples; what do you think? Maybe I am not allocating enough RAM?

process {
  withName:DADA2_ERR {
    memory = 115.GB
    cpus = 20
    time = 167.h
  }
  withName:DADA2_RMCHIMERA {
    memory = 115.GB
    cpus = 20
    time = 167.h
  }
  withName:DADA2_DENOISING {
    memory = 115.GB
    cpus = 20
    time = 167.h
  }
}

My cluster always kills the process because it exceeds the maximum time permitted. I really do not know how to make it faster.
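One possibly relevant knob: nf-core pipelines also cap per-process resources globally via max-resource parameters, so per-process settings above those caps are silently reduced. A sketch of raising the caps to match the config above (parameter names as in the standard nf-core template; verify against your ampliseq version):

```groovy
// Example only: raise the pipeline-wide resource caps so the
// withName settings above are not clipped back down
params {
    max_cpus   = 20
    max_memory = '115.GB'
    max_time   = '167.h'
}
```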

d4straub commented 1 year ago

Maybe it would be possible to do more aggressive quality filtering to reduce the number of bases that come into the three processes that you list here.

Some possibilities (not exhaustive):

You can check the read count report to see where many reads are lost and whether that is fine for you. If that is not available, check results/cutadapt/cutadapt_summary.tsv and results/dada2/QC/*qual_stats.pdf
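A sketch of what more aggressive quality filtering could look like via pipeline parameters (names as in the ampliseq parameter docs; the values are examples only and should be chosen from your quality profiles):

```groovy
// Example only: truncate reads earlier and filter harder,
// so fewer (and cleaner) bases reach error learning, denoising,
// and chimera removal
params {
    trunclenf = 220   // truncate forward reads at 220 bp
    trunclenr = 180   // truncate reverse reads at 180 bp
    max_ee    = 2     // discard reads with > 2 expected errors
}
```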

d4straub commented 1 year ago

I am closing this issue because parallelization at the Nextflow level should not be needed for the denoising/chimera removal steps (both support multiple CPUs), and the steps mentioned above can be used to subset large datasets if required.

@sgaleraalq please let me know in case there are further problems or there is new information on this topic.