Closed sgaleraalq closed 1 year ago
Hi there, thanks for the suggestion.
I do not think it is currently possible to use the config to parallelize those steps. You can, however, use the "run" column in the samplesheet to split samples into batches and process them in parallel (see https://nf-co.re/ampliseq/2.4.1/usage#samplesheet-input). But that will also split the samples earlier, when calculating the error model, which is not recommended if the samples are from the same sequencing run; see here (scroll down to the figure) for how that works. On the other hand, if you do not split into too-small batches, the results should be almost or truly identical. If you use non-default settings for --sample_inference, do not split artificially, because there I would expect trouble (those other settings require information from all samples).
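For illustration, a samplesheet split into batches via the "run" column might look like this (tab-separated; the sample names and file paths are hypothetical, and the usage docs linked above are authoritative for the column layout):

```
sampleID	forwardReads	reverseReads	run
sample1	sample1_R1.fastq.gz	sample1_R2.fastq.gz	batch1
sample2	sample2_R1.fastq.gz	sample2_R2.fastq.gz	batch1
sample3	sample3_R1.fastq.gz	sample3_R2.fastq.gz	batch2
```

Samples sharing a "run" value are denoised together, so keeping samples from the same sequencing run in the same batch preserves a shared error model.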
You can, though, allocate more CPUs to processes via configs; resource adjustments are described here. That should also help considerably.
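As a sketch, such a custom config (passed to Nextflow with -c) could raise the CPU allocation for the denoising step; the process name DADA2_DENOISING is one of the pipeline's process names, but the values here are illustrative only:

```groovy
// custom.config -- illustrative values; tune to your cluster's limits
process {
    withName:DADA2_DENOISING {
        cpus   = 16
        memory = 64.GB
    }
}
```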
About adding such a possibility: that should work, depending on denoising settings. The default setting, --sample_inference "independent", handles samples independently (at least the description says so), and processing in parallel should work. But the other settings (pseudo/pooled), which I highly recommend, won't work when splitting, as far as I know.
If you have any idea or information how that might work, let me know, I am happy to learn!
Thank you very much for your fast answer! I believe that --sample_inference "independent" will make for the fastest processing of them all, but it still takes some time to run. I have tried with the maximum number of CPUs available, but since my dataset is quite huge (~300 samples), it takes a lot of time.
Another thing that could be useful would be a --skip option for the DADA2 steps, so that if parallelization is not available, one can do the denoising in a separate workflow and then feed it back into the nf-core pipeline. But that is just a suggestion :)
--sample_inference "independent" will make the fastest processes
That is correct, other methods will increase runtime by a lot.
my dataset is quite huge (~300 samples) it takes a lot of time
Depending on your data and CPU resources, it might take a night/day or so, but should be manageable. Taxonomic classification might also be time-consuming, depending on the number of ASVs. But those can be filtered by length, prevalence, or abundance if needed.
Another thing that would be useful could be to add a --skip for dada2 pipelines so that if parallelization is not available, one can make the denoise on a separate workflow and then add it to nf-core pipeline.
I am not very convinced, because the DADA2 part is sort of important in the pipeline I think.
Then I must be doing something incorrect, because it takes at least 3 days for me to finish the DENOISING step.
I am not very convinced, because the DADA2 part is sort of important in the pipeline I think.
I was just suggesting it as an optional parameter, in case you don't want to use it! But I see your point that it is pretty important.
3 days is indeed a little much, but if you have an unusually high sequencing depth and very diverse samples (many different taxa per sample but also between samples), it might take long. Let me know whether you give it more time and whether the process progresses. You also might check from time to time whether the job is indeed running and not stalling because of some sort of error.
@sgaleraalq any news?
It is still taking very long for my samples. I use these parameters, which I think should be enough for 300 samples; what do you think? Maybe I am not allocating enough RAM?
```groovy
process {
    withName:DADA2_ERR {
        memory = 115.GB
        cpus   = 20
        time   = 167.h
    }
    withName:DADA2_RMCHIMERA {
        memory = 115.GB
        cpus   = 20
        time   = 167.h
    }
    withName:DADA2_DENOISING {
        memory = 115.GB
        cpus   = 20
        time   = 167.h
    }
}
```
My cluster always kills the process because it exceeds the maximum time permitted. I really do not know what to do to make it faster.
Maybe it would be possible to do more aggressive quality filtering to reduce the number of bases that go into the three processes you list here.
Some possibilities (not exhaustive):
You can check the read count report to see where many reads are lost and whether that is fine for you. If that is not available, check results/cutadapt/cutadapt_summary.tsv and results/dada2/QC/*qual_stats.pdf.
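Stricter truncation/filtering could be set via pipeline parameters, e.g. in a params config; the parameter names below are my reading of the ampliseq docs, and the values are made-up placeholders that should be tuned against your own quality profiles:

```groovy
// illustrative only -- verify names and values against the ampliseq parameter docs
params {
    trunclenf = 230   // truncate forward reads, dropping low-quality tails
    trunclenr = 180   // truncate reverse reads
    max_ee    = 2     // discard reads with more than 2 expected errors
}
```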
I am closing this issue because parallelization at the Nextflow level should not be needed for the denoising/chimera-removal steps (both support multi-CPU usage), and the steps mentioned above can be used to subset large data sets if required.
@sgaleraalq please let me know in case there are further problems or there is new information on this topic.
Description of feature
Hi guys!
I was wondering whether there is any command or option you can add in the .config file to parallelize the DADA2_DENOISE and RM_CHIMERAS steps, i.e. send every sample to a different node and merge all of them after the processes have finished. I have seen that some processes, like FILTNTRIM, do that on a cluster, and if it could be implemented in those pipeline steps it would speed up the running time of the workflow significantly.
Thank you very much!