Reg: Pipeline crashes and enhancements

harish0201 commented 3 years ago

Hi!

Thank you for developing this tool! It solves a lot of my issues to be honest and gives me consistently better results.

I've been trying out dadaist2 for the past few days and have had some issues running it since upgrading to v0.73.

In the meantime I have pinned version 0.4, since I'm running in the qiime2 environment for the time being, though I should be upgrading shortly.

Using the new parameter --skip-qc: the pipeline does not seem to detect that the folder structure (the one created when running with Fastp) has not been created, and it fails with the following errors:

[2021-03-06 09:47:51] (1/3) Processing sample1: skip

[2021-03-06 09:47:51] Skipping QC:sample1

[2021-03-06 09:47:52] (2/3) Processing sample2: skip

[2021-03-06 09:47:52] Skipping QC:sample2

[2021-03-06 09:47:53] (3/3) Processing dadaist2.log: skip

[2021-03-06 09:47:53] Skipping QC:dadaist2.log

Use of uninitialized value $from in string eq at /data/miniconda3/lib/5.26.2/File/Copy.pm line 64.

Use of uninitialized value $from in -d at /data/miniconda3/lib/5.26.2/File/Copy.pm line 96.

Use of uninitialized value $_[0] in substitution (s///) at /data/miniconda3/lib/5.26.2/File/Basename.pm line 341.

fileparse(): need a valid pathname at /data/miniconda3/lib/5.26.2/File/Copy.pm line 51.

Dadaist2 execution finished (16.00s)

Trimming Primers: The seqfu-based trimmer is somewhat slow, so I've been handing off data trimmed with cutadapt for better runtimes (even under R). I believe this is because cutadapt under python3 runs multi-threaded, judging by htop during the run, whereas fu-primers uses a single thread for some reason on my ThinkPad.
Enhancement for Error Detection: Instead of picking the first sample and using nreads.learn reads from it to learn error rates, why not subsample reads from all the samples, totaling nreads.learn? Ideally this would be more robust, since the error profile would reflect every sample rather than just one.
Enhancement for Taxonomy assignments: Could GTDB be offered as a database option for dada2's classifier? I checked 0.73 and it's available for DECIPHER, which is super lightweight (I don't even hit swap on my 8GB RAM), but the dada2 classifier is somewhat more lenient, so it assigns a fair number more sequences (though it's expected to hit swap).
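
To make the error-learning suggestion concrete, here is a minimal Python sketch of splitting a total read budget across all samples in proportion to their size, then drawing random read indices per sample. It is only an illustration of the idea, not dadaist2's code: `per_sample_quota`, `pick_read_indices` and the read counts are hypothetical, and `nreads_learn` stands in for the nreads.learn value mentioned above.

```python
import random


def per_sample_quota(read_counts, nreads_learn):
    """Split a total budget of reads across samples, proportionally to size."""
    total = sum(read_counts.values())
    if total <= nreads_learn:
        # Fewer reads available than requested: use everything.
        return dict(read_counts)
    # Floor of each sample's proportional share.
    quota = {s: n * nreads_learn // total for s, n in read_counts.items()}
    # Hand the reads lost to rounding to the largest fractional remainders.
    leftover = nreads_learn - sum(quota.values())
    by_remainder = sorted(
        read_counts,
        key=lambda s: (read_counts[s] * nreads_learn) % total,
        reverse=True,
    )
    for s in by_remainder[:leftover]:
        quota[s] += 1
    return quota


def pick_read_indices(read_counts, nreads_learn, seed=0):
    """Choose which reads (0-based positions) to take from each sample."""
    rng = random.Random(seed)  # fixed seed keeps the subsample reproducible
    quota = per_sample_quota(read_counts, nreads_learn)
    return {
        s: sorted(rng.sample(range(read_counts[s]), quota[s]))
        for s in read_counts
    }


# Hypothetical read counts for three samples.
counts = {"sample1": 60000, "sample2": 30000, "sample3": 10000}
quota = per_sample_quota(counts, 10000)
picked = pick_read_indices(counts, 10000)
```

Proportional allocation keeps deep samples from being under-represented; an equal split per sample would be an equally reasonable choice if the goal is to weight every sample the same.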

telatin commented 3 years ago

Hi there, and many thanks for the feedback. At the moment --skip-qc assumes that QC was done previously and produced the same folder structure, but it's planned to make it more general soon. Thanks also for the enhancement requests: there is major development underway now, so it's the best time for this :)
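
As a rough illustration of the kind of check that would avoid the crash in the log, the Python sketch below only considers FASTQ files as samples (so a stray dadaist2.log is ignored) and verifies that each expected prior-QC file exists before anything is copied. This is not how dadaist2 (which is written in Perl) does it: `plan_skip_qc`, the suffix list, and the assumption that QC output keeps the input file names are all hypothetical.

```python
from pathlib import Path

# Assumed set of FASTQ extensions; adjust to the naming actually in use.
FASTQ_SUFFIXES = (".fastq", ".fq", ".fastq.gz", ".fq.gz")


def is_fastq(path):
    """True if the file name ends with one of the FASTQ extensions."""
    return path.name.endswith(FASTQ_SUFFIXES)


def plan_skip_qc(input_dir, qc_dir):
    """Return (to_copy, missing) lists of expected prior-QC files.

    Assumes, hypothetically, that an earlier QC run wrote one file per
    input file into qc_dir under the same file name.
    """
    to_copy, missing = [], []
    for src in sorted(Path(input_dir).iterdir()):
        if not src.is_file() or not is_fastq(src):
            continue  # e.g. a log file in the input folder is not a sample
        expected = Path(qc_dir) / src.name
        if expected.is_file():
            to_copy.append(expected)
        else:
            missing.append(expected)
    return to_copy, missing
```

A caller could stop with a clear message when `missing` is non-empty, instead of passing an undefined path on to the copy step.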

telatin commented 3 years ago

QC relates to #8

telatin commented 3 years ago

RE number 2: cutadapt is also integrated now. RE number 4: I added more databases, including GTDB for DADA2, in the last update; it will be on Conda (0.7.6) next week.

harish0201 commented 3 years ago

That's great!

Glad to know that!

quadram-institute-bioscience / dadaist2

Reg: Pipeline crashes and enhancements #9