quadram-institute-bioscience / dadaist2

Dadaist2 🟨 Highway to R
https://quadram-institute-bioscience.github.io/dadaist2/
MIT License
13 stars 5 forks source link

[BUG] DADA2 ERROR while running my data #18

Closed najibveto closed 8 months ago

najibveto commented 2 years ago

Describe the bug

Hello, sorry to disturb you again. I tried dadaist2 on my data after cleaning it with Trimmomatic. First I checked the data using seqfu:

[image]

Then I made the metadata using the dadaist2-metadata command:

[image]

I ran it as usual:

dadaist2  --max-loss 0.05 -i metagenome/16S/ -o water -m metadata.tsv -d ~/refs/silva_nr_v138_train_set.fa.gz
    ____            __      _      __ ___
   / __ \____ _____/ /___ _(_)____/ /|__ \
  / / / / __ `/ __  / __ `/ / ___/ __/_/ /
 / /_/ / /_/ / /_/ / /_/ / (__  ) /_/ __/
/_____/\__,_/\__,_/\__,_/_/____/\__/____/

1.2.5

[WARNING] Output directory found.
 This is a warning but in future releases this might require to specify --force to proceed.
[2022-08-02 14:13:48] Ready to log in /home/najib/water/dadaist.log
[2022-08-02 14:13:48] dadaist2 1.2.5
[2022-08-02 14:13:48] Taxonomy database found: /home/najib/refs/silva_nr_v138_train_set.fa.gz
[2022-08-02 14:13:48] Parameter: taxonomy-type: dada2
[2022-08-02 14:13:48] Parameter: taxonomy-db: /home/najib/refs/silva_nr_v138_train_set.fa.gz
 * Input directory: metagenome/16S/
 * Output directory: /home/najib/water/
 * Metadata: metadata.tsv
 * Reference database: /home/najib/refs/silva_nr_v138_train_set.fa.gz
 * Threads: 6
 * Temporary directory: /tmp/dadaist2_sJ798a
 * QC strategy: skip
[2022-08-02 14:13:48] QC: Checking quality profile with SeqFu
[2022-08-02 14:13:48] SeqFu quality truncation at (trunc-len-1 and trunc-len-2): 290 - 231
[2022-08-02 14:13:48] Checking dependencies
 * RScript: R scripting front-end version 4.0.5 (2021-03-31)
 * Taxonomy: dadaist2-assigntax 1.1.3
 * assign-taxonomy: dadaist2-assigntax 1.1.3
 * clustalo: 1.2.4
 * dada2 (lib): <pass>
 * exporter: dadaist2-exporter 1.4.0
 * fastp: fastp 0.23.2
 * fasttree: FastTree version 2.1.11 Double precision (No SSE3):
 * fu-primers: fu-primers 1.12.0
[2022-08-02 14:13:54] Temporary directory: /tmp/dadaist2_sJ798a
[2022-08-02 14:13:54] Threads: 6
[2022-08-02 14:13:54] Output directory: /home/najib/water/
[2022-08-02 14:13:54] Checked metadata for autumn
[2022-08-02 14:13:54] Checked metadata for spirng
[2022-08-02 14:13:54] Checked metadata for summer
[2022-08-02 14:13:54] Checked metadata for winter
[2022-08-02 14:13:54] Input directory "metagenome/16S/": 4 found (paired-end)
[2022-08-02 14:13:54] (1/4) Processing autumn: skip
[2022-08-02 14:13:54] Copying input reads for DADA2
[2022-08-02 14:13:54] (2/4) Processing spirng: skip
[2022-08-02 14:13:54] Copying input reads for DADA2
[2022-08-02 14:13:54] (3/4) Processing summer: skip
[2022-08-02 14:13:54] Copying input reads for DADA2
[2022-08-02 14:13:54] (4/4) Processing winter: skip
[2022-08-02 14:13:54] Copying input reads for DADA2
[2022-08-02 14:13:54] Running DADA2...
[2022-08-02 14:13:54] Dada2 script parameters:
 * [1] forward_reads: /tmp/dadaist2_sJ798a/for
 * [2] reverse_reads: /tmp/dadaist2_sJ798a/rev
 * [3] feature_table_output: /tmp/dadaist2_sJ798a/dada2/dada2.tsv
 * [4] stats_output: /tmp/dadaist2_sJ798a/dada2/stats.tsv
 * [5] filt_forward: /tmp/dadaist2_sJ798a/for/filtered
 * [6] filt_reverse: /tmp/dadaist2_sJ798a/rev/filtered
 * [7] truncLenF: 290
 * [8] truncLenR: 231
 * [9] trimLeftF: 0
 * [10] trimLeftR: 0
 * [11] maxEEF: 1
 * [12] maxEER: 1.5
 * [13] truncQ: 10
 * [14] chimeraMethod: consensus
 * [15] minFold: 1
 * [16] threads: 6
 * [17] nreads_learn: 0
 * [18] baseDir: /tmp/dadaist2_sJ798a
 * [19] doPlots: do_plots
 * [20] taxonomyDb: /home/najib/refs/silva_nr_v138_train_set.fa.gz
 * [21] saveRDS: no
 * [22] noMerge: 0
 * [23] processPool: 0
[2022-08-02 14:22:48] DADA2 Finished.
[2022-08-02 14:22:48] Converting dada2 taxonomy output: /tmp/dadaist2_sJ798a/taxonomy.tsv
[2022-08-02 14:22:48] 922 representative sequences found.
DADA2 ERROR:
[2022-08-02 14:22:48] DADA2 filtered too many reads: 4.7926% from total 486266 to 23305
[2022-08-02 14:22:48] Multiple sequence alignment and tree generation
[2022-08-02 14:23:30] Feature tree generated
[2022-08-02 14:23:30] Dadaist finished, output files saved:
 * dada-taxonomy-table: /home/najib/water/taxonomy.txt
 * feature-table: /home/najib/water/feature-table.tsv
 * features-tree: /home/najib/water/rep-seqs.tree
 * multiple-alignment: /home/najib/water/rep-seqs.msa
 * rep-seqs: /home/najib/water/rep-seqs.fasta

As you can see, there was a DADA2 error and the tool didn't generate the MicrobiomeAnalyst files. How can I fix it so that I get the files for MicrobiomeAnalyst and can then make the phyloseq object? Thank you, and sorry for the trouble again.

telatin commented 2 years ago

Howdy! The problem should be fixed at its source: too many reads are being filtered out. One option is of course to lower the threshold so that aggressive filtering is allowed, but in this case keeping only about 4.8% of the total reads looks worth investigating, and it may be better to adjust the parameters (truncQ, maxee, trunc...) so that fewer reads are filtered in the first place. Different providers or sequencing cores can produce very different output: once you have tuned the parameters for your usual supplier, you should be able to adjust the pipeline quickly.

A way to investigate the biggest loss is checking the dada2-stats file where you'll see the number of reads retained at each step. Can you please post it?

najibveto commented 2 years ago

Thank you for your reply. The dada2-stats file is as follows:

| | input | filtered | denoised | merged | non-chimeric |
| -- | -- | -- | -- | -- | -- |
| W2111_R1.fastq.gz | 71411 | 19455 | 19455 | 13308 | 2653 |
| W2201_R1.fastq.gz | 76186 | 23861 | 23861 | 20669 | 2085 |
| W2202_R1.fastq.gz | 69819 | 25303 | 25303 | 21829 | 2245 |
| W2203_R1.fastq.gz | 91891 | 46711 | 46711 | 34512 | 6685 |

I tried lowering the loss to 5% as suggested, but I got the same error.

telatin commented 2 years ago

From the stats, there is a significant loss at the filtering step, but it is not dramatic. A further significant loss occurs at the non-chimeric step. Relaxing the initial filtering may improve the process. If the chimera detection is right, the library may have been amplified a lot (or there could be other sources?).
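To make the per-step losses easier to compare, here is a minimal sketch that turns the read counts from the stats table posted above into retention fractions. The function names (`step_retention`, `worst_step`) are made up for illustration and are not part of dadaist2. On these numbers the quality filter keeps roughly 27% to 51% of the input reads, and chimera removal keeps only about 10% to 20% of the merged reads.

```python
# Read counts copied from the dada2-stats table posted in this thread.
STEPS = ["input", "filtered", "denoised", "merged", "non-chimeric"]
STATS = {
    "W2111_R1.fastq.gz": [71411, 19455, 19455, 13308, 2653],
    "W2201_R1.fastq.gz": [76186, 23861, 23861, 20669, 2085],
    "W2202_R1.fastq.gz": [69819, 25303, 25303, 21829, 2245],
    "W2203_R1.fastq.gz": [91891, 46711, 46711, 34512, 6685],
}


def step_retention(counts):
    """Fraction of reads kept at each step, relative to the previous step."""
    return [after / before for before, after in zip(counts, counts[1:])]


def worst_step(counts):
    """Name and retention fraction of the step that keeps the fewest reads."""
    kept = step_retention(counts)
    idx = min(range(len(kept)), key=kept.__getitem__)
    return STEPS[idx + 1], kept[idx]


if __name__ == "__main__":
    for sample, counts in STATS.items():
        per_step = ", ".join(
            f"{name}: {frac:.1%}"
            for name, frac in zip(STEPS[1:], step_retention(counts))
        )
        overall = counts[-1] / counts[0]
        print(f"{sample}  {per_step}  overall: {overall:.1%}")
```

For every sample in the table, `worst_step` reports the non-chimeric step as the one that keeps the smallest share of its input.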

I do not recommend lowering the loss parameter to bypass the issue: this is not a bug but a sanity check to prevent misinterpreting potentially noisy results. That said, if the message is DADA2 filtered too many reads: 4.7926%, you should try with 4% (or simply 1% to keep the check effectively disabled :) )
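The arithmetic behind that suggestion can be sketched as follows. This is inferred only from the log lines in this thread, not from the dadaist2 source: the reported percentage matches the fraction of reads retained (23305 out of 486266), and the run appears to be flagged when that fraction is below the `--max-loss` value. The function names are hypothetical, and how a value exactly equal to the threshold is treated is an assumption.

```python
def retained_fraction(total_reads, final_reads):
    """Fraction of the input reads that survive the whole DADA2 run."""
    return final_reads / total_reads


def passes_check(total_reads, final_reads, max_loss):
    # Assumption inferred from the logs: the run is accepted when the
    # retained fraction is at least the --max-loss value.
    return retained_fraction(total_reads, final_reads) >= max_loss


if __name__ == "__main__":
    total, final = 486266, 23305  # values from the log message
    print(f"retained: {retained_fraction(total, final):.4%}")
    for threshold in (0.05, 0.04, 0.01):
        verdict = "ok" if passes_check(total, final, threshold) else "too many reads filtered"
        print(f"--max-loss {threshold}: {verdict}")
```

Under this reading, 0.05 fails because 4.79% is below 5%, while 0.04 and 0.01 pass.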

najibveto commented 2 years ago

Hello, I tried both percentages for the loss (1% and 4%). Both worked, and I got the phyloseq object as well as the MicrobiomeAnalyst files. When I tried with 5% loss, I got the usual error. Here is the run with 4% loss:


dadaist2  --max-loss 0.04 -i metagenome/16S/ -o water -m metadata.tsv -d ~/refs/silva_nr_v138_train_set.fa.gz
    ____            __      _      __ ___
   / __ \____ _____/ /___ _(_)____/ /|__ \
  / / / / __ `/ __  / __ `/ / ___/ __/_/ /
 / /_/ / /_/ / /_/ / /_/ / (__  ) /_/ __/
/_____/\__,_/\__,_/\__,_/_/____/\__/____/

1.2.5

[WARNING] Output directory found.
 This is a warning but in future releases this might require to specify --force to proceed.
[2022-08-16 09:20:21] Ready to log in /home/najib/water/dadaist.log
[2022-08-16 09:20:21] dadaist2 1.2.5
[2022-08-16 09:20:21] Taxonomy database found: /home/najib/refs/silva_nr_v138_train_set.fa.gz
[2022-08-16 09:20:21] Parameter: taxonomy-type: dada2
[2022-08-16 09:20:21] Parameter: taxonomy-db: /home/najib/refs/silva_nr_v138_train_set.fa.gz
 * Input directory: metagenome/16S/
 * Output directory: /home/najib/water/
 * Metadata: metadata.tsv
 * Reference database: /home/najib/refs/silva_nr_v138_train_set.fa.gz
 * Threads: 6
 * Temporary directory: /tmp/dadaist2_1fIjRN
 * QC strategy: skip
[2022-08-16 09:20:21] QC: Checking quality profile with SeqFu
[2022-08-16 09:20:22] SeqFu quality truncation at (trunc-len-1 and trunc-len-2): 290 - 231
[2022-08-16 09:20:22] Checking dependencies
 * RScript: R scripting front-end version 4.0.5 (2021-03-31)
 * Taxonomy: dadaist2-assigntax 1.1.3
 * assign-taxonomy: dadaist2-assigntax 1.1.3
 * clustalo: 1.2.4
 * dada2 (lib): <pass>
 * exporter: dadaist2-exporter 1.4.0
 * fastp: fastp 0.23.2
 * fasttree: FastTree version 2.1.11 Double precision (No SSE3):
 * fu-primers: fu-primers 1.12.0
[2022-08-16 09:20:27] Temporary directory: /tmp/dadaist2_1fIjRN
[2022-08-16 09:20:27] Threads: 6
[2022-08-16 09:20:27] Output directory: /home/najib/water/
[2022-08-16 09:20:27] Checked metadata for autumn
[2022-08-16 09:20:27] Checked metadata for spirng
[2022-08-16 09:20:27] Checked metadata for summer
[2022-08-16 09:20:27] Checked metadata for winter
[2022-08-16 09:20:27] Input directory "metagenome/16S/": 4 found (paired-end)
[2022-08-16 09:20:27] (1/4) Processing autumn: skip
[2022-08-16 09:20:27] Copying input reads for DADA2
[2022-08-16 09:20:27] (2/4) Processing spirng: skip
[2022-08-16 09:20:27] Copying input reads for DADA2
[2022-08-16 09:20:27] (3/4) Processing summer: skip
[2022-08-16 09:20:27] Copying input reads for DADA2
[2022-08-16 09:20:27] (4/4) Processing winter: skip
[2022-08-16 09:20:27] Copying input reads for DADA2
[2022-08-16 09:20:27] Running DADA2...
[2022-08-16 09:20:27] Dada2 script parameters:
 * [1] forward_reads: /tmp/dadaist2_1fIjRN/for
 * [2] reverse_reads: /tmp/dadaist2_1fIjRN/rev
 * [3] feature_table_output: /tmp/dadaist2_1fIjRN/dada2/dada2.tsv
 * [4] stats_output: /tmp/dadaist2_1fIjRN/dada2/stats.tsv
 * [5] filt_forward: /tmp/dadaist2_1fIjRN/for/filtered
 * [6] filt_reverse: /tmp/dadaist2_1fIjRN/rev/filtered
 * [7] truncLenF: 290
 * [8] truncLenR: 231
 * [9] trimLeftF: 0
 * [10] trimLeftR: 0
 * [11] maxEEF: 1
 * [12] maxEER: 1.5
 * [13] truncQ: 10
 * [14] chimeraMethod: consensus
 * [15] minFold: 1
 * [16] threads: 6
 * [17] nreads_learn: 0
 * [18] baseDir: /tmp/dadaist2_1fIjRN
 * [19] doPlots: do_plots
 * [20] taxonomyDb: /home/najib/refs/silva_nr_v138_train_set.fa.gz
 * [21] saveRDS: no
 * [22] noMerge: 0
 * [23] processPool: 0
[2022-08-16 09:28:44] DADA2 Finished.
[2022-08-16 09:28:44] Converting dada2 taxonomy output: /tmp/dadaist2_1fIjRN/taxonomy.tsv
[2022-08-16 09:28:44] 922 representative sequences found.
[2022-08-16 09:28:44] DADA2 filtered 4.7926% from total 486266 to 23305
[2022-08-16 09:28:44] Multiple sequence alignment and tree generation
[2022-08-16 09:29:20] Feature tree generated
[2022-08-16 09:29:20] Exporting MicrobiomeAnalyst
[2022-08-16 09:29:24] Generating PhyloSeq object
[2022-08-16 09:29:26] Rhea normalization/alpha finished.
[2022-08-16 09:29:26] Dadaist finished, output files saved:
 * dada-taxonomy-table: /home/najib/water/taxonomy.txt
 * feature-table: /home/najib/water/feature-table.tsv
 * features-tree: /home/najib/water/rep-seqs.tree
 * mba-files: /home/najib/water/MicrobiomeAnalyst
 * multiple-alignment: /home/najib/water/rep-seqs.msa
 * phyloseq: /home/najib/water/R/phyloseq.rds
 * rep-seqs: /home/najib/water/rep-seqs.fasta
 * rhea: /home/najib/water/Rhea