rotary-genomics / rotary

Assembly/annotation workflow for Nanopore-based microbial genome data containing circular DNA elements
BSD 3-Clause "New" or "Revised" License

Question about QC filtering #202

Closed LeeBergstrand closed 1 month ago

LeeBergstrand commented 1 month ago

For rule nanopore_qc_filter:

reformat.sh in={input} out={output} minlength={params.minlength} minavgquality={params.minavgquality} \
interleaved=f qin=33 threads={threads} -Xmx{resources.mem}G pigz=t unpigz=t > {log} 2>&1

Why are we filtering on minavgquality rather than using trimming parameters such as qtrim and trimq?

minavgquality=0         (maq) Reads with average quality (after trimming) below this will be discarded.
qtrim=f                 Trim read ends to remove bases with quality below trimq.
                        Values: t (trim both ends), f (neither end), r (right end only), l (left end only), w (sliding window).
trimq=6                 Regions with average quality BELOW this will be trimmed.  Can be a floating-point number like 7.3.

@jmtsuji Would it make sense to qtrim the reads and then filter by length rather than dropping reads based on minavgquality? Or are nanopore error profiles not conducive to end trimming?
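To make the two options in the question concrete, here is a minimal Python sketch that contrasts them on a toy read. It is not how reformat.sh is implemented: the function names are hypothetical, the end trim is a naive per-base trim rather than BBTools's own trimming algorithm, and the average is a plain arithmetic mean of Q scores. The thresholds in the example are illustrative only.

```python
def phred33_to_scores(quality_string):
    """Decode a Phred+33 quality string (qin=33) into integer Q scores."""
    return [ord(ch) - 33 for ch in quality_string]


def mean_quality(scores):
    """Arithmetic mean of the Q scores (an assumed simplification)."""
    return sum(scores) / len(scores) if scores else 0.0


def keep_by_average(scores, min_length, min_avg_quality):
    """Strategy 1: keep or drop the whole read by length and mean quality."""
    return len(scores) >= min_length and mean_quality(scores) >= min_avg_quality


def trim_ends(scores, trim_q):
    """Naive end trim: strip bases below trim_q from both ends."""
    start, end = 0, len(scores)
    while start < end and scores[start] < trim_q:
        start += 1
    while end > start and scores[end - 1] < trim_q:
        end -= 1
    return scores[start:end]


def keep_by_trim_then_length(scores, min_length, trim_q):
    """Strategy 2: end-trim first, then filter only on remaining length."""
    return len(trim_ends(scores, trim_q)) >= min_length


# Toy read: two Q3 bases at each end, six Q20 bases in the middle.
toy_scores = phred33_to_scores("$$555555$$")
print("mean quality:", mean_quality(toy_scores))
print("kept by average filter:", keep_by_average(toy_scores, 5, 15))
print("kept after end trimming:", keep_by_trim_then_length(toy_scores, 5, 8))
```

With these toy thresholds the average filter drops the read while the trim-then-length approach keeps its six-base core, which is the "slightly more data" effect the question is asking about.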

LeeBergstrand commented 1 month ago

[Image: FastQC per-base sequence quality plot]

It looks like the quality scores decline along the read length, though for some reads they recover.

LeeBergstrand commented 1 month ago

@jmtsuji I ran stats on the assemblies of 32 genomes from one of our internal datasets. All of these samples used older flowcell and basecaller versions, so I had to trim them using a Q score of 8 (I tried many Q score cut-offs, and this one gave the best assemblies) and assemble them using Flye in nano-raw mode. Here are the results:

[Images: 32_genomes_checkm, 32_genomes_quast_1, 32_genomes_quast_2]

There was not much difference between dropping reads with a Q score below X and end trimming reads to a Q score of X. I assumed that end trimming would get slightly more data into the assembly, but it does not appear to make much of a difference. @jmtsuji Thoughts?

LeeBergstrand commented 1 month ago

From Mike Lynch:

End trimming makes sense in environments like Illumina and 454/pyrosequencing where enzyme decay puts Q-score recovery in question. It was my understanding that nanopore (and definitely PacBio) don’t have the same issue. That quality decay can easily recover (in some ways the Q-score of a set of nucleotide calls is at least partially independent of previous Q-scores - this obviously doesn’t hold if the Q-score decay is due to template issues).

So, keep average score trimming, which should be more representative of nanopore error profiles.
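The point about quality recovering mid-read can be illustrated with a small Python sketch. It builds a toy read whose ends are good but whose middle dips, then shows that a naive end trim would remove nothing while a whole-read average still reflects the dip. The function names are hypothetical, and both ways of averaging are shown because the thread does not say whether reformat.sh averages Q scores directly or averages the underlying error probabilities.

```python
import math


def arithmetic_mean_quality(scores):
    """Plain mean of the Q scores."""
    return sum(scores) / len(scores)


def probability_mean_quality(scores):
    """Mean of per-base error probabilities, converted back to a Q score."""
    mean_error = sum(10 ** (-q / 10) for q in scores) / len(scores)
    return -10 * math.log10(mean_error)


def low_quality_end_bases(scores, trim_q):
    """Count bases below trim_q at the left and right ends of the read."""
    left = 0
    while left < len(scores) and scores[left] < trim_q:
        left += 1
    right = 0
    while right < len(scores) - left and scores[len(scores) - 1 - right] < trim_q:
        right += 1
    return left, right


# Toy read: Q20 at both ends, with a Q2 dip in the middle that then recovers.
dip_read = [20] * 4 + [2] * 4 + [20] * 4

print("bases an end trim at Q8 would remove:", low_quality_end_bases(dip_read, 8))
print("arithmetic mean Q:", arithmetic_mean_quality(dip_read))
print("probability-based mean Q:", round(probability_mean_quality(dip_read), 2))
```

In this toy case the end trim leaves the read untouched because both ends are Q20, whereas the whole-read average is pulled down by the internal dip, more strongly so when averaging error probabilities.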

I feel we can close this.