rrwick / Unicycler

hybrid assembly pipeline for bacterial genomes
GNU General Public License v3.0

Assembly discrepancy with and without read dereplication #331

Closed pjdiebold closed 4 months ago

pjdiebold commented 4 months ago

Hi, I am using Unicycler to build hybrid bacterial genomes. I am very happy with its performance! Thank you for this amazing tool.

When I first started using Unicycler, I was not doing thorough QC of my reads before assembly. I decided to rewrite the pipeline with more read filtering and discovered that my assemblies became more fragmented. For example, one of my genomes, which had been assembling into 1 segment and was marked "complete" in the log file, became 13 segments after read QC. Below is my QC workflow:

I discovered that the only filtering step that seemed to affect the final assembly was the dereplication of short reads. I have a few questions:

rrwick commented 4 months ago

I'm afraid I don't have any insights here - I often use fastp, but with default parameters (so no deduplication).

It's not clear to me why deduplication would cause the problem, but it's also not clear why deduplication would be helpful for assembly. If two reads are identical, they contribute the same k-mers to the assembly graph, so keeping both shouldn't be a problem. So yes, I think you can trust the assembly produced without deduplication.
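To illustrate the k-mer point, here is a minimal Python sketch. It is not Unicycler or SPAdes code; the toy reads, the choice of `k=4`, and the helper names `kmers` and `kmer_counts` are assumptions made up for this example. It compares the k-mers from a read set with and without exact duplicates.

```python
from collections import Counter


def kmers(read, k):
    """Yield every substring of length k in a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def kmer_counts(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


# Toy data (hypothetical): the first two reads are exact duplicates.
reads = ["ACGTACGT", "ACGTACGT", "CGTACGTT"]

# Exact-match deduplication that preserves read order.
deduplicated = list(dict.fromkeys(reads))

with_dups = kmer_counts(reads, k=4)
without_dups = kmer_counts(deduplicated, k=4)

same_kmer_set = set(with_dups) == set(without_dups)
same_counts = with_dups == without_dups
print(same_kmer_set, same_counts)  # prints: True False
```

In this toy case the set of distinct k-mers is the same either way; only the k-mer counts (depth) differ.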

Regarding read filtering, I generally advise against Porechop (it's old and abandoned now). ONT basecallers can trim off adapters, and even if untrimmed adapters are left on reads, they don't cause problems with assembly. For Filtlong, I usually run it without Illumina reads as a reference for the reasons stated here. Simply assessing long reads using their own quality scores is safer, in my opinion.
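As a rough illustration of judging long reads by their own quality scores, the Python sketch below ranks reads by mean quality and keeps the best fraction. This is not Filtlong's actual scoring. The function names `mean_phred` and `keep_best`, the toy reads, and the Phred+33 encoding are assumptions for the example.

```python
import math


def mean_phred(quality_string, offset=33):
    """Mean read quality, averaged over per-base error probabilities.

    Averaging probabilities (not raw Phred scores) stops a few very
    high-quality bases from hiding many low-quality ones.
    """
    probs = [10 ** (-(ord(c) - offset) / 10) for c in quality_string]
    mean_error = sum(probs) / len(probs)
    return -10 * math.log10(mean_error)


def keep_best(reads, keep_fraction):
    """Keep the top fraction of reads ranked by their own mean quality.

    reads: list of (name, sequence, quality_string) tuples.
    """
    ranked = sorted(reads, key=lambda r: mean_phred(r[2]), reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]


# Toy reads (hypothetical), Phred+33 encoded: I=Q40, ?=Q30, 5=Q20, +=Q10.
reads = [
    ("read_a", "ACGT", "IIII"),
    ("read_b", "ACGT", "++++"),
    ("read_c", "ACGT", "5555"),
    ("read_d", "ACGT", "????"),
]

kept_names = [name for name, _, _ in keep_best(reads, keep_fraction=0.5)]
print(kept_names)  # prints: ['read_a', 'read_d']
```

No short-read reference is involved here: each read is scored only from its own quality string.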