marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

assembly overlapped amplicon data, where to find contained reads #2327

Closed by xiaoli-dong 1 week ago

xiaoli-dong commented 3 weeks ago

I am assembling tiling amplicon data with canu; the amplicon length varies from 500 to 1200 bp. With canu, I can assemble some of the regions into a single contig, but other regions fail to assemble even though there are a lot of overlapping reads. For the regions that cannot be assembled, there are reads in the unassembled.fasta. I am planning to pick the longest read from the unassembled file, or the longest of the contained reads, as the representative sequence for each region. Could you give me some hints on whether this makes sense, and where I can find the contained reads?

skoren commented 3 weeks ago

By definition, contained reads wouldn't be on their own, since they can't form contigs. They'd be assigned to a contig along with the read that contains them. That contig may be in the unassembled fasta (which can include contigs with >1 read) or in the contig sequences. The canu outputs give coordinates for all reads included in the assembly (both contig and unassembled), which you can use to find where each read ended up (see https://canu.readthedocs.io/en/latest/tutorial.html#outputs).
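As a rough illustration of looking up where reads ended up, here is a minimal Python sketch that groups read IDs by contig from a read-to-contig table, such as the readToTig layout file described in the outputs documentation. The column layout is an assumption (whitespace-separated, read ID first, tig ID second, `#` header lines); check it against your own file before relying on it. The function name `parse_read_to_tig` and the example rows are made up for illustration.

```python
import io
from collections import defaultdict


def parse_read_to_tig(handle):
    """Group read IDs by tig ID from a readToTig-style table.

    ASSUMPTION: columns are whitespace-separated, with the read ID in the
    first column and the tig ID in the second; lines starting with '#'
    are headers. Verify this against the actual file.
    """
    reads_by_tig = defaultdict(list)
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.split()
        if len(fields) < 2:
            continue  # skip malformed rows rather than guessing
        read_id, tig_id = fields[0], fields[1]
        reads_by_tig[tig_id].append(read_id)
    return dict(reads_by_tig)


# Made-up example rows, not real canu output.
example = io.StringIO(
    "#readID\ttigID\tcoordType\tbgn\tend\n"
    "1\t7\tungapped\t0\t1100\n"
    "2\t7\tungapped\t150\t900\n"
    "3\t12\tungapped\t0\t800\n"
)
reads_by_tig = parse_read_to_tig(example)
for tig_id, read_ids in reads_by_tig.items():
    print(tig_id, len(read_ids), read_ids)
```

With a real file, you would pass an open file handle instead of the `io.StringIO` example and then cross-reference the tig IDs with the contig and unassembled fasta headers.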

I don't think you want the longest read; I suspect the longest reads may just be artifacts or off-target sequences. You likely want the best-supported (highest-coverage) contig or something like it. See also #2235 and #2269 for some possible parameter tweaks for amplicon assembly.
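One way to approximate "best supported" is to rank contigs by a crude mean read depth computed from read placements. The Python sketch below is only that: it assumes you already have, for each contig, a list of (bgn, end) read coordinates, and it estimates the contig length as the largest coordinate seen. The function name `rank_tigs_by_depth` and the example placements are hypothetical, and this is not how canu itself reports coverage.

```python
def rank_tigs_by_depth(placements):
    """Rank tigs by approximate mean read depth, highest first.

    placements: dict mapping tig ID -> list of (bgn, end) read coordinates.
    ASSUMPTIONS: coordinates are 0-based positions on the tig, reversed
    reads may have bgn > end, and the tig length is approximated by the
    largest coordinate seen among its reads.
    """
    ranked = []
    for tig_id, spans in placements.items():
        if not spans:
            continue  # a tig with no placed reads has no support to rank
        tig_len = max(max(bgn, end) for bgn, end in spans)
        if tig_len == 0:
            continue  # avoid dividing by zero on degenerate input
        covered_bases = sum(abs(end - bgn) for bgn, end in spans)
        ranked.append((tig_id, covered_bases / tig_len))
    # Sort by depth descending, then tig ID for a stable tie-break.
    ranked.sort(key=lambda item: (-item[1], item[0]))
    return ranked


# Made-up placements, not real canu output.
example_placements = {
    "tigA": [(0, 1000), (100, 1100), (1100, 200)],
    "tigB": [(0, 800)],
}
ranked = rank_tigs_by_depth(example_placements)
for tig_id, depth in ranked:
    print(f"{tig_id}\t{depth:.2f}")
```

For each amplicon region you would then take the top-ranked contig rather than the longest read, and still check by eye that it is not an off-target product.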