marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/
654 stars 179 forks source link

Contigs contain many small duplicates in canu 2.0 dev using HIFI data #1586

Closed HenrivdGeest closed 4 years ago

HenrivdGeest commented 4 years ago

I am using a very recent github pull of CANU (v2.0-development +312 changes (r9805 b6670537326b323d2837d7801d089f4f52b860c2)) in order to assembly our hifi pacbio data. In terms of continuity this version of canu made a very nice improvement over the 1.9 release without hifi mode (my guess is that the homopolymer compression alignments are the cause of this?). Based on a chromosome level reference genome we can see that this high continuity is likely correct, ie it does not seem to contain obvious miss-assemblies. However, when we do map the hifi reads data back to our assembly, we see many 'chimeric' situations where just zero reads seems to span a particular position. All reads from 5' or 3' seem to clip. I dived deeper into it and aligned a canu 2.0 r9805 unitig to a canu1.9 non hifi assembly contig and found the the unitig(or the contig) contains many small tandem duplicates of 1-3kb. Its seems to me this is an artifact and not the biology here. I am fully aware that I am using a canu snapshot here, so this might be something on the radar, or even already fixed in the latest. In the picture you see 3 alignment parts, all originating from the same unitig. They are all supplementary alignments. image. A bit more zoom-out where you see the complete contig alignment region: image

Overall, the canu 2.0dev unitigs/contigs seem ~ 10% bigger than the can1.9 contig because of this, in our case.

command: canu_git/Linux-amd64/bin/canu -p bla -d bla -pacbio-hifi -s specs.spec I specified the corrected error rate to 1% in the spec file

skoren commented 4 years ago

To clarify, you're saying there are multiple copies of a tandem in the final asm that aren't in the reads? With the current version, the error rate you set is likely too low, we default it to 5% currently for consensus. You could try re-generating the consensus at the higher error rate. Without it, it is possible some reads won't get properly aligned during consensus and you'll end up with a pasted together consensus rather than a true consensus. You could also try running racon with the HiFi data on the contigs you generated.

That said, we're still working on the development version of hifi, in particular the consensus. Expect changes/improvements in the next few weeks.

HenrivdGeest commented 4 years ago

yes you are right, and I think a polishing step does not really help, since the reads do no not fully align, it stops aligning at the point with the duplication the contig is no constructed as follows (for example), with the following reads: reads: ABCDE CDEFG FGHIJ contig: ABCDEEFGHHIJ (I am not sure if the tandem/duplication occurs at the end of the source reads)

its not the same part that is always duplicated, but always a small portion of the read.

skoren commented 4 years ago

That sounds like it might be a consensus issue, try updating to the latest tip and re-running. There have been some large-scale changes to consensus and see if it is still a problem. If it is, would you be able to share the data for us to debug locally?

skoren commented 4 years ago

Any updates? We've seen improved consensus for a variety of genomes (human, plant, animal) from either the hicanu_rc or tip versions over what you had used that likely fixed the issue.

HenrivdGeest commented 4 years ago

I just started the assembly from hifi with the latsst Canu git pull. Keep you posted.

mbhall88 commented 4 years ago

Thought I would piggy-back on this and mention I am seeing pretty big duplications in my assembly using CCS/Hifi reads also.
I am attempting to assemble a Mycobacterium tuberculosis sample (which I actually have Illumina, nanopore, and CCS for). My expected genome size is ~4.4Mb, but the assembly I get from Canu ends up with 332 contigs, and a genome size of ~10.7Mb (2.424 duplication). The Canu assembly I get from the nanopore reads is 31 contigs and a genome size ~6Mb (1.36 duplication). In contrast, using Flye with the CCS data I get a single contig that is very close to my expected genome size and for ONT data I get 3 contigs slightly bigger than expected genome size.

Mtb has a fairly GC-rich genome (~65%) and has some fairly repetitive regions. Given this, I also tried the assembly with Canu setting corMaxEvidenceErate=0.15 (as recommended in the docs), but it didn't make any noticeable difference.

I am working off https://github.com/marbl/canu/commit/7ae7a097fab4819ab4d29bdcce0f38aa744f571c

skoren commented 4 years ago

@mbhall88 not really related to the above as your issue is not consensus nor repeat related. Your input is likely not clonal and the larger genome size is due to Canu separating the variants present in the sample. I would guess there is one large contig in the Canu assembly which is close to the genome size with extra contigs representing variants, some of which may be marked as bubbles. We recommend using purge_dups after a HiFi assembly, normally this is only needed for eukaryotes or diploid samples but is also necessary for non-clonal bacterial samples.

mbhall88 commented 4 years ago

Ah, apologies for gate-crashing this issue.

skoren commented 4 years ago

No worries, post the full report from your Canu runs, it may have more info on what is going on. If you are able to share the data, we could also try running it locally.

HenrivdGeest commented 4 years ago

I tried to relocate the issue, but its hard to find since it didn't occur always.However, with the git pull of 12 feb, the contigs with the issues showed that now its fine. The mummer alignment perfectly show the original problem: On the y-axis you see the new git pull assembly and on the horizontal x axies the old assembly. you can see that tig0131 has 2 short duplications, where tig2508 doesn't have it. image Or in a more zoom-in plot: image

However, since I am now looking for certain cases with sudden read mapping drops, I also spotted cases like in the image below: image On the top track you see Q20 hifi data mapped with minimap -xmap-pb to the assembly. The bottom track are the unitigs of this assembly.( minimap2 -xasm20 -N25 -p 0.000000001) The read mapping track shows regions with almost no (sometimes zero, sometimes there are 1 or 2 reads spanning) reads I suspected that maybe there are alternative allele contigs generated for these pieces which attracts all the reads. However the unitigs track also do not show any alignments? Although I cannot be hundred percent sure of this, I am wondering if this is an artefact.

skoren commented 4 years ago

We've seen a few cases of sequence context causing lower coverage but I don't think this is the cause of your drops. I expect these are mapping artifacts due to repeats, are you showing secondary/low mapQ alignments in these plots? Minimap will both filter repetitive k-mers to avoid seeding mappings with them and mark reads that have multiple mapping locations as ambiguous. The assembly is much more stringent (and is operating on higher quality reads) so can localize reads mapping cannot.

skoren commented 4 years ago

Idle, original issue resolved by release candidate branch.