marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

PacBio Read correction bias #468

Closed chklopp closed 7 years ago

chklopp commented 7 years ago

When comparing the insertion and deletion rates between raw and canu-corrected reads, we found that while insertions are almost completely removed, deletions are mostly kept. This bias is stronger for Cs and Gs than for As and Ts.

corresponding graph

Why is canu not correcting the deletions as well as the insertions?

skoren commented 7 years ago

Is this nanopore or pacbio data? In both cases, there are limits to the information in the base-called sequences. This is especially true for nanopore data which, until recently, collapsed homopolymers to at most 5 bp during base calling. There are other base-calling biases in nanopore data as well. These deletions can't be corrected from the reads because all the reads share the wrong base calls, which leads to a deletion bias in the results.

Canu is not meant to give a perfect consensus for the corrected reads or the assembly, just one sufficient to accurately separate repeats and equalize the error-rate distribution. That's why we recommend using a signal-level polishing tool (like arrow or nanopolish) to get a final consensus.

chklopp commented 7 years ago

This is PacBio data.

chklopp commented 7 years ago

The deletion bias corresponds to 3 to 9% of the reads, so most of the reads have the same homopolymer length as the reference. Why is it more difficult to correct deletions than insertions? This is especially surprising for monomers, which have the same error rate for deletions and insertions in the raw reads but not in the corrected ones.

skoren commented 7 years ago

You said the error rates are the same for insertions and deletions in the raw data, but I don't see that in the chart. Insertions seem much lower, especially in long homopolymer stretches (first row) versus deletions (third row). In both cases, correction drops the fraction of reads with errors by about 3% or more (except GC deletions, which seem the least corrected), so the insertions are eliminated but the deletions aren't.

Homopolymers are typically the hardest to sequence, so you're going to see more errors there. PacBio data does have an increased deletion rate in homopolymer regions, and at least in the past it was higher in GC regions (see this paper for details: https://genomebiology.biomedcentral.com/articles/10.1186/gb-2013-14-5-r51).

When you say 9% of the reads, are you looking at aggregate statistics over the full genome and all homopolymers? A specific corrected read is only corrected based on a small subset of reads from that region of the genome, which could have a different error profile than the average. Have you looked at specific reads which are not corrected to the correct length, to see which raw reads map to that position in the genome and whether they have a higher error rate than average? I'd guess there are a few regions where most of the raw reads have a bad call, so the majority vote is wrong. From your current analysis it isn't clear whether the remaining uncorrected errors are due to higher error rates in certain regions or to alignment/consensus. Ideally, you'd want to look at the actual alignments Canu used to generate the corrected reads and determine whether it is even possible to call the correct homopolymer length from them.
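To illustrate the majority-vote point: the sketch below is a toy plurality vote over homopolymer lengths observed in raw reads, not Canu's actual consensus algorithm. If most evidence reads covering a region under-call a homopolymer, any vote-based consensus reproduces the deletion.

```python
from collections import Counter

# Toy illustration (NOT Canu's consensus code): pick the homopolymer length
# reported by the plurality of evidence reads covering a region.
def vote_homopolymer_length(observed_lengths):
    return Counter(observed_lengths).most_common(1)[0][0]

# Suppose the true homopolymer length is 6, but 4 of 7 raw reads drop a base.
raw_observations = [5, 5, 6, 5, 6, 5, 6]
print(vote_homopolymer_length(raw_observations))  # -> 5: the deletion survives correction
```

This is why a genome-wide aggregate rate can hide the real cause: the vote is local, so a handful of regions where the raw reads systematically under-call can account for most of the residual deletions.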

Fundamentally, we haven't worked very hard to get the corrected sequence 100% right, so when the raw sequences have higher error the corrected reads have higher error too.

chklopp commented 7 years ago

Figure 5 of that article shows close to no homopolymer bias in PacBio data, which is not at all what I find in my data.

I'm looking at aggregate statistics over all homopolymers of a given length. Right now I'm only counting the insertion and deletion events (not their lengths) in the CIGAR line, so I underestimate the real error rate.
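For concreteness, the event-counting method described above could look like the following sketch (the CIGAR string here is made up for illustration), which also shows why event counts underestimate the true indel rate:

```python
import re

# Count insertion/deletion EVENTS in a SAM CIGAR string versus the total
# number of inserted/deleted BASES. Counting events only (as described above)
# underestimates the error rate whenever an indel spans more than one base.
def indel_stats(cigar):
    events = {"I": 0, "D": 0}
    bases = {"I": 0, "D": 0}
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        if op in events:
            events[op] += 1
            bases[op] += int(length)
    return events, bases

events, bases = indel_stats("50M2D30M1I19M3D10M")
print(events)  # -> {'I': 1, 'D': 2}
print(bases)   # -> {'I': 1, 'D': 5}: 2 deletion events, but 5 deleted bases
```

Switching the tally from `events` to `bases` would remove the underestimate for multi-base indels.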

brianwalenz commented 7 years ago

I don't see anything that we can (or want to) do here. I'll point out that a month or two ago we changed the way evidence reads are aligned when generating corrected reads. This resulted in longer corrected reads that worked better for assembly. We didn't explicitly analyze the quality of the new corrected reads.