marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

Canu assemblies contain large proportion of pseudogenes #1201

russellj7 closed this issue 5 years ago

russellj7 commented 5 years ago

Hello,

I've assembled two bacterial genomes sequenced using Nanopore with Canu v1.5 on macOS using the default settings and an estimated genome size of 2.5 Mb. The assemblies have always finished without error and result in a single output sequence (presumably the chromosome sequence). Both assemblies were circularized using Circlator and polished using Nanopolish.

However, when the sequences are annotated using NCBI's annotation pipeline, they result in an abnormally large proportion of pseudogenes (roughly 60% of all genes), with almost all of them being frameshifted pseudogenes. This happens with both genomes, which I suppose makes sense because they were both processed the same way (as described above).

My problem now is how to figure out what went wrong and where. Was there a problem during the assembly, the circularization or the consensus polishing? I suppose I could try annotating draft genomes from each of the three steps above but I imagine this would be annoying to the GenBank submission staff.

Can someone offer any advice on how to identify/fix this issue?

skoren commented 5 years ago

Nothing wrong in the assembly/polishing. If you have not already, run nanopolish at least twice (with the input to the second round being the first round output) and run it in methylation-aware mode. However, the nanopore-only consensus is limited to 99.9% at best (and lower depending on sequence context, e.g. https://genomeinformatics.github.io/na12878update/), which will leave a lot of pseudogenes (in a 2.5 Mb genome with 99.9% accuracy you'd have about 2,500 errors). Most of these will be indels and will cause a frameshift. The only way to get around this is to add Illumina-based polishing using Pilon, Freebayes, or Racon.
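
To make the arithmetic above concrete, here is a rough back-of-the-envelope sketch in Python. The first function just multiplies genome size by the per-base error rate. The second is an added assumption that is not from this thread: it treats errors as independent and uniformly spread (a Poisson model), assumes every error is an indel, and uses a hypothetical average gene length of 1,000 bp. Real nanopore errors cluster in homopolymers and other contexts, so treat the result as an order-of-magnitude estimate only.

```python
import math


def expected_errors(genome_size_bp: int, accuracy: float) -> float:
    """Expected number of erroneous bases for a consensus of the given accuracy."""
    return genome_size_bp * (1.0 - accuracy)


def fraction_of_genes_hit(gene_length_bp: int, accuracy: float,
                          indel_fraction: float = 1.0) -> float:
    """Chance that a gene carries at least one indel.

    Assumption (illustrative only): errors are independent and uniform,
    so the number of indels per gene is roughly Poisson distributed.
    """
    rate = gene_length_bp * (1.0 - accuracy) * indel_fraction
    return 1.0 - math.exp(-rate)


if __name__ == "__main__":
    # Figures from the comment above: 2.5 Mb genome, 99.9% consensus accuracy.
    print(round(expected_errors(2_500_000, 0.999)))
    # Hypothetical 1,000 bp average gene length, all errors treated as indels.
    print(f"{fraction_of_genes_hit(1000, 0.999):.2f}")
```

Under these simplified assumptions the script gives about 2,500 expected errors and roughly 0.63 of genes containing at least one indel, which is in the same range as the ~60% pseudogene rate reported in the question.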

See more information in this very detailed post from Ryan Wick: https://github.com/rrwick/Basecalling-comparison

russellj7 commented 5 years ago

Ah, thank you so much for the response and links! If I may ask, how would re-running Nanopolish over and over again further improve the consensus if it's working off the same data each time?

I suppose until the consensus accuracy for Nanopore gets any better I will just stick with PacBio for whole genome sequencing...

skoren commented 5 years ago

Some regions that weren't mappable before can become mappable after the first round. It can also make some edits it didn't make the first time. Multiple rounds also improve PacBio's Arrow polishing.
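
The "feed each round's output into the next" idea can be sketched generically. This is only an illustration of the control flow, not Nanopolish's or Arrow's actual interface: `polish_once` is a hypothetical callable standing in for whatever map-and-polish step you run, and the toy polisher below (which fixes one `N` per round) exists purely to show that a second round can make edits the first did not.

```python
from typing import Callable


def polish_iteratively(draft: str,
                       polish_once: Callable[[str], str],
                       max_rounds: int = 2) -> str:
    """Run a polishing step repeatedly, using each output as the next input.

    Stops early if a round leaves the sequence unchanged.
    """
    current = draft
    for _ in range(max_rounds):
        polished = polish_once(current)
        if polished == current:
            break  # nothing left for this polisher to fix
        current = polished
    return current


def toy_polisher(seq: str) -> str:
    """Hypothetical stand-in for a real polisher: fixes one 'N' per round."""
    return seq.replace("N", "A", 1)


if __name__ == "__main__":
    print(polish_iteratively("ANNT", toy_polisher, max_rounds=1))
    print(polish_iteratively("ANNT", toy_polisher, max_rounds=2))
```

In a real pipeline, `polish_once` would re-map the reads to the current consensus before polishing, which is where previously unmappable regions get a second chance.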