rrwick / Unicycler

hybrid assembly pipeline for bacterial genomes
GNU General Public License v3.0

Multiple frameshifts in complete assembled bacterial genomes #115

Open kanikabansal1991 opened 6 years ago

kanikabansal1991 commented 6 years ago

I have assembled a few complete bacterial genomes from Nanopore long reads using Unicycler's bold mode, followed by multiple rounds of polishing with Pilon using Illumina MiSeq reads. When we submitted them to NCBI, their annotation detected multiple pseudogenes (more than 10%). They told us this may be due to multiple frameshifts caused by insertions or deletions in the genome sequence. I have also seen a related problem in SNP calling: with draft genomes (Illumina reads only, assembled with SPAdes) we get SNPs in the range of 100-200, while with the complete genomes we get SNPs in the range of 3000. I suspect there is an issue with either the assembly or the Nanopore sequencing. The Nanopore run produced 4 GB of output in 20 hours, and basecalling was done with Albacore. Kindly guide me on where I am going wrong and why I am getting so many frameshifts in the assembly.

rrwick commented 6 years ago

What is your Illumina read coverage like? Or to put it another way, how does the Illumina read assembly graph look?

Nanopore-only sequencing still tends to create lots of indel errors. These should mostly be fixed by the Illumina reads, but if the Illumina reads don't cover the entire genome, then many parts of the assembly will be Nanopore-only, and probably retain a lot of indel errors.
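One rough way to check for this is to align the Illumina reads to the finished assembly, get per-base depth, and list the stretches where depth is very low. The Python sketch below is only an illustration: it assumes three-column, tab-separated input (contig, 1-based position, depth) such as `samtools depth -a` produces, and the function name `low_coverage_regions` and the `min_depth=10` cutoff are made up for this example, not anything from Unicycler.

```python
from typing import Iterable, List, Tuple


def low_coverage_regions(
    depth_lines: Iterable[str], min_depth: int = 10
) -> List[Tuple[str, int, int]]:
    """Return (contig, start, end) runs where depth is below min_depth.

    Assumes each line is "contig<TAB>position<TAB>depth" with 1-based
    positions, sorted by contig and position. The min_depth default is an
    arbitrary illustrative cutoff, not a recommended value.
    """
    regions: List[Tuple[str, int, int]] = []
    current = None  # [contig, start, end] of the run being extended

    for line in depth_lines:
        line = line.strip()
        if not line:
            continue
        contig, pos_str, depth_str = line.split("\t")
        pos, depth = int(pos_str), int(depth_str)

        if depth < min_depth:
            # Extend the current run only if this position is adjacent to it.
            if current and current[0] == contig and pos == current[2] + 1:
                current[2] = pos
            else:
                if current:
                    regions.append((current[0], current[1], current[2]))
                current = [contig, pos, pos]
        elif current:
            # Depth is back above the cutoff, so close the open run.
            regions.append((current[0], current[1], current[2]))
            current = None

    if current:
        regions.append((current[0], current[1], current[2]))
    return regions


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python low_cov.py illumina_depth.tsv
    with open(sys.argv[1]) as handle:
        for contig, start, end in low_coverage_regions(handle):
            print(f"{contig}\t{start}\t{end}\t{end - start + 1}")
```

If the listed regions add up to a large share of the genome, those parts of the assembly are effectively Nanopore-only.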

If this is the case, there's no great solution. Your best bet is probably to run Nanopolish on the assembly, followed by Pilon. This should fix many of the errors (but not all).

Ryan