marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

Error 'failed to create the overlap store'. #688

Closed OmerTC closed 6 years ago

OmerTC commented 6 years ago

I'm running the following command

canu -d ./Gene_Assembly -p Gene genomeSize=7600 minReadLength=100 minOverlapLength=100 -nanopore-raw Merge_Output_File.fastq

I'm getting the error: failed to create the overlap store

I haven't changed any default parameters other than minReadLength and minOverlapLength. Merge_Output_File.fastq contains nearly 20,000 reads.

I've taken a look at the error log file, correction/Gene.gkpStore.err, and it seems that enough reads are loaded in the correction process. BUT, when taking a look at trimming/Gene.gkpStore.err, it seems that only 5 reads are loaded into this stage.

I've attached the entire Gene_Assembly directory so the error log files can be scrutinized.

What am I doing wrong?

Thanks. Gene_Assembly.zip

skoren commented 6 years ago

What are you trying to assemble, and how did you generate the sequencing data? It looks like your genome size is only 7kb, but you have 72kb reads in the input. When canu selects the longest reads, they're all over 45kb and have no overlaps or similarity to each other. If your genome is really <10kb, then a large fraction of the data seems bogus (how do you get a 574kb read from a 7kb genome?). In that case I would suggest including only reads <5kb for assembly.
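
If it helps, a minimal sketch of that length filter, assuming seqkit is available (any fastq length filter would do the same job):

# keep only reads of at most 5kb; -M/--max-len sets the maximum read length
seqkit seq -M 5000 Merge_Output_File.fastq > Merge_Output_File.max5kb.fastq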

OmerTC commented 6 years ago

Hi skoren, The sample is a gene amplified by PCR, run on a MinION in 1D mode. This is not an attempt to assemble a full genome. Actually, a friend who is better acquainted with Canu already managed to run it and get nice results, consistent with a reference. Sadly, I cannot get his help anytime soon, so I am just trying to reproduce his results and tweak a little. Either way, I know the same input file worked before.

7.6kb is the expected gene size. There are artifacts; the long fragments you refer to could be other low-copy eukaryotic genes.

I suspect the problem is with syntax, or with leftover files from earlier failed runs that contain wrong parameters.
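
If leftover state from failed runs is the worry, a simple way to rule it out (a sketch; the directory name comes from the -d option in the original command) is to start from a fresh assembly directory:

# remove the old output directory so canu cannot resume a run started with different parameters
rm -rf ./Gene_Assembly
# then rerun the canu command from scratch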

skoren commented 6 years ago

The issue is that just one 70kb read is ~10x coverage of your genome size, which is why you're ending up with only 5 reads post-correction. I think either dropping reads >10kb from your input or setting corOutCoverage=1000 should work, because the off-target genes would probably not assemble anyway.
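
For concreteness, a sketch of the adjusted invocation (parameters from the original command; the new -d directory name is just an example, to avoid resuming the earlier failed run):

canu -d ./Gene_Assembly_corOut -p Gene genomeSize=7600 \
     minReadLength=100 minOverlapLength=100 corOutCoverage=1000 \
     -nanopore-raw Merge_Output_File.fastq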

OmerTC commented 6 years ago

Thanks, I'll give it a go and update.

OmerTC commented 6 years ago

Hi skoren, I can understand why dropping reads over 10kb would be productive, but I'm not sure I understand what corOutCoverage=1000 prevents.

Either way, I ran with minOverlapLength=100 minReadLength=100 corOutCoverage=1000. It did seem to work, but only 1,571 reads out of 20,000 made it to the gatekeeper store 'trimming/MACC1.gkpStore'.

Also, according to the HTML files, correction takes its input from my fastq file, trimming takes Gene.correctedReads.fasta.gz, and unitigging takes Gene.trimmedReads.fasta.gz.

The only other time I was able to run Canu successfully was on Ubuntu 17.04 (on my friend's machine). Then, 19,689 of the 20,000 reads made it to the assembly, and all three stages (correction, trimming, assembly) used the same fastq file as input. Why is there a difference in the input files at each stage, and why were so few reads passed on to trimming?

I also BLASTed the two output unitig multi-fasta files, the previous one (where I did not specify minOverlapLength=100) and the recent one, and there are indels and gaps between them. Is this expected?

skoren commented 6 years ago

You have over 4000x coverage of the 7kb region, so you don't need many reads to assemble it; canu will just keep the longest, which is why you ended up with 1,571 / 20,000, which (assuming a 500bp average read) still adds up to over 100x. Setting corOutCoverage prevents canu from keeping only the longest data, so it will assemble your target and anything else at sufficient coverage in the data, rather than throwing out regions covered only by shorter reads.
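
As a sanity check on those numbers, a small sketch that totals read length over a 4-line-record fastq and divides by the 7,600bp genome size from the original command (awk only, no extra tools):

# every 2nd line of each 4-line fastq record is the sequence
awk 'NR % 4 == 2 { bases += length($0) } END { printf "%d bases, ~%.0fx coverage\n", bases, bases / 7600 }' Merge_Output_File.fastq

For the post-correction set, 1,571 reads at ~500bp each is about 785kb, i.e. roughly 103x over a 7.6kb target, matching the "over 100x" figure above.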

You say your other assembly used the same input file for all steps. You mean the uncorrected file? That would negate running correction and trimming, since those stages write out new read files; if you give the same input to every stage, you lose the corrected-read and trimmed-read outputs. That would probably give you a lower-accuracy consensus than running canu end to end, so I'm not surprised there are differences. Nanopore data also has some bias, so bacterial genomes typically come out of the assembler with identity >99%, while more complex genomes are at 97-98%. You can look at signal-level polishing with Nanopolish to get a final consensus.
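
If the goal is to feed each stage's output forward explicitly rather than reusing the raw fastq, canu can run the stages separately. A sketch following the canu 1.x quick-start pattern (the -correct/-trim/-assemble modes and the -nanopore-corrected input flag; check your version's docs):

# stage 1: correct the raw reads
canu -correct -p Gene -d correct_dir genomeSize=7600 -nanopore-raw Merge_Output_File.fastq
# stage 2: trim the corrected reads
canu -trim -p Gene -d trim_dir genomeSize=7600 -nanopore-corrected correct_dir/Gene.correctedReads.fasta.gz
# stage 3: assemble the trimmed, corrected reads
canu -assemble -p Gene -d asm_dir genomeSize=7600 -nanopore-corrected trim_dir/Gene.trimmedReads.fasta.gz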

skoren commented 6 years ago

Inactive, closing.