sanger-pathogens / circlator

A tool to circularize genome assemblies
http://sanger-pathogens.github.io/circlator/

It looks like circlator keeps a common k-mer at the ends of a contig after changing start position. What for? #82

Closed svkazakov closed 7 years ago

svkazakov commented 7 years ago

Hi, I've run circlator on my own data, and it looks like it keeps a common k-mer at both ends of a contig produced by SPAdes after changing the start position. Shouldn't circlator remove one of the copies?

There was a common k-mer in a contig of the input assembly, but the contig wasn't circular (see the Bandage visualization in initial-assembly.png and the file 00.input_assembly.fasta). After the local assembly made by SPAdes 3.7.1 in task 03.assembly, the contig becomes circular (see after-local-assembly.png and the attached file 03.assemble-contigs.fasta). It now has a common k-mer at both ends, 127 nucleotides long, matching the k=127 that was used (see common-k-mer.png). This contig, NODE1*, is then circularized using nucmer in task 04.merge, but the file 04.merge.fasta still contains the common k-mer at both ends of the contig! In the final output this contig is reversed and its start position is changed, but the reverse-complement copy of the common k-mer is still present twice in the middle of the final contig.
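To make the symptom concrete, here is a minimal Python sketch of the check being described: does a contig start and end with the same k-mer, and what would the sequence look like with one copy dropped? This is not circlator's code. The function names `shared_end_kmer` and `trim_duplicate_end` are hypothetical, and the default `k=127` is simply the value used in this run.

```python
def shared_end_kmer(seq: str, k: int = 127) -> bool:
    """Return True if the first k bases of seq equal its last k bases.

    Hypothetical helper for illustration; not part of circlator.
    """
    if len(seq) <= k:
        # Too short for two distinct end k-mers.
        return False
    seq = seq.upper()
    return seq[:k] == seq[-k:]


def trim_duplicate_end(seq: str, k: int = 127) -> str:
    """Drop the trailing copy of the shared end k-mer, if there is one."""
    return seq[:-k] if shared_end_kmer(seq, k) else seq


# Toy example with k=4 instead of 127.
contig = "ACGT" + "TTTT" + "ACGT"
print(shared_end_kmer(contig, k=4))
print(trim_duplicate_end(contig, k=4))
```

In a circular contig the sequence should wrap around without repeating itself, so one of the two end copies is redundant.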

It is a bug, isn't it?

circlator.log.txt initial-assembly after-local-assembly-2 common-k-mer

other-files.zip

martinghunt commented 7 years ago

Looks like a bug. Could you share all the files with me please so I can debug it?

Thanks, Martin

svkazakov commented 7 years ago

I have attached the whole working directory. Is that enough? circSeq.1-3-main.zip

Sergey.

martinghunt commented 7 years ago

That's great, thanks. I'll have a look into it...

martinghunt commented 7 years ago

I agree, that's definitely a bug. The way circlator works is to trust that the SPAdes assembly is correct, which is unfortunate in this case because SPAdes has put that k-mer at each end of the contig. I will have a think about whether I can do anything to catch this case. However, circlator was designed for long reads, not short Illumina reads, and I have never seen this happen with long reads.

svkazakov commented 7 years ago

Yes, I know that Circlator was designed for long reads (usually PacBio or Oxford Nanopore), but in my case we have only two libraries with relatively long reads: a 454 library with an average read length of 450 bp and an Illumina library with an average read length of 201 bp. In both cases the common k-mer is present twice in the final contig.
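For anyone wanting to check their own output for the same problem, a rough Python sketch like the one below counts how many times a given k-mer and its reverse complement occur in a contig. It is only an illustration, not something circlator provides. The names `reverse_complement`, `count_occurrences` and `kmer_copies` are hypothetical, and the sequences are toy data.

```python
def reverse_complement(seq: str) -> str:
    """Reverse-complement a DNA string made of A, C, G and T."""
    table = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(table)[::-1]


def count_occurrences(seq: str, kmer: str) -> int:
    """Count occurrences of kmer in seq, including overlapping ones."""
    count = 0
    start = seq.find(kmer)
    while start != -1:
        count += 1
        start = seq.find(kmer, start + 1)
    return count


def kmer_copies(contig: str, kmer: str) -> tuple:
    """Return (forward copies, reverse-complement copies) of kmer in contig."""
    contig = contig.upper()
    kmer = kmer.upper()
    rc = reverse_complement(kmer)
    forward = count_occurrences(contig, kmer)
    # A palindromic k-mer equals its own reverse complement; avoid double counting.
    reverse = count_occurrences(contig, rc) if rc != kmer else 0
    return forward, reverse


# Toy contig holding two reverse-complement copies of the k-mer "AACG".
contig = "TT" + "CGTT" + "GGG" + "CGTT" + "TT"
print(kmer_copies(contig, "AACG"))
```

A count above one on either strand for the k-mer taken from the original contig ends would match what I see in the final contig.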

martinghunt commented 7 years ago

Even 454 reads are shorter than circlator was designed for - I wouldn't consider 450bp to be long. It's expecting reads that are kilobases in length.