rrwick / Unicycler

hybrid assembly pipeline for bacterial genomes
GNU General Public License v3.0

Hybrid assembly gives more contigs than "long-read only" #304

Open PauObregon opened 1 year ago

PauObregon commented 1 year ago

Dear developers & community,

I am assembling the genome of a bacterial strain for which I have both long reads (Nanopore) and short reads (Illumina). I first assembled the long reads alone with Canu, which gave 12 contigs (10 after Circlator). I then passed this assembly to Unicycler as "existing_long_read_assembly", together with the long and short reads. Surprisingly, the final Unicycler output FASTA has 42 contigs. Any idea what could be happening? How could I improve this assembly? Thanks in advance!!

444thLiao commented 1 year ago

Hi, I recently encountered this issue too. In my data, I think the 'breaking' is caused by errors in the short-read assembly.

To test this, you could also try long-read-only mode (give Unicycler only the long reads), which I think will generate fewer than 12 contigs.
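To compare the runs, it helps to look at more than the raw contig count. Here is a minimal sketch that summarizes a FASTA file (contig count, total length, N50). The function names are made up for illustration and are not part of Unicycler or Canu.

```python
def read_fasta_lengths(path):
    """Return the length of every contig in a FASTA file, in file order."""
    lengths = []
    current = None  # length of the contig being read; None before the first header
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new header closes the previous contig.
                if current is not None:
                    lengths.append(current)
                current = 0
            elif current is not None:
                current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Largest length L such that contigs of length >= L cover half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def summarize(path):
    """Collect the basic numbers needed to compare two assemblies."""
    lengths = read_fasta_lengths(path)
    return {"contigs": len(lengths), "total_bp": sum(lengths), "n50": n50(lengths)}
```

Calling `summarize()` on the Canu FASTA, the long-read-only Unicycler FASTA and the hybrid FASTA shows whether the extra contigs are many small fragments or real breaks in the large replicons.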

However, in hybrid mode Unicycler only uses the short-read contigs as anchors and maps the long reads to these anchors to bridge them. It does not use the long reads as the backbone, so the resulting breaks are not reliable. The 'breaking' issue arises when a short-read contig can be mapped to two different places in the long-read assembly. When this happens, the long reads can connect/bridge the short-read contigs (the anchors) incorrectly, which produces a wrong contig. This is more likely if your genome has many repeats.
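One rough way to check for such ambiguous anchors is to look for short-read contigs that occur more than once in the long-read assembly. The sketch below uses exact string matching on both strands, which is a strong simplification: a real check would align with a tool such as minimap2 and tolerate mismatches. All names here are hypothetical, and sequences are assumed to be already loaded into dicts of name to sequence.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA string; unknown characters become N."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp.get(base, "N") for base in reversed(seq.upper()))


def count_occurrences(query, target):
    """Count exact, possibly overlapping, occurrences of query in target."""
    count = 0
    start = target.find(query)
    while start != -1:
        count += 1
        start = target.find(query, start + 1)
    return count


def find_repeated_anchors(short_contigs, long_contigs):
    """Return {name: hits} for short-read contigs found at more than one place."""
    repeated = {}
    for name, seq in short_contigs.items():
        seq = seq.upper()
        rc = reverse_complement(seq)
        hits = 0
        for target in long_contigs.values():
            target = target.upper()
            hits += count_occurrences(seq, target)
            if rc != seq:  # do not count a reverse-palindromic sequence twice
                hits += count_occurrences(rc, target)
        if hits > 1:
            repeated[name] = hits
    return repeated
```

Contigs reported by `find_repeated_anchors()` are candidates for the wrong bridges described above, not proof of a misassembly.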

Actually, I haven't resolved this issue yet, but I think removing or correcting these wrong contigs will help.

You could start by restricting the length of anchors using the parameter min_anchor_seg_len. However, in my data, even very long contigs can be wrongly assembled, so I am still looking for a solution.
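For anyone who wants to try this, here is a small sketch that builds the Unicycler command line for a long-read-only run and for a hybrid run with a minimum anchor length. The flags (`-1`, `-2`, `-l`, `-o`, `--min_anchor_seg_len`) are written as I understand them from the Unicycler help, so please confirm with `unicycler --help` for your version. The file names and the value 10000 are placeholders, not recommendations.

```python
import shlex


def build_unicycler_command(out_dir, long_reads, short_1=None, short_2=None,
                            min_anchor_seg_len=None):
    """Build an argument list for Unicycler (helper name is hypothetical)."""
    cmd = ["unicycler", "-l", long_reads, "-o", out_dir]
    # Paired short reads switch the run from long-read-only to hybrid.
    if short_1 and short_2:
        cmd += ["-1", short_1, "-2", short_2]
    # Only pass the anchor threshold when one was explicitly chosen.
    if min_anchor_seg_len is not None:
        cmd += ["--min_anchor_seg_len", str(min_anchor_seg_len)]
    return cmd


# Placeholder file names; replace with your own reads.
long_only = build_unicycler_command("long_only_out", "nanopore.fastq.gz")
hybrid = build_unicycler_command(
    "hybrid_out", "nanopore.fastq.gz",
    short_1="illumina_R1.fastq.gz", short_2="illumina_R2.fastq.gz",
    min_anchor_seg_len=10000,  # arbitrary example value
)

print(shlex.join(long_only))
print(shlex.join(hybrid))
```

The lists can be passed to `subprocess.run(cmd, check=True)` once Unicycler is installed. Trying a few thresholds and comparing contig counts should show whether short anchors are the cause.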