Interrogation regarding overlap-based assemblers

linxingchen / cobra

A tool to raise the quality of viral genomes assembled from short-read metagenomes via resolving and joining of contigs fragmented during de novo assembly.

MIT License


Thank you for the great tool! I am using MEGAHIT+VIBRANT+COBRA and get about 35 high-quality (HQ) phages from each of my metagenomes after validation with CheckV. I just tested the new overlap-based assembler PenguiN (https://github.com/soedinglab/plass). For the same samples, PenguiN+VIBRANT gives about 180 HQ phages, so roughly 5 times more.

I am wondering how you would compare the two tools, given that both are overlap-based. My understanding is that the large difference arises because COBRA often stops when an extension is ambiguous, whereas PenguiN uses a Bayesian rule to pick the best extension among the several candidates. I do not understand how such a liberal approach can avoid misassemblies, though.

Do you have an opinion on that?

Hi, thank you for your interest in COBRA and for letting me know about PenguiN. I had a quick look at the preprint and found that it was developed specifically for assembling viral genomes and 16S rRNA gene sequences. The two tools work differently. COBRA builds on the output of assemblers such as metaSPAdes, MEGAHIT, and IDBA_UD, which use a de Bruijn graph and break contigs at k-mer points where multiple paths are available. COBRA was developed to join those broken fragments based on their end overlaps, whose length is usually the maximum k-mer size. PenguiN, by contrast, is a new assembler that uses the end overlaps of reads.
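To make the end-overlap idea concrete, here is a minimal sketch in Python. It is not COBRA's implementation: the function names and the toy sequences are hypothetical, it only handles forward-strand exact matches, and it assumes the overlap length equals the maximum k-mer size `k`. It also stops when more than one candidate fits, mirroring the conservative behaviour described in the question.

```python
def end_overlap(left, right, k):
    """True if the last k bases of `left` equal the first k bases of `right`."""
    return len(left) >= k and len(right) >= k and left[-k:] == right[:k]


def join_if_unambiguous(query, candidates, k):
    """Extend `query` with the single candidate whose start overlaps its end.

    Returns the joined sequence, or None when zero or several candidates
    match (an ambiguous extension is left unresolved instead of guessed).
    """
    matches = [c for c in candidates if end_overlap(query, c, k)]
    if len(matches) != 1:
        return None
    # Keep the shared k bases only once in the joined sequence.
    return query + matches[0][k:]


if __name__ == "__main__":
    k = 4  # toy value; real assemblies use a much larger max k-mer
    print(join_if_unambiguous("ACGTACGGA", ["CGGATTTT"], k))
    print(join_if_unambiguous("ACGTACGGA", ["CGGATTTT", "CGGACCCC"], k))
```

The second call returns `None` because two contigs share the same end overlap, which is the kind of branch point where a de Bruijn graph assembler would have broken the contig in the first place.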

If you check Figure 1B of the preprint, you will see that some of the sequences from PenguiN are likely to be very similar to one another. You should check how many of the 180 HQ genomes are 99% similar to each other.
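One rough way to run that check is to dereplicate the genomes at a 99% threshold and count what remains. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, identity is estimated from exact k-mer Jaccard similarity with the Mash distance formula instead of a real alignment, reverse complements are ignored, and the all-against-all loop is only practical for a few hundred genomes. A dedicated ANI or clustering tool would be the better choice for real data.

```python
import math


def kmers(seq, k):
    """Set of all k-mers in `seq` (forward strand only)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def approx_identity(a, b, k=11):
    """Estimate sequence identity from k-mer Jaccard similarity.

    Uses the Mash distance D = -ln(2j / (1 + j)) / k and returns 1 - D.
    This is an approximation, not an alignment-based identity.
    """
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    j = len(ka & kb) / len(ka | kb)
    if j == 0:
        return 0.0
    distance = -math.log(2 * j / (1 + j)) / k
    return max(0.0, 1.0 - distance)


def dereplicate(genomes, threshold=0.99, k=11):
    """Greedy dereplication: longest genomes first, keep a genome only if it
    is below `threshold` identity to every representative kept so far."""
    representatives = []
    ordered = sorted(genomes.items(), key=lambda item: len(item[1]), reverse=True)
    for name, seq in ordered:
        if all(approx_identity(seq, genomes[rep], k) < threshold
               for rep in representatives):
            representatives.append(name)
    return representatives


if __name__ == "__main__":
    # Toy sequences standing in for the HQ phage genomes.
    toy = {
        "phage_1": "ACGTTGCAAGGCTTAACCGGTA" * 3,
        "phage_2": "ACGTTGCAAGGCTTAACCGGTA" * 3,  # exact duplicate of phage_1
        "phage_3": "T" * 40 + "G" * 40,
    }
    reps = dereplicate(toy)
    print(f"{len(reps)} representatives out of {len(toy)} genomes")
```

Comparing the number of representatives with the original count of HQ genomes shows how much of the gain comes from near-identical sequences.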


Interrogation regarding overlap-based assemblers #47