mikolmogorov / Flye

De novo assembler for single molecule sequencing reads using repeat graphs

Question regarding alt_group #349

Closed snayfach closed 3 years ago

snayfach commented 3 years ago

I'm using metaFlye to identify complete/circular viral genomes from metagenomes. Based on some testing with CheckV, I've found that circular contigs labeled as repetitive are often not true complete genomes, which makes sense. I also found false-positive circular viral genomes when the alt_group field is not equal to "". Can you shed any light on what this field means, and why a value other than "" might indicate an assembly artifact? For other users who might be interested: the final assembly and the 22-plasmids directory both contained bona fide circular viral genomes, which was nice to see.

Thanks for your help, Stephen

mikolmogorov commented 3 years ago

Hi Stephen,

Hmm, I am a bit surprised to hear about circular contigs that were marked as alternative. Have you been using the --keep-haplotypes option?

Flye (and metaFlye) collapse bubbles and more complex local tangles that correspond to potential structural variation between haplotypes. In the case of metagenomes, the bubbles often correspond to intra-strain variation. When Flye identifies a bubble, it assigns the same alt_group number to both branches, so you can recover which sequences were alternatives to each other.

By default, bubbles are collapsed and the alternative paths are disconnected from the graph. Thus, if a contig has an assigned alt_group id, it used to be part of a collapsed bubble. The --keep-haplotypes option preserves the bubbles, so that you can recover both alternative branches.
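
As a rough illustration of recovering which sequences were alternatives to each other, here is a minimal Python sketch that groups contigs by their alt_group value. It assumes a tab-separated table in the style of Flye's assembly_info.txt with seq_name and alt_group columns, and that "*" (or an empty string) means no group; the function name and the sample rows are made up for illustration, so check the header of your own file first.

```python
import csv
from collections import defaultdict


def group_by_alt_group(lines, no_group=("*", "")):
    """Map each alt_group id to the names of the contigs that carry it.

    `lines` is any iterable of tab-separated text lines, header first.
    The column names and the "*" placeholder are assumptions about the
    assembly_info.txt layout, not something confirmed in this thread.
    """
    rows = csv.reader(lines, delimiter="\t")
    header = [name.lstrip("#") for name in next(rows)]
    name_col = header.index("seq_name")
    alt_col = header.index("alt_group")

    groups = defaultdict(list)
    for row in rows:
        if not row:  # skip blank lines
            continue
        alt = row[alt_col]
        if alt not in no_group:
            groups[alt].append(row[name_col])
    return dict(groups)


# Made-up rows, for illustration only.
sample = [
    "#seq_name\tlength\tcov.\tcirc.\trepeat\tmult.\talt_group\tgraph_path",
    "contig_1\t45000\t30\tY\tN\t1\t*\t1",
    "contig_2\t5200\t60\tY\tY\t2\t1\t2",
    "contig_3\t5100\t28\tN\tY\t2\t1\t3",
    "contig_4\t38000\t12\tN\tN\t1\t*\t4",
]

alt_groups = group_by_alt_group(sample)
```

With a real run, the same function could be called as `group_by_alt_group(open("assembly_info.txt"))`, assuming the file has the header shown above.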

Hope this helps. We haven't extensively tested the haplotype output yet, so the labelling might not be perfect.

Mikhail

snayfach commented 3 years ago

To answer your question, I used the command: flye -t 64 --plasmids --nano-raw /path/to/fastq --meta --out-dir /path/to/out with version 2.8.2-b1689. However, looking at the output more closely, I now see that only contigs marked as repeats have an assigned alt_group id (though not all repeats have one). So throwing away the repeats solves both issues for me.
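
For anyone who wants to apply the same filter, here is a minimal Python sketch of the idea: keep circular contigs that are neither marked as repeats nor assigned an alt_group id. The column names (#seq_name, circ., repeat, alt_group), the Y/N flags, and the "*" placeholder for "no group" are assumptions about the assembly_info.txt layout, and the function name and sample rows are made up for illustration.

```python
import csv


def candidate_circular_contigs(lines, no_group=("*", "")):
    """Return names of circular contigs that are not repeats and have no alt_group.

    `lines` is any iterable of tab-separated text lines, header first.
    Column names and flag values are assumed, so verify them against
    the header of your own assembly_info.txt.
    """
    keep = []
    for row in csv.DictReader(lines, delimiter="\t"):
        is_circular = row["circ."] == "Y"
        is_repeat = row["repeat"] == "Y"
        has_alt_group = row["alt_group"] not in no_group
        if is_circular and not is_repeat and not has_alt_group:
            keep.append(row["#seq_name"])
    return keep


# Made-up rows, for illustration only.
example_rows = [
    "#seq_name\tlength\tcov.\tcirc.\trepeat\tmult.\talt_group\tgraph_path",
    "contig_1\t45000\t30\tY\tN\t1\t*\t1",
    "contig_2\t5200\t60\tY\tY\t2\t1\t2",
    "contig_3\t5100\t28\tN\tY\t2\t1\t3",
    "contig_4\t38000\t12\tN\tN\t1\t*\t4",
]

kept = candidate_circular_contigs(example_rows)
```

The alt_group check is redundant if every contig with an alt_group id is also marked as a repeat, as in my output, but it costs nothing to keep as a safeguard.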

Thanks again for your help.

Stephen