Closed russellj7 closed 4 years ago
Hi,
This node is likely a single repeat that was left unresolved. So you have one long unique edge and one short repetitive edge. Contigs are built from unique edges and are extended into repeats, so the final single contig contains both the unique edge and one or more copies of the repeat. Because the repeat is already part of a contig, it is not output as a separate sequence (hence only one contig in assembly.fasta).
Contigs are only marked circular if there are no unresolved repeats. The chromosome is indeed nearly complete in this case, but you can't rule out the possibility of (i) a tandem repeat (so the edge should be repeated) or (ii) a plasmid sharing a repeat with the chromosome. It is likely that with manual analysis (for example, using coverage) you can conclude whether the chromosome is complete or not. It is somewhat difficult to derive an automatic rule for that, though.
Hope this helps, Mikhail
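For anyone following along, the coverage check mentioned above could be sketched roughly as follows. This is a minimal, hypothetical example and not part of Flye: it assumes the GFA stores each edge as an `S` line and that per-edge depth sits in a `dp` tag (the tag name is an assumption, so adjust `depth_tag` if your file differs). It divides each edge's depth by the depth of the longest edge, so a value near 2 would hint at a two-copy repeat.

```python
def edge_depths(gfa_text, depth_tag="dp"):
    """Return {segment_name: (length, depth)} parsed from GFA 'S' lines.

    The depth tag name is an assumption; depth is None if the tag is absent.
    """
    edges = {}
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3 or fields[0] != "S":
            continue
        name, seq = fields[1], fields[2]
        length = len(seq)
        depth = None
        for tag in fields[3:]:
            parts = tag.split(":", 2)
            if len(parts) != 3:
                continue
            if parts[0] == depth_tag:
                depth = float(parts[2])
            elif parts[0] == "LN" and seq == "*":
                # Sequence omitted: fall back to the declared length.
                length = int(parts[2])
        edges[name] = (length, depth)
    return edges


def relative_copy_number(edges):
    """Divide each edge's depth by the depth of the longest edge."""
    longest = max(edges, key=lambda name: edges[name][0])
    base = edges[longest][1]
    if not base:
        raise ValueError("longest edge has no usable depth value")
    return {name: depth / base
            for name, (_, depth) in edges.items() if depth is not None}


# Toy input with made-up sequences and depths, for illustration only.
toy_gfa = (
    "H\tVN:Z:1.0\n"
    "S\tedge_1\tACGTACGTACGT\tdp:i:30\n"
    "S\tedge_2\tACGT\tdp:i:61\n"
)
ratios = relative_copy_number(edge_depths(toy_gfa))
```

A ratio like this is only a hint: it cannot by itself distinguish a tandem repeat from a repeat shared with a plasmid, which is the ambiguity described above.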
Hi, I'm puzzling over what might be a similar question, where contig sizes differ in the assembly (fasta file) and the assembly graph.
The assembly graph in bandage looks like this:
but the contigs in the scaffolds.fasta have different lengths:
Do you think this is from the unique contigs (contigs 2 and 3 in the graph) being extended into the repeat (contig 1 in the graph) in an overlapping way, suggesting additional smaller collapsed repeats?
Thanks for any insights!
@AnnaSyme
Yes, this is what I would guess too: a circular chromosome with one unresolved repeat (from the graph structure and coverage of the edges). You can also see that the two copies of the repeat should be inverted in the genome (you need to go in both directions in the Bandage graph to traverse the entire graph). As you said, contig_2 was extended into the repeat in both directions (you can see it in the last column of the table).
@fenderglass Thank you!
In the graph, contig2 + full extension into the inverted repeat (twice) is shorter than contig2 (graph path 1,2,-1) in the assembly.fasta. Could the assembly.fasta separate out some repeats that are collapsed in the assembly graph perhaps (e.g. smaller repeats within the inverted-repeat)?
@AnnaSyme hmm, that doesn't sound right. In fact, the length of contig_3 also became 2kb longer. One thing to check is that you are comparing the graph and contigs after polishing: those would be the assembly_graph.gfa, assembly.fasta and assembly_info.txt files in the output folder. Also, could you send me the log file?
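One rough way to run the comparison discussed here is to add up the graph edge lengths along a contig's graph path and set that next to the contig length in the FASTA. The sketch below is hypothetical and not Flye's own logic: it assumes the path is a comma-separated list of signed edge IDs such as 1,2,-1, skips any non-numeric tokens, and ignores possible overlaps between adjacent edges. The edge lengths in the example are made up.

```python
def path_length(graph_path, edge_lengths):
    """Sum edge lengths along a comma-separated path of signed edge IDs.

    A negative ID means the edge is traversed in reverse, which does not
    change its length. Non-numeric tokens are skipped.
    """
    total = 0
    for token in graph_path.split(","):
        try:
            edge_id = abs(int(token.strip()))
        except ValueError:
            continue  # not an edge ID
        total += edge_lengths[edge_id]
    return total


# Made-up lengths: edge 1 is a short repeat, edge 2 a long unique edge.
example_lengths = {1: 5_000, 2: 1_400_000}
expected = path_length("1,2,-1", example_lengths)
```

If the contig in the FASTA is clearly longer or shorter than the summed path, the first thing to rule out is that the graph and the contigs come from different stages of the run.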
@fenderglass In regards to my original 2 questions, with additional PacBio data it appears the repeat is resolved. (See graph below). The additional data must have contained a read(s) that spanned the unresolved repeat. The coverage for the repeat was about twice that of the rest of the chromosome, so perhaps it's a repeat with multiple copies.
Thank you for answering my questions! And thank you for creating/maintaining Flye. It has been a great resource as an alternative to Canu or HGAP. (It produces better, faster results too 👍 )
@russellj7
Thanks for the feedback! If you are curious, you can check if the repeat was indeed resolved by aligning the repetitive sequence from the 1st assembly (you can extract it from the assembly_graph.gfa) to the new assembly.
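Extracting the repeat edge from the GFA can be done with a few lines of Python. This is a minimal sketch under the assumption that the graph stores sequences inline on `S` lines (the standard GFA layout); the segment name `edge_2` and the toy sequences are placeholders, not values from this assembly.

```python
def segment_to_fasta(gfa_text, segment_name, width=60):
    """Return one GFA segment as a FASTA record, wrapped at `width` columns."""
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 3 and fields[0] == "S" and fields[1] == segment_name:
            seq = fields[2]
            wrapped = "\n".join(
                seq[i:i + width] for i in range(0, len(seq), width)
            )
            return f">{segment_name}\n{wrapped}\n"
    raise KeyError(f"segment {segment_name!r} not found in GFA")


# Toy graph with placeholder names and sequences.
toy_gfa = "S\tedge_1\tACGTACGTAC\nS\tedge_2\tACGTAC\n"
record = segment_to_fasta(toy_gfa, "edge_2", width=4)
```

The resulting record can be written to a file and aligned to the new assembly with whichever aligner you normally use; seeing the repeat placed at two or more distinct locations would support it having been resolved.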
Hi @fenderglass
Sorry I've jumped in this issue but mine may be different after all.
I was using flye version 2.3.7 where I got these differing contig sizes (between the scaffolds and the graph - both after polishing). But: using version 2.4 I don't get any differences.
(I'm guessing that in version 2.3.7 there were repeats that were collapsed in the assembly graph, but able to be resolved better (separated) in the assembly fasta file? But that newer flye versions can show these in the assembly graph?)
I will use flye versions 2.4 or onwards. Thanks for all your help!
@AnnaSyme makes sense - there have been many improvements in 2.4. In fact, I would recommend always using the latest available release (currently v2.6).
Closing the thread, feel free to reopen if you have any follow-ups!
Hello!
I have 2 questions about the output of Flye visualized using Bandage for a bacterial genome assembly (PacBio RSII data, genome size ~1.5 Mb, average coverage ~30). I've attached the Bandage result from the final assembly_graph.gfa below.
Flye resulted in a single contig contained in assembly.fasta. From the assembly graph, there appears to be a single, circular chromosome... but I'm wondering what the extra node branching off could be? Since it connects to the main chromosome, is it likely duplicated sequence?
Also, in assembly_info.txt the circularity is listed as negative (-) but the chromosome appears circular in the Bandage graph. Why might the graph and assembly info not agree?
Thank you for your help!