mikolmogorov / Flye

De novo assembler for single molecule sequencing reads using repeat graphs

Questions about Flye assembly graph output #202

Closed russellj7 closed 4 years ago

russellj7 commented 4 years ago

Hello!

I have two questions about the Flye output for a bacterial genome assembly (PacBio RSII data, genome size ~1.5 Mb, average coverage ~30x), visualized using Bandage. I've attached the Bandage rendering of the final assembly_graph.gfa below.

Flye resulted in a single contig contained in assembly.fasta. From the assembly graph, there appears to be a single, circular chromosome... but I'm wondering what the extra node branching off could be? Since it connects to the main chromosome, is it likely duplicated sequence?

Also, in assembly_info.txt the circularity is listed as negative (-) but the chromosome appears circular in the Bandage graph. Why might the graph and assembly info not agree?

Thank you for your help!

Screen Shot 2020-01-07 at 12 32 08 PM

mikolmogorov commented 4 years ago

Hi,

This node is likely a single repeat that was left unresolved. So you have one long unique edge and one short repetitive edge. Contigs are built from unique edges and are extended into repeats - so the final single contig contains both the unique edge and one or multiple copies of the repeat. Because the repeat is already part of a contig, it is not output as a separate sequence (hence only one contig in assembly.fasta).

Contigs are only marked circular if there are no unresolved repeats. It is indeed nearly complete in this case, but you can't rule out the possibility of (i) a tandem repeat (so the edge should be repeated) or (ii) a plasmid sharing a repeat with the chromosome. With manual analysis (for example, using coverage) you can likely conclude whether the chromosome is complete or not. It is somewhat difficult to derive an automatic rule for that, though.

Hope this helps, Mikhail
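The coverage check suggested above can be sketched in Python. This is a minimal illustration, not part of Flye: it assumes each `S` line of assembly_graph.gfa carries a depth tag (taken here to be `dp`, which is an assumption to verify against your own file) and that the longest edge is a single-copy reference. The function names `read_segments` and `copy_number_estimates` are hypothetical.

```python
from typing import Dict, Tuple


def read_segments(gfa_text: str, depth_tag: str = "dp") -> Dict[str, Tuple[int, float]]:
    """Return {segment_name: (length, depth)} from the 'S' lines of a GFA file.

    Assumption: depth is stored in an optional tag such as 'dp:i:30'.
    Segments without that tag get a depth of NaN.
    """
    segments = {}
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3 or fields[0] != "S":
            continue  # skip headers, links, and malformed lines
        name, seq = fields[1], fields[2]
        depth = float("nan")
        for tag in fields[3:]:
            parts = tag.split(":", 2)  # TAG:TYPE:VALUE
            if len(parts) == 3 and parts[0] == depth_tag:
                depth = float(parts[2])
        segments[name] = (len(seq), depth)
    return segments


def copy_number_estimates(segments: Dict[str, Tuple[int, float]]) -> Dict[str, float]:
    """Depth of each segment divided by the depth of the longest segment.

    Assumption: the longest segment is unique (single copy) in the genome.
    """
    reference = max(segments, key=lambda name: segments[name][0])
    reference_depth = segments[reference][1]
    return {name: depth / reference_depth for name, (_, depth) in segments.items()}


if __name__ == "__main__":
    # Toy graph with invented values: one long edge and one short edge at twice the depth.
    toy_gfa = "S\tedge_1\t" + "A" * 20 + "\tdp:i:30\nS\tedge_2\tACGTA\tdp:i:60\n"
    for name, ratio in copy_number_estimates(read_segments(toy_gfa)).items():
        print(name, round(ratio, 2))
```

A ratio near 2 for the short edge would be consistent with a repeat present in two copies, but it does not by itself distinguish a tandem repeat from a plasmid sharing sequence with the chromosome.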

AnnaSyme commented 4 years ago

Hi, I'm puzzling over what might be a similar question, where contig sizes differ in the assembly (fasta file) and the assembly graph.

The assembly graph in bandage looks like this:

sweet-potato-assembly-graph

but the contigs in the scaffolds.fasta have different lengths:

Screen Shot 2020-01-09 at 16 17 35

Do you think this is from the unique contigs (= contigs 2 and 3 in the graph) being extended into repeat (= contig 1 in the graph), in an overlapping way? (suggesting additional smaller collapsed repeats)?

Thanks for any insights!

mikolmogorov commented 4 years ago

@AnnaSyme

Yes, this is what I would guess too - a circular chromosome with one unresolved repeat (from the graph structure and coverage of the edges). You can also see that the two copies of the repeat should be inverted in the genome (you need to go in both directions in the Bandage graph to traverse the entire graph). As you said, contig_2 was extended into the repeat in both directions (you can see it in the last column of the table).

AnnaSyme commented 4 years ago

@fenderglass Thank you!

In the graph, contig2 + full extension into the inverted repeat (twice) is shorter than contig2 (graph path 1,2,-1) in the assembly.fasta. Could the assembly.fasta separate out some repeats that are collapsed in the assembly graph perhaps (e.g. smaller repeats within the inverted-repeat)?

mikolmogorov commented 4 years ago

@AnnaSyme hmm, that doesn't sound right. In fact, the length of contig_3 also became 2kb longer. One thing to check is that you are comparing the graph and contigs after polishing: those would be assembly_graph.gfa, assembly.fasta and assembly_info.txt files in the output folder. Also, could you send me the log file?
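One way to make this comparison concrete is to sum the edge lengths along each contig's graph path and compare the total with the contig length reported in assembly_info.txt. The sketch below is an illustration under assumptions: it takes the first column of the table as the contig name, the second as its length, and the last as a comma-separated graph path such as `1,2,-1`, where a leading `-` marks a reversed edge. All names and numbers in the example are invented.

```python
from typing import Dict, Tuple


def read_info(info_text: str) -> Dict[str, Tuple[int, str]]:
    """Return {contig_name: (length, graph_path)} from a tab-separated info table.

    Assumptions: column 1 is the name, column 2 the length, and the last
    column the graph path. Lines starting with '#' are treated as headers.
    """
    rows = {}
    for line in info_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        rows[fields[0]] = (int(fields[1]), fields[-1])
    return rows


def path_length(graph_path: str, edge_lengths: Dict[str, int]) -> int:
    """Sum the lengths of the edges listed in a path like '1,2,-1'.

    Orientation signs are ignored; entries that are not edge IDs
    (for example '*' placeholders) are skipped.
    """
    total = 0
    for step in graph_path.split(","):
        edge_id = step.strip().lstrip("+-")
        if edge_id in edge_lengths:
            total += edge_lengths[edge_id]
    return total


if __name__ == "__main__":
    # Invented example: edge 1 is a 5 kb repeat, edge 2 a 100 kb unique edge.
    edge_lengths = {"1": 5000, "2": 100000}
    info = "#seq_name\tlength\tgraph_path\ncontig_2\t112000\t1,2,-1\n"
    for name, (length, path) in read_info(info).items():
        expected = path_length(path, edge_lengths)
        print(name, "contig:", length, "path sum:", expected, "difference:", length - expected)
```

A nonzero difference only flags the discrepancy; it does not explain it. As noted above, the first thing to rule out is comparing files from different stages (before and after polishing).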

russellj7 commented 4 years ago

@fenderglass Regarding my original 2 questions: with additional PacBio data, it appears the repeat is resolved (see graph below). The additional data must have contained one or more reads that spanned the unresolved repeat. The coverage for the repeat was about twice that of the rest of the chromosome, so perhaps it's a repeat with multiple copies.

Screen Shot 2020-01-13 at 9 16 48 AM

Thank you for answering my questions! And thank you for creating/maintaining Flye. It has been a great resource as an alternative to Canu or HGAP. (It produces better, faster results too 👍 )

mikolmogorov commented 4 years ago

@russellj7

Thanks for the feedback! If you are curious, you can check if the repeat was indeed resolved by aligning the repetitive sequence from the 1st assembly (you can extract it from the assembly_graph.gfa) to the new assembly.
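Extracting the repeat from the graph file can be sketched as follows. This is a minimal example, assuming the sequence is stored inline on the `S` line of the GFA file; the segment name `edge_2` and the function name `segment_to_fasta` are hypothetical. The resulting FASTA record can then be aligned to the new assembly with whichever aligner you normally use.

```python
def segment_to_fasta(gfa_text: str, segment_name: str, width: int = 60) -> str:
    """Return one GFA segment as a FASTA record, wrapped at `width` columns.

    Raises KeyError if the segment is absent and ValueError if the GFA
    stores no inline sequence ('*') for it.
    """
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 3 and fields[0] == "S" and fields[1] == segment_name:
            seq = fields[2]
            if seq == "*":
                raise ValueError(f"no inline sequence for {segment_name}")
            wrapped = "\n".join(seq[i:i + width] for i in range(0, len(seq), width))
            return f">{segment_name}\n{wrapped}\n"
    raise KeyError(segment_name)


if __name__ == "__main__":
    # Toy input; in practice, read the text of assembly_graph.gfa instead.
    toy_gfa = "H\tVN:Z:1.0\nS\tedge_2\tACGTACGTAC\tdp:i:60\n"
    print(segment_to_fasta(toy_gfa, "edge_2", width=4), end="")
```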

AnnaSyme commented 4 years ago

Hi @fenderglass

Sorry I've jumped in this issue but mine may be different after all.

I was using flye version 2.3.7 where I got these differing contig sizes (between the scaffolds and the graph - both after polishing). But: using version 2.4 I don't get any differences.

(I'm guessing that in version 2.3.7 there were repeats that were collapsed in the assembly graph, but able to be resolved better (separated) in the assembly fasta file? But that newer flye versions can show these in the assembly graph?)

I will use flye versions 2.4 or onwards. Thanks for all your help!


mikolmogorov commented 4 years ago

@AnnaSyme makes sense - there have been many improvements in 2.4. In fact, I would recommend always using the latest available release (currently v2.6).

mikolmogorov commented 4 years ago

Closing the thread, feel free to reopen if you have any follow-ups!