mikolmogorov / Flye

De novo assembler for single molecule sequencing reads using repeat graphs

Positive and Negative Sense concatenating together for each contig #674

Closed BABRIGGS closed 1 month ago

BABRIGGS commented 4 months ago

flye.log

When I assemble various bacterial genomes each contig/genome seems to be doubled in size due to the entire contig being repeated and then concatenated together. One repeat seems to be the positive sense strand while the other seems to be the negative sense strand. This has happened to circular and linear contigs ranging from 16Kb to 10Mb.

Is there a way to combat this or correct it?

I have included the entire Flye log with this post.

mikolmogorov commented 4 months ago

Hello,

In your email you mentioned a 5Mb duplication, but your assembly size is 1.9 Mb. Is this the same dataset?

Which contigs have duplications and how did you detect this duplication?

Best, Misha

BABRIGGS commented 4 months ago

Hi, we have multiple genomes within that sample (two bacteria and Vero cell DNA), which is why we have to use the meta argument and why the assembly is larger. Based on our analysis, it looks like every contig is doubled to some extent, if not completely.

We detected this through BLAST/RAST and our own gene analyses; almost every gene that is present on each contig is duplicated.

Any insight would be helpful.

Thanks, Barrett Briggs

mikolmogorov commented 4 months ago

If you have two bacteria in the sample, it is likely that Flye actually recovers two separate genomes. What exactly do you mean by doubled, and how do you quantify this?

BABRIGGS commented 4 months ago

We are getting the complete genomes for each bacterium in the sample. When we look at the individual contigs that Flye exports, each contig is its own genome for one of our bacteria. Within each contig, the sequences are doubled. We ran gene analysis on RAST and BLASTed the sequences to visualize this. For example, contig 1 is around 10Mb. When we look at its sequence/genes, they are doubled, and we think the actual size of this contig is around 5Mb (which would be the complete genome for our Paenibacillus). Based on BLAST analysis, it looks like Flye is duplicating it by linking the positive and negative sense strands together.

We are seeing this with other assemblies as well. We have a Borrelia burgdorferi sample, and since Borrelia has a linear, segmented genome, we should get 20-24 separate contigs of varying lengths (5kb-900kb). When we look at these contigs, several are double their expected size, showing a similar situation to the one I described above.
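
One way to put a number on "positive and negative strands linked together" is to ask what fraction of a contig's k-mers also occur in it as their reverse complement. The sketch below is only an illustration: the function names and the choice of k=21 are assumptions, not part of Flye or of the analysis described in this thread. A contig made of a sequence followed by its own reverse complement is identical to its reverse complement, so the fraction is 1.0, while an ordinary contig without large inverted repeats should score close to 0.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (N is passed through)."""
    table = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(table)[::-1]


def inverted_kmer_fraction(seq: str, k: int = 21) -> float:
    """Fraction of distinct k-mers whose reverse complement also occurs in seq.

    Hypothetical helper: values near 1.0 suggest the contig is largely an
    inverted duplication; values near 0.0 suggest it is not.
    """
    seq = seq.upper()
    kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    if not kmers:
        return 0.0  # sequence shorter than k
    hits = sum(1 for kmer in kmers if reverse_complement(kmer) in kmers)
    return hits / len(kmers)
```

Intermediate values are plausible too: a contig in which only part of the sequence is repeated in the opposite sense would land somewhere between the two extremes.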

Any insight would be helpful, Thanks

BABRIGGS commented 4 months ago

I was wondering if you had any update on this? Our lab group is waiting to see if we need to reassemble with different parameters or if we should move forward with our genome analyses.

Thanks, Barrett

mikolmogorov commented 4 months ago

Hi Barrett,

Sorry for my late response. I am not sure I 100% understand what you mean by doubling. How do you quantify this? Is there output from a tool from which you conclude that there is doubling? Can you share it and describe your interpretation?

Flye in general should not produce extensive duplications, and I have not seen anything like this before. Could you please produce dot plots of the large contigs (1 Mb+) that you think are duplicated? E.g. with tools like MashMap or gepard (https://cube.univie.ac.at/gepard)?
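
For readers unfamiliar with dot plots, the idea can be sketched with exact k-mer matches. This is a toy illustration, not how MashMap or gepard work internally; the function names and k=15 are assumptions. Forward matches trace a rising diagonal, and reverse-complement matches trace a falling one, which is the signature of a segment repeated in the opposite sense.

```python
from collections import defaultdict


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (N is passed through)."""
    table = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(table)[::-1]


def kmer_dotplot(ref: str, query: str, k: int = 15):
    """Return (forward, reverse) lists of (ref_pos, query_pos) exact k-mer hits.

    Hypothetical helper: plot `forward` and `reverse` in two colours to get
    a basic dot plot of query (Y axis) against ref (X axis).
    """
    ref, query = ref.upper(), query.upper()
    index = defaultdict(list)
    for i in range(len(ref) - k + 1):
        index[ref[i:i + k]].append(i)

    forward, reverse = [], []
    for j in range(len(query) - k + 1):
        kmer = query[j:j + k]
        # Same-strand hits
        for i in index.get(kmer, ()):
            forward.append((i, j))
        # Opposite-strand hits
        for i in index.get(reverse_complement(kmer), ()):
            reverse.append((i, j))
    return forward, reverse
```

Exact k-mer matching is only practical for short, low-error sequences such as plasmid-sized contigs; for noisy or megabase-scale input, a dedicated tool is the better choice.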

BABRIGGS commented 4 months ago

Almost all of our contigs are going to be smaller than 1Mb. Would that place them in the category of smaller contigs that you have seen repeats in before?

Our entire Borrelia genome is about 1.5 Mb, split into one chromosome and numerous plasmids.

We know the main chromosome should be about 920 kb. Flye output a contig that was 1.16 Mb. The dot plot for this contig is below (reference on the X axis and our contig on the Y axis). You can see that about 1/3 of the genome is repeated in the opposite sense. This is contig 23 only, from the Flye outputs.

Here is an example of our Borrelia plasmid lp38. The contig should be about 38 kb; however, Flye produced a contig that is 66 kb. The dot plot is below (reference on the X axis and our sequence on the Y axis). You can see almost the entire sequence is duplicated. This is contigs 24, 25, and 27 scaffolded together.

Finally, a third example. This is lp25; the sequence length should be about 25 kb, but Flye produced a contig that is 48 kb. The dot plot, in the same orientation as above, is below. This is contig 38 only.
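
To make the three size comparisons easier to read side by side, the arithmetic can be written out as below. The lengths are the approximate figures quoted in these examples; the helper name is only illustrative.

```python
# Expected vs. assembled lengths in kb, as quoted in the three examples.
examples = {
    "chromosome (contig 23)": (920, 1160),
    "lp38 (contigs 24, 25, 27)": (38, 66),
    "lp25 (contig 38)": (25, 48),
}


def excess_fraction(expected_kb: float, assembled_kb: float) -> float:
    """Extra assembled sequence as a fraction of the expected length."""
    return (assembled_kb - expected_kb) / expected_kb


for name, (expected, assembled) in examples.items():
    print(f"{name}: {assembled / expected:.2f}x expected, "
          f"{excess_fraction(expected, assembled):.0%} extra")
```

With these figures the excess is roughly 26% for the chromosome, 74% for lp38, and 92% for lp25, so only the smallest plasmid is close to a full doubling.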

These 3 examples are from the same assembly (but not the one that I submitted on GitHub). I have attached the Flye log here. 

All of these contigs are supposed to be smaller than 1 Mb; only for the first one, the chromosome, did Flye produce a contig larger than that.

All of these contigs are linear as well, so we don't see a reason for there to be any overlap from circularization.

I am not sure if this helps you understand our issue better. We are getting contigs with duplicated sequences that we know should not be duplicated, as shown in the dot plots above. The duplicates seem to come from linking the positive and negative sense DNA sequences together, as you can see in the dot plots.

I cannot explicitly show you the contig 1 that I referenced before, as we do not have a reference genome for it.

Let me know if this helps at all, and if you have a proposed solution. Thanks again, Barrett

mikolmogorov commented 3 months ago

Hi @BABRIGGS

Unfortunately, I can't see the uploaded images. I see that you responded via email; could you please try adding the images via GitHub (https://github.com/fenderglass/Flye/issues/674)?

Thanks, Misha

BABRIGGS commented 3 months ago

I have attached the images and Flye log below:

EX1:

ex1

EX2:

ex2

EX3:

ex3

Flye Log flye.log

mikolmogorov commented 3 months ago

Thank you, this info is very helpful!

I would separate duplications in plasmids from those in chromosomal sequences. Duplication of plasmids is indeed a known issue with Flye. It is mostly relevant to circular sequences under 100kb. Here is a nice writeup by Ryan Wick that covers plasmids, but has some other useful info: https://rrwick.github.io/2020/10/30/guide-to-bacterial-genome-assembly.html

It is strange however that plasmids are duplicated into the opposite strand. Is it possible that these are linear plasmids, with some kind of inverted terminal repeats? And maybe captured at a weird stage of their replication cycle?

I have the same suspicion about your chromosome. In general, it is definitely unexpected for Flye to make artificial duplications of 100kb+. Could these be inverted terminal repeats?

It would be helpful if you could also upload the Bandage visualization of assembly_graph.gfa and the assembly_info.txt file.

Thanks, Misha

BABRIGGS commented 3 months ago

I have uploaded the info txt file; however, GitHub will not let me upload the gfa file. assembly_info.txt

The info file also marks plasmids that are supposed to be linear as circular (contigs 29, 35, 34, 36, 32, and 38). Contig 37 should be the only circular one. Contigs 17, 18, 19, 20, 21, and 22 make up a group of highly similar circular plasmids. We expected to have difficulty assembling them, so I am not as concerned with those, but the others should not be showing up as circular. Could that be why they are showing up duplicated? If that is the case, is there a way to fix that?
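
Listing the contigs that Flye flagged as circular can be scripted rather than read by eye. The sketch below assumes the usual tab-separated layout of assembly_info.txt with a header column named `circ.` and values of `Y`/`N` (or `+`/`-` in some older releases); please check the header of your own file, since the function name and the accepted values here are assumptions.

```python
import csv


def circular_contigs(path: str) -> list[str]:
    """Return names of contigs flagged circular in a Flye assembly_info.txt.

    Assumes a tab-separated file whose header row contains a 'circ.' column.
    """
    circular = []
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = [name.strip() for name in next(reader)]
        # Locate the circularity column by name rather than by position.
        circ_col = header.index("circ.")
        for row in reader:
            if len(row) > circ_col and row[circ_col].strip() in {"Y", "+"}:
                circular.append(row[0])
    return circular
```

Comparing the returned names against the replicons known to be linear gives a quick list of contigs whose circular call is suspect.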

I would be skeptical of ITRs as the explanation, since the duplicated genes include ones that should not be at the ends of the DNA strands. It is more that the entire contig is duplicated/repeated.

I will take a look at RRwick's info.

Thanks so much again, Barrett

mikolmogorov commented 3 months ago

Thanks Barrett,

For gfa - I don't need the actual file, but could you please use this tool to visualize and just post the image? https://github.com/rrwick/Bandage

From the assembly_info, I see that there is definitely a mix of linear and circular. What I would also do is try visualizing read alignments against the assembly in IGV. You can check the alignments around the areas where you expect the chromosome to end. If it's an artifact, you should see no reads, or only a few, spanning these positions. Feel free to post those as well.
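
As a numeric complement to looking at the pileup in IGV, the number of alignments that span a suspected junction can be counted from a PAF file (minimap2's default output format). This is a different check from the IGV inspection suggested above, sketched under assumptions: the function name, the 500 bp flank, and the contig name in the usage line are all hypothetical.

```python
def count_spanning_reads(paf_path: str, contig: str, position: int,
                         flank: int = 500) -> int:
    """Count alignments on `contig` that cover position-flank..position+flank.

    PAF columns used (0-based indices): 5 = target name,
    7 = target start, 8 = target end.
    """
    spanning = 0
    with open(paf_path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 12 or fields[5] != contig:
                continue
            target_start, target_end = int(fields[7]), int(fields[8])
            if target_start <= position - flank and target_end >= position + flank:
                spanning += 1
    return spanning


# Hypothetical usage: reads spanning position 920000 on a contig named contig_23
# n = count_spanning_reads("reads_vs_assembly.paf", "contig_23", 920_000)
```

A count close to the contig's average coverage would argue that the junction is real; a count near zero would point towards an artifact.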

BABRIGGS commented 3 months ago

Here is my bandage file for the assembly that I sent the dot plots for. graph.

Our lab is not familiar with IGV. Besides the assembly file, which files should I use for the read alignments to visualize this?

Thanks, Barrett

mikolmogorov commented 3 months ago

Barrett - you'll need to realign the original reads against the assembly and use the resulting BAM file as IGV input.
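
One common way to produce such a BAM file is minimap2 piped into samtools; the thread does not say which aligner to use, so this is only a sketch. The file names are placeholders, and the `map-ont` preset is an assumption about the read type (`map-pb` or `map-hifi` would apply to PacBio data).

```python
import subprocess


def alignment_commands(assembly: str, reads: str, bam: str,
                       preset: str = "map-ont", threads: int = 4):
    """Build the minimap2 / samtools command lines (nothing is run here)."""
    align = ["minimap2", "-a", "-x", preset, "-t", str(threads), assembly, reads]
    sort = ["samtools", "sort", "-@", str(threads), "-o", bam, "-"]
    index = ["samtools", "index", bam]
    return align, sort, index


def realign(assembly: str, reads: str, bam: str,
            preset: str = "map-ont", threads: int = 4) -> None:
    """Align reads to the assembly, then write a sorted, indexed BAM."""
    align, sort, index = alignment_commands(assembly, reads, bam, preset, threads)
    # Stream SAM from minimap2 straight into samtools sort.
    with subprocess.Popen(align, stdout=subprocess.PIPE) as aligner:
        subprocess.run(sort, stdin=aligner.stdout, check=True)
    if aligner.returncode != 0:
        raise RuntimeError("minimap2 exited with an error")
    subprocess.run(index, check=True)


# Hypothetical usage:
# realign("assembly.fasta", "reads.fastq", "reads_vs_assembly.bam")
```

In IGV, the assembly FASTA is then loaded as the genome and the sorted, indexed BAM as a track.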

amgroth commented 3 months ago

I've seen a similar pattern with one of my assemblies as well. The Bandage image is below; edges 2 and 3 are the contigs of interest. I have screenshots of a BLAST dot plot (self to self) and coverage from Minimap visualized in Geneious. Edge 2 is the first pair of images. Edge 3 is the second pair of images.

contig 2 coverage Contig 2 dotplot contig 3 coverage contig 3 dotplot

graph

Thanks- Adam

mikolmogorov commented 3 months ago

Adam - these could be self-complementary repeats (e.g. ATATATATA).
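
To illustrate what "self-complementary" means here, the sketch below checks whether a sequence equals its own reverse complement. The helper names are illustrative only.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (N is passed through)."""
    table = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(table)[::-1]


def is_self_complementary(seq: str) -> bool:
    """True if the sequence reads the same on both strands."""
    seq = seq.upper()
    return seq == reverse_complement(seq)


print(is_self_complementary("ATATATAT"))   # even-length AT repeat
print(is_self_complementary("ATATATATA"))  # the 9-base example above
print(reverse_complement("ATATATATA"))
```

An even-length AT repeat is exactly its own reverse complement. The odd-length example is not, strictly speaking, but its reverse complement is the same repeat shifted by one base, so such a region still aligns to itself on the opposite strand and shows up as an inverted match in a self-to-self dot plot.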

mikolmogorov commented 1 month ago

Closing due to inactivity - feel free to follow up if you have more questions!