jasperlinthorst / reveal

Graph based multi genome aligner
MIT License

Collapsing repetitive regions #10

Closed ChriKub closed 6 years ago

ChriKub commented 7 years ago

Hi, I'm aligning 7 genomes. Each genome contains repetitive elements, which I would expect to align with themselves within the same genome, so I would expect segments in the GFA that are traversed more than 7 times. In my alignments, however, the maximum number of traversals through a segment is 7. Is this intended behaviour, and is there any way to allow self-alignments?

Thanks, Chris
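
For anyone wanting to check the traversal counts described above, here is a minimal sketch. It assumes the output is GFA1 and that each genome is recorded as a `P` (path) line; whether reveal's output has exactly this layout is an assumption, and the function name `count_segment_traversals` is made up for illustration.

```python
from collections import Counter

def count_segment_traversals(gfa_lines):
    """Count how often each segment is visited across all GFA1 P (path) lines."""
    counts = Counter()
    for line in gfa_lines:
        if not line.startswith("P\t"):
            continue
        fields = line.rstrip("\n").split("\t")
        # fields[2] is a comma-separated step list such as "1+,2-,3+"
        for step in fields[2].split(","):
            counts[step[:-1]] += 1  # drop the trailing orientation sign
    return counts

# Toy graph (hypothetical data): genomeB visits segment 2 twice.
example = [
    "S\t1\tACGT",
    "S\t2\tTTT",
    "P\tgenomeA\t1+,2+\t*",
    "P\tgenomeB\t1+,2+,2+\t*",
]
traversals = count_segment_traversals(example)
```

In a graph without self-alignments, no count can exceed the number of input genomes, which matches the maximum of 7 reported here.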

jasperlinthorst commented 7 years ago

Hi Chris, yes, this is intended behaviour. With reveal I aim to align only sequence that (locally) matches uniquely, given the hierarchical decomposition of the alignment. As a result, you should always end up with a directed acyclic graph. See the preprint for more details. For now I don't foresee any effort to allow self-alignments (or loops) in the graph. Maybe a de Bruijn graph approach would be better suited to your use case?

Cheers, Jasper
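
The acyclicity property mentioned above can be checked on a GFA file with a topological sort. This is a rough sketch, not part of reveal: it assumes GFA1 `S` and `L` lines, ignores segment orientation (it treats every link as forward-to-forward), and `is_acyclic` is a hypothetical helper name.

```python
from collections import defaultdict, deque

def is_acyclic(gfa_lines):
    """Return True if the links form a DAG (Kahn's algorithm, orientation ignored)."""
    successors = defaultdict(set)
    indegree = defaultdict(int)
    nodes = set()
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S":
            nodes.add(fields[1])
        elif fields[0] == "L":
            src, dst = fields[1], fields[3]
            nodes.update((src, dst))
            if dst not in successors[src]:  # count each distinct edge once
                successors[src].add(dst)
                indegree[dst] += 1
    # Repeatedly remove nodes that have no remaining incoming edges.
    queue = deque(n for n in nodes if indegree[n] == 0)
    removed = 0
    while queue:
        node = queue.popleft()
        removed += 1
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    # Any node left over sits on a cycle or downstream of one.
    return removed == len(nodes)

dag = ["S\t1\tA", "S\t2\tC", "L\t1\t+\t2\t+\t0M"]
looped = ["S\t1\tA", "L\t1\t+\t1\t+\t0M"]
dag_ok = is_acyclic(dag)
loop_ok = is_acyclic(looped)
```

A self-alignment of a repeat would show up as a self-loop or a back edge, which is exactly the case this check rejects.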

fbemm commented 7 years ago

I do understand the principle behind that approach. The downside is that you lose every element with a very recent duplication history. A bit of a bummer when you look at a typical eukaryotic genome, no?

ChriKub commented 7 years ago

Hi Jasper, thanks for the quick response. I get your idea, but as @fbemm mentioned, you will lose information on duplication events. Dealing with repetitive sequences is tricky, but the additional information that can be obtained is worthwhile (at least in my opinion).

Cheers, Chris

jasperlinthorst commented 7 years ago

Well, I don't think you lose any information. I think it depends on the biological question. If you multi-align 7 genomes, reveal will give you a graph that describes, in the form of bubbles, how those genomes differ from each other. For instance, your duplication event should pop up as an insertion bubble within the graph. With the sequence contained in these bubbles you can perform subsequent alignments to figure out how a set of paralogous genes differ from each other. Maybe we can think of a graph-based representation that tries to incorporate all this information, but I don't think that's an easy problem...
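
As a rough illustration of pulling out the sequence that sits in such bubbles, the sketch below lists segments that only some genomes traverse, together with their sequence, so they can be fed into a follow-up alignment. It is not reveal's own bubble detection: it assumes GFA1 `S` and `P` lines, and `variable_segments` is a hypothetical helper name.

```python
def variable_segments(gfa_lines):
    """Map each segment that only a subset of paths traverses to (carriers, sequence)."""
    sequences = {}
    paths = {}
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S":
            sequences[fields[1]] = fields[2]
        elif fields[0] == "P":
            # Keep only segment names; drop the trailing orientation sign.
            paths[fields[1]] = {step[:-1] for step in fields[2].split(",")}
    result = {}
    for segment, sequence in sequences.items():
        carriers = sorted(name for name, segs in paths.items() if segment in segs)
        if 0 < len(carriers) < len(paths):
            result[segment] = (carriers, sequence)
    return result

# Toy graph (hypothetical data): genomeB carries an extra segment 2 between 1 and 3.
example = [
    "S\t1\tACGT",
    "S\t2\tGGCC",
    "S\t3\tTTAA",
    "P\tgenomeA\t1+,3+\t*",
    "P\tgenomeB\t1+,2+,3+\t*",
]
bubbles = variable_segments(example)
```

A recent duplication would appear here as a segment carried by only some genomes whose sequence closely resembles another segment; comparing those sequences is the subsequent alignment step described above.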