jasperlinthorst / reveal

Graph based multi genome aligner
MIT License

applicability to large genomes #16

Open ekg opened 6 years ago

ekg commented 6 years ago

Is reveal capable of generating whole genome alignments for multi-gigabase genomes?

jasperlinthorst commented 6 years ago

Hi Erik, Short answer: Yes. Of course this depends a lot on the available amount of RAM, but I've been using REVEAL to build graphs for multiple human genome assemblies. I haven't really used it to do cross-species comparisons (other than Chimp vs Human), but as long as there's enough conserved sequence it should work.

In general I take the following approach (note the --64 between reveal and the subcommand for large assemblies):

This should result in a graph that encodes inversions/translocations/misassemblies, with two paths corresponding to the original and the 'reference' layout of the draft genome. This graph can then be (multi) aligned using:

Note that this constructs an index for all the input genomes/graphs (no pairwise approach), so this might become very big for very large genomes. However, it's probably more efficient to do this per chromosome anyway, so that's why I mostly split the graph per chromosome first:

And then use "reveal align" to multi-align the chromosomes separately.

Finally, depending on the use-case of the graph, I use:

To use multiple sequence alignment to realign bubbles or other parts of the graph where more 'variant resolution' is needed.

That said, REVEAL is a work in progress and can be improved in many ways. Feedback is always appreciated and I'm happy to help out wherever I can.

Best, Jasper

jasperlinthorst commented 6 years ago

I forgot to mention that, to construct graphs for multiple genomes, you can of course also align graphs in subsequent passes, without depending on the construction of an index for all genomes simultaneously. So you can also take the pairwise or iterative approach, if memory is the problem...
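
As a rough illustration of the iterative approach described here, the following Python sketch only plans the order of pairwise alignment passes; it does not call REVEAL. The function name `iterative_plan`, the input names, and the `merged_N` labels are all hypothetical, and the left-to-right order is an assumption (any merge order could be used).

```python
def iterative_plan(inputs):
    """Return the pairwise steps for a left-to-right iterative merge.

    Each step aligns the graph built so far with one more input, so only
    two inputs ever need to be indexed at the same time.
    """
    if len(inputs) < 2:
        return []
    steps = []
    current = inputs[0]
    for i, nxt in enumerate(inputs[1:], start=1):
        merged = f"merged_{i}"  # hypothetical label for the intermediate graph
        steps.append((current, nxt, merged))
        current = merged
    return steps


# Hypothetical input names, for illustration only.
plan = iterative_plan(["genomeA", "genomeB", "genomeC", "genomeD"])
for left, right, out in plan:
    print(f"align {left} + {right} -> {out}")
```

For N inputs this yields N-1 passes, each of which would correspond to one pairwise alignment between the current graph and the next genome or graph.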

ekg commented 6 years ago

Thank you for the detailed response. Do you have scripts that show exactly what steps you took for human?

I realize now that REVEAL does not generate cyclic graphs. These can be essential for encoding CNVs. Have you considered extensions that would enable this?

jasperlinthorst commented 6 years ago

Hi Erik, Sorry for my late reply, I had a small holiday. About the scripts, no, not really, but I'll try to set up a wiki to describe the intermediate steps for generating a multi-genome graph for some publicly available datasets shortly.

About cycles and CNVs, I guess it depends on the scale of things. I don't think that allowing/introducing cycles is always the best way to cope with CNVs like STRs and VNTRs, as in practice I tend to see that, especially in the case of VNTRs, the repeat pattern is not as exact as expected. In these cases a cyclic representation would actually mean a loss of information. However, I agree that with short reads and STRs this might be of added value, so maybe I could indeed add this as a sort of extension to existing graphs.
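
The point about inexact repeats can be shown with a toy Python sketch. This is not REVEAL code: the function `collapse_to_cycle` and both alleles are made up for illustration, and the collapsing rule (one node per distinct repeat unit, one edge per observed transition) is just one assumed way of building a cyclic representation.

```python
def collapse_to_cycle(units):
    """Collapse a tandem array into a cyclic graph.

    One node per distinct repeat unit, one edge per observed
    unit-to-unit transition (including self-loops).
    """
    nodes = set(units)
    edges = set(zip(units, units[1:]))
    return nodes, edges


# Two hypothetical VNTR alleles whose repeat unit is not exact:
# they differ in copy number and in where the variant unit occurs.
allele_a = ["ACGT", "ACGT", "ACTT", "ACGT", "ACGT"]
allele_b = ["ACGT", "ACTT", "ACGT", "ACGT", "ACTT", "ACGT", "ACGT"]

graph_a = collapse_to_cycle(allele_a)
graph_b = collapse_to_cycle(allele_b)

# Both alleles collapse to the same nodes and edges, so the graph alone
# no longer records how many copies there were or in which order.
print(graph_a == graph_b)
```

Under these assumptions the graph topology is identical for both alleles, so the copy number and the order of the variant units would have to be stored separately, for example as paths through the graph.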

REVEAL does generate cyclic graphs, but only to enable the representation of structural rearrangements like large inversions and translocations. This is done by the 'finish' subcommand (with the --outputgraph parameter).

Best, Jasper