vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.09k stars 193 forks source link

Linearization algorithm #414

Open Kitonick79 opened 8 years ago

Kitonick79 commented 8 years ago

Dmitrii Should we replace algorithm in vg code or implement it as standalone command line utility?

Erik Implementing solutions in vg is beneficial now, but we also want to support an ecosystem of tools that all read and write common formats. This makes maintenance easier and gives people more control over their code, which may be relevant for certain algorithms and methods. For instance, you may want to work with the graph using a novel disk-backed storage system such as GraphChi. If you do, the only thing we need in common is the data format. The vg schema is relatively stable, and in any case should remain backwards compatible into the forseeable future, so it might be the right time to start exploring this.

Dmitrii Would you recommend implementing it as standalone application? We are going to explore GraphChi and test its performance on large number of nodes.

ekg commented 8 years ago

If you manage to bridge between variation graphs and GraphChi, we could start to move other memory-intensive algorithms into it. That might be a very fruitful direction.

It is important to the project that we handle the same kind of graph in GraphChi as we can handle in vg. This would mean a graph that is bidirectional (in other words: it has a forward and reverse complement and four kinds of edges +/+, +/-, -/+, -/-) and is cyclic (not a DAG).

Do you think this could be done? I am not familiar with GraphChi and am curious what you are thinking to do.

ekg commented 8 years ago

After taking a look at GraphChi this seems like a very possible direction. I will attempt to implement a bridge today to read in vg format and run some basic algorithms on a few test cases using the provided examples in their repository. Algorithms like strongly connected component detection should be straightforward to apply.

Kitonick79 commented 8 years ago

Although we have no actual experience in the team with GraphChi. we are going to try VG with large number of nodes (1 billion possibly) and measure time and space used with and without GraphChi. We see the problem is in architecture of VG, which is a solid piece of software performing many tasks using plenty of algorithms. It is possibly that GraphChi will improve performance of reading and writing a graph, but we are not sure about alignment.

ekg commented 8 years ago

It will probably be faster to work with 1 billion nodes when they are in memory, but in VG doing this will require upwards of 500G of RAM (maybe up to 1T... VG is very memory inefficient). While we should focus some on ways to bring down VG's memory usage, it is not frequently that we need to load the entire graph into VG. That class is designed for small-scale operations on graphs of less than 100M bases. We can do a lot of things on DAGs on small local windows, so unless we are working with a complex assembly graph with a high degree of connectivity, we shouldn't need to load the entire graph into memory any more than we need to load the entire human reference genome when looking at a particular gene.

That said, there are cases where we want to do large scale operations on the graph. Sorting, canonicalization, and normalization algorithms seem essential in a workflow that utilizes population scale assembly graphs. These graphs will not be easy to subset in VG. However, we could use GraphChi as a base framework to iterate across them and edit them in an out of core fashion. This might mean writing nothing more than a converter that maps from .vg format into that used by GraphChi, and writing algorithms entirely using the API provided by GraphChi. Or, we could consider embedding VG (the class) inside of programs written in the GraphChi framework.

ekg commented 8 years ago

It is possibly that GraphChi will improve performance of reading and writing a graph, but we are not sure about alignment.

In alignment, we are using the GCSA2 index to find maximal exact matches, then using the XG index to get clusters of matches that are nearby, picking up the subgraph in the region of such matches, and aligning to it with the VG class. So, at no point in alignment do we need to work on a large portion of the graph and the current architecture is fine.

It's more for whole-graph processing that we'd probably want to think about backing things with GraphChi. There are a lot of "whole" graph operations which we might need to support in standard workflows, so the more I think about the problem with VG's memory architecture, the more I feel that backing such algorithms into GraphChi makes sense.

Kitonick79 commented 8 years ago

@ekg, We are planning to release the module with linearization algorithm on Tuesday or Wednesday. As I'm leaving the office for vacation untill 12 August, @mzueva, will release the code and release notes. Please, provide a guidance on the deployment procedure to the repo or fork.

ekg commented 8 years ago

@Kitonick79 the standard procedure is to fork this repository, make your changes on the fork, and use github's pull request feature to bring it back into the master of this repository. Initiating the pull request will run a test suite on the pre and post merge states of the repository. If these pass then we can safely merge it in.

Please do add some test(s) to demonstrate functionality. This could be as simple as running the module and ensuring it exits normally. Ideally, it would be something that verifies that the algorithm itself is working. Tests are called via scripts in: https://github.com/vgteam/vg/tree/master/test/t using the bash-tap library. You can also add a unit test by hooking into the unit testing pattern that @adamnovak has recently added: https://github.com/vgteam/vg/tree/master/src/unittest

mzueva commented 8 years ago

Hello, Erik @ekg! I'm preparing code for pull request now, making some refactoring and adding tests. I would like to check, do you have some certain requirements for the code style?