vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.09k stars 193 forks source link

version 2 goals #1952

Open ekg opened 5 years ago

ekg commented 5 years ago

Right now, vg is a strong reference implementation of basic ideas in graphical pangenomics, but its performance suffers in a number of contexts due to a number of design decisions that are hard to unwind. Although we should hope for the development of new methods that are optimized to particular use cases (mapping, for instance), it is in our interest to improve the current system to remove a number of remaining limitations on its use. vg will remain useful as a collection of algorithms applicable to variation graphs for some time.

In my opinion, the biggest problem we currently have is the memory usage of the VG class. This remains a serious bottleneck in any pipeline based around vg. This should have been resolved in the beginning, but poor design decisions on my part resulted in a very bloated system that uses far more memory than is needed to store an accessible, mutable graph. @adamnovak, @jeizenga and others have worked to build a generic interface that encodes all the access and mutation patterns which are important for working wit these graphs (the HandleGraph, handle.hpp), and the VG class is now compatible with this. However, a number of algorithms still depend on raw pointers into a backing (protobuf!) array storing the graph. Unfortunately, we will simply have to adjust all of these to the handle graph interface, which should require significant effort but will decouple the algorithms which we have spent so long developing from this particular graph storage implementation. This will give them a much longer life, and I think is worth doing for ourselves an the community.

How would this work proceed? First, we need a new dynamic graph storage engine to replace the six hash_maps in VG. Do others have any ideas about how to do this? I am interested in taking this on in the coming months, and I have some ideas that I will later describe here. I think it will be good to build this independently, and then pull it in. Perhaps this can be done with a single-header library, for a generic bioinformatic developer audience. Once this is built, we can start to lace it into the system, remove reference to Node and Edge, and be on our merry way. A wrinkle will be the handling of paths--- presumably these will need to be part of this implementation.

The other major issue blocking vg's use is runtime of common operations related to read mapping. I guess there is less to say about this other than that work on this should continue, improvements to xdrop alignment allowing pair rescue can help, and that we should consider introducing a minimizer-based index to optionally replace the GCSA2 index. The latter could be relatively easy to build, and we may be able to plug minimizer hits directly into the current mapping frameworks with minimal tweaking, as in the end the hits are exact matches. This may be less true for mpmap, but I am curious what @jeizenga thinks.

This thread is meant to help us establish a longer term plan for work on vg in the coming six months. At present it looks like many people are applying vg to a diverse set of topics. My hope is that their work can be undisturbed by great improvements in the runtime and memory bounds of the software.

It is certainly interesting to consider rewriting pieces of vg as independent modules. We also shouldn't be afraid of doing this. I think it will make for a healthier ecosystem if we can work around GFA and some kind of yet-undecided GAM-like format describing alignments. I do think that enough momentum has built up behind vg that it would be a mistake to abandon it in favor of this kind of approach. The mass of reference implementations of algorithms on variation graphs that it represents means that improving its performance is only a good thing. Both directions can be pursued in parallel.

CC @richarddurbin, @benedictpaten

adamnovak commented 5 years ago

This is an excellent writeup Erik!

In terms of breaking up vg, I wouldn't mind decomposing it a little to make it more like Git, where subcommands can be separate binaries, rather than having a single monolithic binary that does everything. It might make installation harder (although we probably could still build a monolithic static binary with everything in it), but it would speed up the build if we didn't have to write a 300 MB binary every time anything changed.

As for a new dynamic graph engine, I would like to see something that was based around memory-mapped file IO (or at least around random access, if we want to support HTTP), for O(1) open and close. I would also love to see it number every base, so splitting nodes no longer requires doing any translation of IDs.

I would also love to see an append-only way of representing a graph with modifications, on the theory that that would force modifications to somehow be efficient, but that's probably its own research project.

On 10/23/18, Erik Garrison notifications@github.com wrote:

Right now, vg is a strong reference implementation of basic ideas in graphical pangenomics, but its performance suffers in a number of contexts due to a number of design decisions that are hard to unwind. Although we should hope for the development of new methods that are optimized to particular use cases (mapping, for instance), it is in our interest to improve the current system to remove a number of remaining limitations on its use. vg will remain useful as a collection of algorithms applicable to variation graphs for some time.

In my opinion, the biggest problem we currently have is the memory usage of the VG class. This remains a serious bottleneck in any pipeline based around vg. This should have been resolved in the beginning, but poor design decisions on my part resulted in a very bloated system that uses far more memory than is needed to store an accessible, mutable graph. @adamnovak, @jeizenga and others have worked to build a generic interface that encodes all the access and mutation patterns which are important for working wit these graphs (the HandleGraph, handle.hpp), and the VG class is now compatible with this. However, a number of algorithms still depend on raw pointers into a backing (protobuf!) array storing the graph. Unfortunately, we will simply have to adjust all of these to the handle graph interface, which should require significant effort but will decouple the algorithms which we have spent so long developing from this particular graph storage implementation. This will give them a much longer life, and I think is worth doing for ourselves an the community.

How would this work proceed? First, we need a new dynamic graph storage engine to replace the six hash_maps in VG. Do others have any ideas about how to do this? I am interested in taking this on in the coming months, and I have some ideas that I will later describe here. I think it will be good to build this independently, and then pull it in. Perhaps this can be done with a single-header library, for a generic bioinformatic developer audience. Once this is built, we can start to lace it into the system, remove reference to Node and Edge, and be on our merry way. A wrinkle will be the handling of paths--- presumably these will need to be part of this implementation.

The other major issue blocking vg's use is runtime of common operations related to read mapping. I guess there is less to say about this other than that work on this should continue, improvements to xdrop alignment allowing pair rescue can help, and that we should consider introducing a minimizer-based index to optionally replace the GCSA2 index. The latter could be relatively easy to build, and we may be able to plug minimizer hits directly into the current mapping frameworks with minimal tweaking, as in the end the hits are exact matches. This may be less true for mpmap, but I am curious what @jeizenga thinks.

This thread is meant to help us establish a longer term plan for work on vg in the coming six months. At present it looks like many people are applying vg to a diverse set of topics. My hope is that their work can be undisturbed by great improvements in the runtime and memory bounds of the software.

It is certainly interesting to consider rewriting pieces of vg as independent modules. We also shouldn't be afraid of doing this. I think it will make for a healthier ecosystem if we can work around GFA and some kind of yet-undecided GAM-like format describing alignments. I do think that enough momentum has built up behind vg that it would be a mistake to abandon it in favor of this kind of approach. The mass of reference implementations of algorithms on variation graphs that it represents means that improving its performance is only a good thing. Both directions can be pursued in parallel.

CC @richarddurbin, @benedictpaten

-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/vgteam/vg/issues/1952

richarddurbin commented 5 years ago

Good to be thinking about these things, and writing down ideas.

I also like the idea of conceptually operating at the base level as the atomic unit. The problem as I understand it has been efficiency of implementation, but perhaps that can be left to the level below the interface. Can someone point me to the HandleGraph interface?

Richard

On 23 Oct 2018, at 17:18, Adam Novak notifications@github.com wrote:

This is an excellent writeup Erik!

In terms of breaking up vg, I wouldn't mind decomposing it a little to make it more like Git, where subcommands can be separate binaries, rather than having a single monolithic binary that does everything. It might make installation harder (although we probably could still build a monolithic static binary with everything in it), but it would speed up the build if we didn't have to write a 300 MB binary every time anything changed.

As for a new dynamic graph engine, I would like to see something that was based around memory-mapped file IO (or at least around random access, if we want to support HTTP), for O(1) open and close. I would also love to see it number every base, so splitting nodes no longer requires doing any translation of IDs.

I would also love to see an append-only way of representing a graph with modifications, on the theory that that would force modifications to somehow be efficient, but that's probably its own research project.

On 10/23/18, Erik Garrison notifications@github.com wrote:

Right now, vg is a strong reference implementation of basic ideas in graphical pangenomics, but its performance suffers in a number of contexts due to a number of design decisions that are hard to unwind. Although we should hope for the development of new methods that are optimized to particular use cases (mapping, for instance), it is in our interest to improve the current system to remove a number of remaining limitations on its use. vg will remain useful as a collection of algorithms applicable to variation graphs for some time.

In my opinion, the biggest problem we currently have is the memory usage of the VG class. This remains a serious bottleneck in any pipeline based around vg. This should have been resolved in the beginning, but poor design decisions on my part resulted in a very bloated system that uses far more memory than is needed to store an accessible, mutable graph. @adamnovak, @jeizenga and others have worked to build a generic interface that encodes all the access and mutation patterns which are important for working wit these graphs (the HandleGraph, handle.hpp), and the VG class is now compatible with this. However, a number of algorithms still depend on raw pointers into a backing (protobuf!) array storing the graph. Unfortunately, we will simply have to adjust all of these to the handle graph interface, which should require significant effort but will decouple the algorithms which we have spent so long developing from this particular graph storage implementation. This will give them a much longer life, and I think is worth doing for ourselves an the community.

How would this work proceed? First, we need a new dynamic graph storage engine to replace the six hash_maps in VG. Do others have any ideas about how to do this? I am interested in taking this on in the coming months, and I have some ideas that I will later describe here. I think it will be good to build this independently, and then pull it in. Perhaps this can be done with a single-header library, for a generic bioinformatic developer audience. Once this is built, we can start to lace it into the system, remove reference to Node and Edge, and be on our merry way. A wrinkle will be the handling of paths--- presumably these will need to be part of this implementation.

The other major issue blocking vg's use is runtime of common operations related to read mapping. I guess there is less to say about this other than that work on this should continue, improvements to xdrop alignment allowing pair rescue can help, and that we should consider introducing a minimizer-based index to optionally replace the GCSA2 index. The latter could be relatively easy to build, and we may be able to plug minimizer hits directly into the current mapping frameworks with minimal tweaking, as in the end the hits are exact matches. This may be less true for mpmap, but I am curious what @jeizenga thinks.

This thread is meant to help us establish a longer term plan for work on vg in the coming six months. At present it looks like many people are applying vg to a diverse set of topics. My hope is that their work can be undisturbed by great improvements in the runtime and memory bounds of the software.

It is certainly interesting to consider rewriting pieces of vg as independent modules. We also shouldn't be afraid of doing this. I think it will make for a healthier ecosystem if we can work around GFA and some kind of yet-undecided GAM-like format describing alignments. I do think that enough momentum has built up behind vg that it would be a mistake to abandon it in favor of this kind of approach. The mass of reference implementations of algorithms on variation graphs that it represents means that improving its performance is only a good thing. Both directions can be pursued in parallel.

CC @richarddurbin, @benedictpaten

-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/vgteam/vg/issues/1952 — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_vgteam_vg_issues_1952-23issuecomment-2D432313284&d=DwMFaQ&c=D7ByGjS34AllFgecYw0iC6Zq7qlm8uclZFI0SqQnqBo&r=tLrRui-nzNUodb5ShCAULA&m=tpVWM6yXWuuxGHIzuPCXfAHKE3tIXBdU48TDCIzv4Ag&s=BkeBckuVGsbo_U1H1CYeAG5YWf8XdmTRmT0sypTI3mE&e=, or mute the thread https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_ADRb5mM-5FS6F0Vb-2D6efWnKaS4XcTLCq8lks5un0FKgaJpZM4X1qxs&d=DwMFaQ&c=D7ByGjS34AllFgecYw0iC6Zq7qlm8uclZFI0SqQnqBo&r=tLrRui-nzNUodb5ShCAULA&m=tpVWM6yXWuuxGHIzuPCXfAHKE3tIXBdU48TDCIzv4Ag&s=shRlzNXKskSkU-EgTh9qBc8hQHOAUIm4Rk7upyi_CzU&e=.

-- The Wellcome Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

adamnovak commented 5 years ago

The HandleGraph documentation lives here:

http://vgteam.github.io/vg/classvg_1_1HandleGraph.html

On 10/23/18, Richard Durbin notifications@github.com wrote:

Good to be thinking about these things, and writing down ideas.

I also like the idea of conceptually operating at the base level as the atomic unit. The problem as I understand it has been efficiency of implementation, but perhaps that can be left to the level below the interface. Can someone point me to the HandleGraph interface?

Richard

On 23 Oct 2018, at 17:18, Adam Novak notifications@github.com wrote:

This is an excellent writeup Erik!

In terms of breaking up vg, I wouldn't mind decomposing it a little to make it more like Git, where subcommands can be separate binaries, rather than having a single monolithic binary that does everything. It might make installation harder (although we probably could still build a monolithic static binary with everything in it), but it would speed up the build if we didn't have to write a 300 MB binary every time anything changed.

As for a new dynamic graph engine, I would like to see something that was based around memory-mapped file IO (or at least around random access, if we want to support HTTP), for O(1) open and close. I would also love to see it number every base, so splitting nodes no longer requires doing any translation of IDs.

I would also love to see an append-only way of representing a graph with modifications, on the theory that that would force modifications to somehow be efficient, but that's probably its own research project.

On 10/23/18, Erik Garrison notifications@github.com wrote:

Right now, vg is a strong reference implementation of basic ideas in graphical pangenomics, but its performance suffers in a number of contexts due to a number of design decisions that are hard to unwind. Although we should hope for the development of new methods that are optimized to particular use cases (mapping, for instance), it is in our interest to improve the current system to remove a number of remaining limitations on its use. vg will remain useful as a collection of algorithms applicable to variation graphs for some time.

In my opinion, the biggest problem we currently have is the memory usage of the VG class. This remains a serious bottleneck in any pipeline based around vg. This should have been resolved in the beginning, but poor design decisions on my part resulted in a very bloated system that uses far more memory than is needed to store an accessible, mutable graph. @adamnovak, @jeizenga and others have worked to build a generic interface that encodes all the access and mutation patterns which are important for working wit these graphs (the HandleGraph, handle.hpp), and the VG class is now compatible with this. However, a number of algorithms still depend on raw pointers into a backing (protobuf!) array storing the graph. Unfortunately, we will simply have to adjust all of these to the handle graph interface, which should require significant effort but will decouple the algorithms which we have spent so long developing from this particular graph storage implementation. This will give them a much longer life, and I think is worth doing for ourselves an the community.

How would this work proceed? First, we need a new dynamic graph storage engine to replace the six hash_maps in VG. Do others have any ideas about how to do this? I am interested in taking this on in the coming months, and I have some ideas that I will later describe here. I think it will be good to build this independently, and then pull it in. Perhaps this can be done with a single-header library, for a generic bioinformatic developer audience. Once this is built, we can start to lace it into the system, remove reference to Node and Edge, and be on our merry way. A wrinkle will be the handling of paths--- presumably these will need to be part of this implementation.

The other major issue blocking vg's use is runtime of common operations related to read mapping. I guess there is less to say about this other than that work on this should continue, improvements to xdrop alignment allowing pair rescue can help, and that we should consider introducing a minimizer-based index to optionally replace the GCSA2 index. The latter could be relatively easy to build, and we may be able to plug minimizer hits directly into the current mapping frameworks with minimal tweaking, as in the end the hits are exact matches. This may be less true for mpmap, but I am curious what @jeizenga thinks.

This thread is meant to help us establish a longer term plan for work on vg in the coming six months. At present it looks like many people are applying vg to a diverse set of topics. My hope is that their work can be undisturbed by great improvements in the runtime and memory bounds of the software.

It is certainly interesting to consider rewriting pieces of vg as independent modules. We also shouldn't be afraid of doing this. I think it will make for a healthier ecosystem if we can work around GFA and some kind of yet-undecided GAM-like format describing alignments. I do think that enough momentum has built up behind vg that it would be a mistake to abandon it in favor of this kind of approach. The mass of reference implementations of algorithms on variation graphs that it represents means that improving its performance is only a good thing. Both directions can be pursued in parallel.

CC @richarddurbin, @benedictpaten

-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/vgteam/vg/issues/1952 — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_vgteam_vg_issues_1952-23issuecomment-2D432313284&d=DwMFaQ&c=D7ByGjS34AllFgecYw0iC6Zq7qlm8uclZFI0SqQnqBo&r=tLrRui-nzNUodb5ShCAULA&m=tpVWM6yXWuuxGHIzuPCXfAHKE3tIXBdU48TDCIzv4Ag&s=BkeBckuVGsbo_U1H1CYeAG5YWf8XdmTRmT0sypTI3mE&e=, or mute the thread https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_ADRb5mM-5FS6F0Vb-2D6efWnKaS4XcTLCq8lks5un0FKgaJpZM4X1qxs&d=DwMFaQ&c=D7ByGjS34AllFgecYw0iC6Zq7qlm8uclZFI0SqQnqBo&r=tLrRui-nzNUodb5ShCAULA&m=tpVWM6yXWuuxGHIzuPCXfAHKE3tIXBdU48TDCIzv4Ag&s=shRlzNXKskSkU-EgTh9qBc8hQHOAUIm4Rk7upyi_CzU&e=.

-- The Wellcome Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/vgteam/vg/issues/1952#issuecomment-432465532

alancleary commented 5 years ago

Hey there. Just thought I would chip in my $0.02

It is certainly interesting to consider rewriting pieces of vg as independent modules

Indeed it is. This could greatly improve building against specific vg functionality, such as the protobuf (de)serialization code. In other words, if you insist on using your own protobuf (de)serialization scheme, then it aught to be easily pluggable into other, non-C(++) contexts ;)