vgteam / libbdsg

Optimized sequence graph implementations for graph genomics
MIT License
30 stars 6 forks source link

Add overlay that accelerates path lookup on nodes #169

Closed adamnovak closed 1 year ago

adamnovak commented 1 year ago

This adds a PackedReferencePathOverlay which, in addition to what the PackedPositionOverlay does, speeds up for_each_step_on_handle() by keeping a mapping from node ID to step rank in each path's assigned PackedPositionOverlay index object.

To actually do the query, we still have to poll all the indexes which got generated in parallel; we're not creating one big mapping from node ID to relevant index. I'd like to avoid having to do any single-threaded work like that over the whole graph, because that sort of thing was too slow and we had to move to index the different paths in parallel.

I'm doing some profiling to see if this is good enough or if I really do want to coalesce the lookup tables or add another global lookup table.

adamnovak commented 1 year ago

This will fix #168

adamnovak commented 1 year ago

I did some Callgrind profiling of vg deconstruct with https://github.com/vgteam/vg/pull/3774 and it looks like now for_each_step_on_handle_impl() is about 0.03% of the actual deconstruction time. By contrast, 42% of the time is going into get_path_handle_of_step() because that has to do a locate, and 37% of time is going into locates under GBWTTraversalFinder::find_gbwt_traversals(). We're also spending like 15% of our time doing zstd stuff for no reason I can see; I think vcflib is using it to compress stuff in memory?

Anyway, I think my polling design is fine, and if we want to speed things up further we'd want to accelerate get_path_handle_of_step() somehow.

I'll go ahead and clean this up and merge this in, and then we can see if it's worth it to try and speed up step-to-path queries.

glennhickey commented 1 year ago

Awesome! Once it's merged, I'll test it on a more substantial graph.