vgteam / libhandlegraph

Library for the Handle Graph abstraction
MIT License
21 stars 5 forks source link

External iterators: for language bindings #33

Open JervenBolleman opened 5 years ago

JervenBolleman commented 5 years ago

The current handlegraph api, uses the internal iterator approach (e.g. for_each_edge etc.) This is very inconvenient when trying to use this code from python. The only way to move data from inside the iteratee to the outside is by running the iteration in a different thread than the major python code and pass out the required information using a shared blocking queue. As you can imagine this is a pain to program.

The python approach, like modern java is to use generators/streams instead. This works well in these languages because memory ownership is well defined. For C++ I imagine this approach was taken to make it easier to keep track of who owns what memory. Unfortunately, the push approach inherent in this solution is a pain to use from languages which like to use a pull model.

Could we have standard C++ iterators for each iteratee method?

subwaystation commented 5 years ago

+1

JervenBolleman commented 5 years ago

e.g. in SpOdgi we do the following

def handles(self):
        nodeId = self.odgi.min_node_id()

        maxNodeId = self.odgi.max_node_id()
        while (nodeId <= maxNodeId):
            if(self.odgi.has_node(nodeId)):
                nodeId=nodeId+1 
                yield self.odgi.get_handle(nodeId-1)

Which is the only way to iterate over all nodes in lazy way. But less than ideal performance wise.

ekg commented 5 years ago

There is for_each_handle that does the same thing. Is that not working?

On Thu, Nov 21, 2019, 18:59 JervenBolleman notifications@github.com wrote:

e.g. in SpOdgi we do the following

def handles(self): nodeId = self.odgi.min_node_id()

    maxNodeId = self.odgi.max_node_id()
    while (nodeId <= maxNodeId):
        if(self.odgi.has_node(nodeId)):
            nodeId=nodeId+1
            yield self.odgi.get_handle(nodeId-1)

Which is the only way to iterate over all nodes in lazy way. But less than ideal performance wise.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/vgteam/libhandlegraph/issues/33?email_source=notifications&email_token=AABDQEPMWJ5WWIVXTP4EG4TQU3D7JA5CNFSM4JGEY6WKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEE3DV5I#issuecomment-557202165, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDQEIFRUK5CBW72HTVK6TQU3D7JANCNFSM4JGEY6WA .

JervenBolleman commented 5 years ago

@ekg no because you can't jump yield from within the lamda generator style, at least not in python.

So you end up having to either collect all the nodes so you can iterate over them python style at humongous memory costs. Or one has a silly iterator implementation like above.

JervenBolleman commented 2 years ago

So I came by today to open this same issue again :) it would be really nice to have std::iterator for edges,handles and paths. I think these iterators can be basic forward_only types

jeizenga commented 2 years ago

Would it be sufficient if we implemented something like the scan_path method for edges?

https://github.com/vgteam/libhandlegraph/blob/master/src/include/handlegraph/path_handle_graph.hpp#L161-L164

JervenBolleman commented 2 years ago

Yes, I really think so :100:

On Thu, Jun 23, 2022, 18:04 Jordan Eizenga @.***> wrote:

Would it be sufficient if we implemented something like the scan_path method for edges?

https://github.com/vgteam/libhandlegraph/blob/master/src/include/handlegraph/path_handle_graph.hpp#L161-L164

— Reply to this email directly, view it on GitHub https://github.com/vgteam/libhandlegraph/issues/33#issuecomment-1164600552, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAHQYFO4YQYKIY4ZTGQBEKDVQSDKRANCNFSM4JGEY6WA . You are receiving this because you authored the thread.Message ID: @.***>

jeizenga commented 2 years ago

It's a tricky thing, because we haven't exposed an object that has knowledge about where it is in the container the way an iterator does (which the exception of step_handle_t). One easy option would be to copy the handles into a vector and then iterate over that. I think the biggest downside is that it would be pretty memory inefficient for any for-each loop that looped over a sizeable fraction of the graph (e.g. for_each_handle(), which does all nodes).

Thoughts?

JervenBolleman commented 2 years ago

I know. Copying and then iterating is not an option beyond test cases. I really need something new here, sorry. I have one more horrid alternative and that is to spin up a thread and use a small shared queue to load items in a buffer and iterate over that. Still we can't use raii there and will need to free the copied objects.

glennhickey commented 2 years ago

In the case of handles (and other stuff, probably), xg and PackedGraph store them in arrays. So a get_ith_handle() type method would be possible and efficienit (as would, by extension, supporting iterators that count i internally). I don't think you could get random access from HashGraph without some re-engineering, but I guess it could in theory expose its STL iterator (or some kind of wrapper for it).

JervenBolleman commented 2 years ago

I just need a STL style iterator, no need for fast random access.

jeizenga commented 2 years ago

I made a quick prototype of what a hack-y vector-based implementation could look like. Again, I have some concerns about efficiency for iterating over the entire graph, but this would probably work fine for traversing edges:

https://github.com/vgteam/libhandlegraph/blob/for-each/src/include/handlegraph/handle_graph.hpp#L256-L322