Open JervenBolleman opened 5 years ago
+1
e.g. in SpOdgi we do the following
def handles(self):
nodeId = self.odgi.min_node_id()
maxNodeId = self.odgi.max_node_id()
while (nodeId <= maxNodeId):
if(self.odgi.has_node(nodeId)):
nodeId=nodeId+1
yield self.odgi.get_handle(nodeId-1)
Which is the only way to iterate over all nodes in lazy way. But less than ideal performance wise.
There is for_each_handle that does the same thing. Is that not working?
On Thu, Nov 21, 2019, 18:59 JervenBolleman notifications@github.com wrote:
e.g. in SpOdgi we do the following
def handles(self): nodeId = self.odgi.min_node_id()
maxNodeId = self.odgi.max_node_id() while (nodeId <= maxNodeId): if(self.odgi.has_node(nodeId)): nodeId=nodeId+1 yield self.odgi.get_handle(nodeId-1)
Which is the only way to iterate over all nodes in lazy way. But less than ideal performance wise.
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/vgteam/libhandlegraph/issues/33?email_source=notifications&email_token=AABDQEPMWJ5WWIVXTP4EG4TQU3D7JA5CNFSM4JGEY6WKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEE3DV5I#issuecomment-557202165, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDQEIFRUK5CBW72HTVK6TQU3D7JANCNFSM4JGEY6WA .
@ekg no because you can't jump yield
from within the lamda generator style, at least not in python.
So you end up having to either collect all the nodes so you can iterate over them python style at humongous memory costs. Or one has a silly iterator implementation like above.
So I came by today to open this same issue again :) it would be really nice to have std::iterator for edges,handles and paths. I think these iterators can be basic forward_only types
Would it be sufficient if we implemented something like the scan_path
method for edges?
Yes, I really think so :100:
On Thu, Jun 23, 2022, 18:04 Jordan Eizenga @.***> wrote:
Would it be sufficient if we implemented something like the scan_path method for edges?
— Reply to this email directly, view it on GitHub https://github.com/vgteam/libhandlegraph/issues/33#issuecomment-1164600552, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAHQYFO4YQYKIY4ZTGQBEKDVQSDKRANCNFSM4JGEY6WA . You are receiving this because you authored the thread.Message ID: @.***>
It's a tricky thing, because we haven't exposed an object that has knowledge about where it is in the container the way an iterator does (which the exception of step_handle_t
). One easy option would be to copy the handles into a vector
and then iterate over that. I think the biggest downside is that it would be pretty memory inefficient for any for-each loop that looped over a sizeable fraction of the graph (e.g. for_each_handle()
, which does all nodes).
Thoughts?
I know. Copying and then iterating is not an option beyond test cases. I really need something new here, sorry. I have one more horrid alternative and that is to spin up a thread and use a small shared queue to load items in a buffer and iterate over that. Still we can't use raii there and will need to free the copied objects.
In the case of handles (and other stuff, probably), xg and PackedGraph store them in arrays. So a get_ith_handle() type method would be possible and efficienit (as would, by extension, supporting iterators that count i internally). I don't think you could get random access from HashGraph without some re-engineering, but I guess it could in theory expose its STL iterator (or some kind of wrapper for it).
I just need a STL style iterator, no need for fast random access.
I made a quick prototype of what a hack-y vector-based implementation could look like. Again, I have some concerns about efficiency for iterating over the entire graph, but this would probably work fine for traversing edges:
The current handlegraph api, uses the internal iterator approach (e.g.
for_each_edge
etc.) This is very inconvenient when trying to use this code from python. The only way to move data from inside the iteratee to the outside is by running the iteration in a different thread than the major python code and pass out the required information using a shared blocking queue. As you can imagine this is a pain to program.The python approach, like modern java is to use generators/streams instead. This works well in these languages because memory ownership is well defined. For C++ I imagine this approach was taken to make it easier to keep track of who owns what memory. Unfortunately, the push approach inherent in this solution is a pain to use from languages which like to use a pull model.
Could we have standard C++ iterators for each iteratee method?