ipfs / kubo

An IPFS implementation in Go
https://docs.ipfs.tech/how-to/command-line-quick-start/
Other
16.01k stars 3k forks source link

Fast (parallel) Traversal For A DAG That Stops At Arbitrary Points #5487

Open hannahhoward opened 5 years ago

hannahhoward commented 5 years ago

Summary

Given a root CID, I would like to traverse that node's DAG quickly (making requests for children in parallel), and stop traversal of links on a per node basis using logic I provide.

Use Case

The time to list large sharded directories is prohibitive because link traversal in the DAG is serialized. We need to speed this up. (#4908 )

Requirements / Acceptance Criteria

Requirements:

First Steps

Additional Optimizations

Not Included

schomatis commented 5 years ago

Given a root CID, I would like to traverse that node's DAG quickly (making requests for children in parallel), and stop traversal of links on a per node basis using logic I provide.

  • Leave what to do with each node up to caller (VisitNode)

Get nodes in parallel (GetMany)

Hey @hannahhoward, I'm currently working on a PR (https://github.com/ipfs/go-ipld-format/pull/39) that has many similarities with the requirements you're describing but it still has a long way to go to meet all of them. It's basically a structure to walk DAGs providing a Visitor function to let the user lay some logic on top of it, it was born out of some code extracted from the UnixFS DAG reader.

If you have the time it would be great if you could take a look at it and tell me what you think it. I'm always interested in finding use cases besides the original motivation for it that was decoupling some of the DAG reader logic (https://github.com/ipfs/go-unixfs/pull/12/files), since I would like to arrive at an API that may be useful in more cases than just that one.

schomatis commented 5 years ago

Also, welcomed to the team! :tada:

hannahhoward commented 5 years ago

@schomatis thank you for the pointer -- I need to review it more in full and will give more feedback shortly. Have you also seen the code here: https://github.com/ipfs/go-merkledag/blob/master/merkledag.go#L363

This code specifically deals with (abstract? - can't tell) DAG traversal but seems to have good parallelism -- though I think without the level of control over how traversal is done in your code. I'm trying to figure out what needs to be used where.

schomatis commented 5 years ago

Yes, EnumerateChildrenAsyncDepth has the distinct advantage that it doesn't traverse in order, but that could be added to the DAG walker, actually I wanted to add something like this to replace the way gx traverses a dependency graph to download the packages, since Gx dependencies are basically a DAG of packages (that could be abstracted as NavigableNode).

Anyway, there are many parts in the code base where similar DAG traversing logic is taking place and the idea would be to abstract them in a single API (where reasonable) to simplify the code.

hannahhoward commented 5 years ago

Yea I need to leave a comment on your PR -- cause I've review the code and it's really awesome. @eingenito + @michaelavila think so too. For the in-order case, and also for general tree walking it's great.

The big challenge here is we're just trying to get nodes as fast as possible in the use case of LS -- which requires making requests for a nodes links as soon as we have them and in parallel as much as possible -- effectively we want to be walking the tree in multiple directions in an effort to discover as many blocks to fetch as we can as quickly as possible. And then if order becomes a requirement, deal with that through sorting later. Anyway, I will put a comment on your PR and I think there is still potential to incorporate it for this use case in the future.

schomatis commented 5 years ago

Yes, agreed, actually @hsanjuan had already proposed something like this in https://github.com/ipfs/go-ipfs/pull/5257#issuecomment-414396172, I'll open an issue about it.

Stebalien commented 5 years ago

@hannahhoward this is moot now, right?