ipfs / kubo

An IPFS implementation in Go
https://docs.ipfs.tech/how-to/command-line-quick-start/

RFC: GraphSync integration #6208

Closed · hannahhoward closed this issue 11 months ago

hannahhoward commented 5 years ago

Context

What is GraphSync?

GraphSync is a protocol to synchronize IPLD graphs across peers. The full specification can be found here: https://github.com/ipld/specs/blob/master/graphsync/graphsync.md

Where Bitswap deals with requesting individual blocks from other peers, GraphSync assumes that the data you're interested in is a node or nodes in an IPLD graph, and it allows you to ask remote peers to return the results of querying that graph with an IPLD selector.

There are potentially several use cases for requesting data this way, but perhaps the simplest to understand is fetching a single node at a deeply nested path. Using only Bitswap, getting to a deeply nested path means several roundtrips: request a block, receive it, follow a link, request the next block, and repeat until you reach the final link in the path. With GraphSync, you can send a remote node a query containing a path selector, have that node perform the traversal of the path locally, and then receive all the blocks in the path back at once, in a single roundtrip (it has to send all the blocks in the path so you're able to verify the results locally, starting from the block you already have).
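For a concrete picture, below is a rough sketch of what such a path selector looks like when built with go-ipld-prime's selector builder (import paths are from a recent go-ipld-prime and may differ from the version go-graphsync pins; this is illustrative, not the exact go-graphsync integration surface):

```go
package main

import (
	"github.com/ipld/go-ipld-prime/node/basicnode"
	builder "github.com/ipld/go-ipld-prime/traversal/selector/builder"
)

func main() {
	// Build a selector equivalent to walking the path "a/b/c" and matching
	// only the node at the end. A responder executing this selector sends
	// back every block it loads along the way, which is what lets the
	// requester verify the whole path starting from the root it already has.
	ssb := builder.NewSelectorSpecBuilder(basicnode.Prototype.Any)
	sel := ssb.ExploreFields(func(efsb builder.ExploreFieldsSpecBuilder) {
		efsb.Insert("a", ssb.ExploreFields(func(efsb builder.ExploreFieldsSpecBuilder) {
			efsb.Insert("b", ssb.ExploreFields(func(efsb builder.ExploreFieldsSpecBuilder) {
				efsb.Insert("c", ssb.Matcher())
			}))
		}))
	}).Node()

	_ = sel // the selector is itself just an IPLD node, sent along with the request
}
```

Because the selector is plain IPLD data, the same structure can be serialized and shipped to the remote peer as part of the GraphSync request.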

Status of go-graphsync

go-graphsync is the initial implementation of GraphSync in Go. It is nearing alpha 'feature complete' status as of April 2019 and should be ready for use around the beginning of May. go-graphsync was initially written to support use cases in Filecoin; Filecoin's integration, however, is not likely to begin until circa Q3 2019.

Important caveat: the initial implementation of go-graphsync is entirely single-peer-to-single-peer -- a GraphSync request is made directly to only one peer at a time. It assumes that you've already found providers for the graph and that the peer you are requesting from has the data you want; querying multiple peers is something you would have to build outside of go-graphsync.

Status of go-ipld-prime

go-ipld-prime is a complete rewrite of the go implementation of the IPLD specification that the IPLD team has been working on for some time. It is relevant to GraphSync because:

  1. go-ipld-prime is the only implementation of IPLD in go that supports IPLD selectors
  2. go-graphsync therefore relies on go-ipld-prime in its implementation, and it assumes the underlying node format of the DAG is one compatible with go-ipld-prime (relevant in particular because go-ipld-prime does not currently support nodes encoded in protobuf)

go-ipld-prime is a very different-looking library from go-ipld-format, and switching all or parts of IPFS to use it is a potentially large task.

IPFS Use Cases For Graphsync

This RFC is, in part, an attempt to identify potential use cases for GraphSync.

The most obvious use case for GraphSync is working with UnixFS directories. GraphSync provides an efficient method for requesting a deeply nested path in a UnixFS directory. With augmentation, it might also provide a more efficient way to transfer entire UnixFS directories, without needing as many roundtrips to traverse the directory and request further nodes.

There are other potential use cases anywhere the data IPFS works with is structured as a DAG. We can use discussion in this issue to identify some of them.

Integration Path, Questions, Challenges

UnixFS & GraphSync

As stated before, GraphSync relies on go-ipld-prime, which currently does not support nodes that use an internal protobuf serialization format. UnixFS, at least in its v1 implementation, uses protobufs for serializing nodes.

Therefore, to integrate GraphSync with UnixFS, we would either need to augment go-ipld-prime with protobuf support OR first complete UnixFS v2, which is intended to be based on go-ipld-prime directly. Given that supporting protobufs in go-ipld-prime is non-trivial and potentially quite challenging, and moreover that having UnixFS v2 complete would unlock a number of other potential features, it seems much easier to simply prioritize UnixFS v2 and wait for its completion before integrating GraphSync.

Independent CoreAPI GraphSync

While not as immediately useful to users of IPFS, most importantly package managers, it would be useful for real-world testing to be able to make GraphSync queries on the real IPFS network from the IPFS CoreAPI or the command line. Enabling GraphSync in the CoreAPI would also let people writing applications on top of IPFS experiment with how GraphSync might unlock new types of uses for IPFS.

The path to CoreAPI GraphSync integration is potentially much shorter than integration in UnixFS -- it simply requires agreeing on a specification for what the CoreAPI function signatures would look like, and then implementing them. There are no obvious blockers to proceeding with CoreAPI integration once go-graphsync is feature complete.
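To make that concrete, here is a purely hypothetical strawman of what such a signature could look like; none of these names exist in the CoreAPI today, and the channel-based shape simply mirrors how go-graphsync reports progress and errors:

```go
// Hypothetical sketch only: a possible CoreAPI surface for GraphSync-backed
// selector queries. None of these names exist in the current CoreAPI.
package iface

import (
	"context"

	cid "github.com/ipfs/go-cid"
	ipld "github.com/ipld/go-ipld-prime"
	peer "github.com/libp2p/go-libp2p-core/peer"
)

// GraphAPI is a hypothetical extension sitting alongside the existing Dag API.
type GraphAPI interface {
	// Select asks peer p to traverse the DAG rooted at root using the
	// supplied IPLD selector, verifies the returned blocks locally, and
	// streams back each node visited by the traversal.
	Select(ctx context.Context, p peer.ID, root cid.Cid, selector ipld.Node) (<-chan ipld.Node, <-chan error)
}
```

Whether the peer is an explicit argument or discovered behind the API is exactly the kind of detail the specification would need to pin down.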

Future Integrations / Supporting Providers/DHT Work

The ability to query into a graph from a root node might pair well with efforts to experiment with different strategies for providing only some nodes in a DAG. However, this probably will need to wait for further work on provider strategies.

Stebalien commented 5 years ago

> Therefore, to integrate GraphSync with UnixFS, we would either need to augment go-ipld-prime with protobuf support OR first complete UnixFS v2, which is intended to be based on go-ipld-prime directly. Given that supporting protobufs in go-ipld-prime is non-trivial and potentially quite challenging, and moreover that having UnixFS v2 complete would unlock a number of other potential features, it seems much easier to simply prioritize UnixFS v2 and wait for its completion before integrating GraphSync.

Supporting DagPB (protobuf nodes) should actually be pretty easy, unless I'm missing something. The real issue here is that UnixFSv1 stores another protobuf inside this DagPB node. From the perspective of IPLD, this inner protobuf is "just bytes". That means selectors aren't going to understand it in any meaningful way.

A good first approach may be to integrate go-ipld-prime into go-ipfs first (no selectors). That should make integrating go-graphsync easier.

Regardless, I'd like to prioritize UnixFSv2. Also, take a look at https://github.com/ipfs/unixfs-v2/issues/24.

Stebalien commented 5 years ago

> The path to CoreAPI GraphSync integration is potentially much shorter than integration in UnixFS -- it simply requires agreeing on a specification for what the CoreAPI function signatures would look like, and then implementing them. There are no obvious blockers to proceeding with CoreAPI integration once go-graphsync is feature complete.

This sounds like an improved dag API, right? ipfs dag select .... Or were you thinking about something else?

hannahhoward commented 5 years ago

@Stebalien yes, the DAG API you mention above is roughly what I had in mind.

hannahhoward commented 5 years ago

And the value of an ipfs dag select is that it allows us to potentially integrate GraphSync before UnixFS v2 / go-ipld-prime, which is a giant task.

musalbas commented 3 years ago

I was wondering if there has been any progress on this issue, and if there are any plans to integrate GraphSync into IPFS as a replacement to Bitswap.

We are using IPFS for block distribution, and in our protocol we use "data availability proofs" where nodes only need to download chunks of blocks and their associated Merkle proofs: https://github.com/lazyledger/lazyledger-core/issues/35

As each node in the proof requires a roundtrip to traverse, Bitswap's latency is high for this. We have done some experiments and found that the average latency to do a data availability check on a block is about 2 seconds: https://github.com/lazyledger/ipld-plugin-experiments/pull/9

Wondertan commented 3 years ago

@musalbas, the latency you see is due to the nature of the NMT, which is a binary tree rather than a wide DAG. That is, IPFS fetches only 2 links per roundtrip with the NMT, while an IPFS DAG node currently maxes out at 174 links, which makes it somewhat usable. Roughly speaking, reaching one of 65,536 leaves takes on the order of 16 roundtrips in a binary tree but only about 3 in a 174-ary DAG.

In LL's case, using GraphSync (GS) is crucial, and luckily it already exists within IPFS. Obviously, though, that is not a full integration: a migration to ipld-prime in IPFS is required first in order to operate over GS, as GS relies on the new IPLD and specifically its featureful selectors. Therefore, LL's IPLD plugin has to migrate as well to support GS and, importantly, selectors, e.g. for the ability to fetch the whole batch of data related to some namespace from Storage nodes without any roundtrips at all. Thus, I suggest you first redesign the current NMT integration around the new IPLD and then use GS directly, even without waiting for it to be fully integrated into IPFS, as Lotus currently does. Thanks, PL team, for the modular nature of their protocols :)

cc @liamsi

musalbas commented 3 years ago

Does the current support make use of IPFS peer discovery or do we have to manually input nodes to get them to exchange data over GraphSync? Note that "storage nodes" in the LazyLedger paper is a redundant concept and should actually be replaced with "IPFS", so that data is downloaded via IPFS instead of a specific storage node. IPFS itself is treated as a black-box "storage node".

Wondertan commented 3 years ago

@musalbas, correct -- GS does not have any PeerRouting capabilities, so discovery has to be worked out on top of GS, but that should be relatively easy for basic cases by using the IPFS DHT independently.
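For illustration, a minimal sketch of that layering, assuming go-graphsync's GraphExchange.Request API and a libp2p content router such as the IPFS DHT (provider ranking, retries, and most error handling are omitted; import paths may differ by version):

```go
package graphsyncfetch

import (
	"context"
	"fmt"

	cid "github.com/ipfs/go-cid"
	graphsync "github.com/ipfs/go-graphsync"
	ipld "github.com/ipld/go-ipld-prime"
	cidlink "github.com/ipld/go-ipld-prime/linking/cid"
	"github.com/libp2p/go-libp2p-core/routing"
)

// fetchViaGraphsync finds a provider for root via a content router (e.g. the
// IPFS DHT) and issues a GraphSync request for the given selector to the
// first provider found. gs and sel are assumed to be constructed elsewhere.
func fetchViaGraphsync(ctx context.Context, gs graphsync.GraphExchange, router routing.ContentRouting, root cid.Cid, sel ipld.Node) error {
	for prov := range router.FindProvidersAsync(ctx, root, 1) {
		responses, errs := gs.Request(ctx, prov.ID, cidlink.Link{Cid: root}, sel)
		for range responses {
			// each ResponseProgress is one node visited by the remote traversal,
			// re-verified locally as the blocks arrive
		}
		for err := range errs {
			return err
		}
		return nil
	}
	return fmt.Errorf("no providers found for %s", root)
}
```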

BTW, what is your plan for incentivizing IPFS (Storage) nodes? I guess you want to make light clients (application users) only share data with each other then, but that is not reliable, and having a Storage Node role in the high-level economics is necessary for data availability. If you don't have any plan yet, I can share lots of ideas to fill the gap :)

willscott commented 3 years ago

@musalbas There is some queued up work for bitswap/graphsync in IPFS that should help 😄