
GraphSync: Data Deduplication Across Simultaneous Queries #120

Closed aschmahmann closed 5 years ago

aschmahmann commented 5 years ago

Overview

There are a few important cases where it would be very useful to have multiple queries that access similar data be bunched together. These can be handled either at the Selectors layer or at the GraphSync layer. Given both my interest in data deduplication (as opposed to query traversal) and the concerns with CID selectors raised in #116, I propose handling this at the GraphSync layer.

Useful Examples

To understand the utility of handling these simultaneous queries together, let's look at two examples:

Resuming Interrupted Queries

We ask for the full graph starting at Root:

       Root
   /    |    \ 
  A0   B0   C0
  |     |     |
  A1   ...   ...
  |
  A2 

We then get interrupted in our DFS resolution of the graph and only receive nodes Root, A0, A1. If we want to resume our session, whether with the same peer or a different one, we would like to not receive repeats of the nodes we already have. There are a variety of ways to describe how to resume our sync; they are all of the form: I want the graph rooted at Root and I

  1. Have the first 3 nodes of the query
  2. Have nodes Root, A0, A1 (or some condensed form like A1 and all parents)
  3. Want to start with A2, B0, C0

While option 1 is the most efficient, it requires both peers doing the traversals in the same order, which we might not be prepared to guarantee (something I think has already been looked into and that I'm probably just lacking context for). Options 2 and 3 are similar, as they are essentially inverses of each other.

Option 3 would be greatly helped by allowing data deduplication across queries/queries being grouped together.
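
To make the options concrete, here is a minimal Go sketch of how options 2 and 3 might be expressed as request metadata. The ResumeRequest shape and its field names are hypothetical, not part of any existing GraphSync message format, and CIDs are plain strings to keep the sketch self-contained.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ResumeRequest is a hypothetical wire shape for resuming an interrupted
// query; CIDs are plain strings here to keep the sketch self-contained.
type ResumeRequest struct {
	Root    string   `json:"root"`
	Have    []string `json:"have,omitempty"`    // option 2: nodes already held
	StartAt []string `json:"startAt,omitempty"` // option 3: frontier to resume from
}

func main() {
	option2 := ResumeRequest{Root: "Root", Have: []string{"Root", "A0", "A1"}}
	option3 := ResumeRequest{Root: "Root", StartAt: []string{"A2", "B0", "C0"}}
	for _, r := range []ResumeRequest{option2, option3} {
		b, _ := json.Marshal(r)
		fmt.Println(string(b))
	}
}
```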

Versioned Data (Git, Operational Transform, CRDTs, ....)

We start with a graph G at state G0 and then create two modified versions of the graph, G1 and G2, both of which reference the previous state G0. We then give a friend the CIDs of the latest versions of G (which, as far as we know, are G1 and G2). The friend then uses GraphSync to ask us (or some other peer) for the full graphs under G1 and G2. We would like the data under G0 to not be sent more than once.

G1      G2
  \    /
    G0

How we might handle this

Fundamentally what's being asked for is that GraphSync retain the list of sent blocks per peer longer. The balancing act required in the solution is how we support this while not allowing a single peer to eat up all of a GraphSync responder's resources.

Send Bundled Requests

Allow sending multiple requests together such that GraphSync knows not to GC its per-peer list of sent blocks. The protocol already allows for sending multiple requests together, so we could easily support this by either A) holding off on the GC until the full multi-request is complete, or B) adding a flag saying that for this particular multi-request we should hold off on the GC until the multi-request is complete.
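
A minimal sketch of what option B might look like on the wire, assuming a hypothetical MultiRequest envelope and HoldSentBlockList flag (neither is part of the current GraphSync protocol):

```go
package main

import "fmt"

// Request stands in for a single GraphSync request; the selector is left as
// an opaque string for this sketch.
type Request struct {
	ID       int
	Selector string
}

// MultiRequest is a hypothetical bundled request. HoldSentBlockList is the
// flag from option B above: the responder keeps its per-peer list of sent
// blocks until every request in the bundle has completed.
type MultiRequest struct {
	Requests          []Request
	HoldSentBlockList bool
}

func main() {
	bundle := MultiRequest{
		Requests: []Request{
			{ID: 1, Selector: "RecursiveSelect(A to A0)"},
			{ID: 2, Selector: "RecursiveSelect(B to B0)"},
			{ID: 3, Selector: "RecursiveSelect(C to C0)"},
		},
		HoldSentBlockList: true, // don't GC the sent-block list mid-bundle
	}
	fmt.Printf("%+v\n", bundle)
}
```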

Allow Append Requests

All subrequests are still processed individually, but each request can have a field tying it to other ongoing requests for GC purposes.
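
A similarly hedged sketch of the append-style request, with a hypothetical AttachTo field naming the ongoing request whose sent-block list should be kept alive:

```go
package main

import "fmt"

// AppendableRequest is a hypothetical request shape for the "Allow Append
// Requests" option: AttachTo names an ongoing request ID whose sent-block
// list should not be GC'd while this request runs.
type AppendableRequest struct {
	ID       string
	Selector string
	AttachTo string // empty means a standalone request
}

func main() {
	qa := AppendableRequest{ID: "QA", Selector: "RecursiveSelect(A to A0)"}
	qb := AppendableRequest{ID: "QB", Selector: "RecursiveSelect(B to B0)", AttachTo: "QA"}
	fmt.Printf("%+v\n%+v\n", qa, qb)
}
```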

Stebalien commented 5 years ago

Make sure to look at https://github.com/ipfs/notes/issues/272#issuecomment-341996490 and https://github.com/ipfs/notes/issues/272#issuecomment-345897231. Those proposals tried to address this by ensuring that every node touched got sent back to the client.

However, you bring up a very real problem: we need to be able to sync things like git repos.

How we might handle this

I'm not sure if either of these solutions is sufficient. I need to be able to say "I already have everything from commit Y back, only give me nodes introduced between Head and Y". I may have fetched Y ages ago or from another peer.

IIRC, git does this by keeping local metadata. I wonder if we need to do the same.

aschmahmann commented 5 years ago

@Stebalien @hannahhoward @warpfork correct me if I'm wrong here, but we are already planning to use the Recursive selector to implement

"I already have everything from commit Y back, only give me nodes introduced between Head and Y"

The problem comes when I want to say: I have the heads of branches A, B, C, and the latest commits I already have in each of these branches are A0, B0, C0, where maybe B0 points to C0, and it's possible all 3 of these branches merged somewhere between A0, B0, C0 and A, B, C.

If we use the Send Bundled Requests solution then we send something like [RecursiveSelect(A to A0), RecursiveSelect(B to B0), RecursiveSelect(C to C0)]. This will get us back the right data; however, if we don't use any compound-query smarts and we're unlucky, the responder could send back all the nodes from C to C0, including the nodes from B0 to C0, before it starts any processing on B and notices the overlap.

If we use the Allow Append Requests option then we send RecursiveSelect(A to A0) (which has query ID QA), [RecursiveSelect(B to B0), attach to QA], [RecursiveSelect(C to C0), attach to QB]. Alternatively, we could create an ID for this particular message group, QG, and use that. As we start to receive nodes in the C branch that we already have, such as B0, we send a cancel command that tells GraphSync to stop sending B0 or its subnodes.
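
A requester-side sketch of that cancel flow, assuming hypothetical Block and cancel-message shapes (this is not the go-graphsync API): as blocks for query QC arrive, any block we already hold triggers a cancel for that subtree.

```go
package main

import "fmt"

// Block stands in for a received IPLD block, identified by its CID string.
type Block struct{ CID string }

// cancelMsg is an illustrative "stop sending this node or its subnodes"
// message tied to an ongoing query.
type cancelMsg struct {
	QueryID string
	Below   string
}

// handleIncoming records newly received blocks and emits a cancel for any
// block the requester already holds.
func handleIncoming(queryID string, have map[string]bool, incoming []Block) []cancelMsg {
	var cancels []cancelMsg
	for _, blk := range incoming {
		if have[blk.CID] {
			// We already have this node (e.g. B0) and everything under it,
			// so ask the responder to stop descending here.
			cancels = append(cancels, cancelMsg{QueryID: queryID, Below: blk.CID})
			continue
		}
		have[blk.CID] = true
	}
	return cancels
}

func main() {
	have := map[string]bool{"B0": true, "A0": true}
	cancels := handleIncoming("QC", have, []Block{{"C"}, {"C1"}, {"B0"}})
	fmt.Printf("%+v\n", cancels)
}
```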

You might notice that the "cancel" commands here are effectively the same as the sub/children selectors, since the command ends up being an unselector, i.e. a selector describing nodes we don't want. We could certainly limit the complexity of these unselectors to basically just "I have this subnode". However, I definitely see the temptation to make arbitrary selectors work here, and I would bet that folks such as @warpfork would have serious concerns about the possibly NP-complete problems we would then be putting on GraphSync responders.

Alternatively, there's a perhaps simpler (if more verbose and costly) solution. If we use BFS instead of DFS on the selectors and we notice that we already have B0, a node in C's history, we wait until we are sent one of B0's predecessor nodes (let's call it B1). Then, because we are using BFS and have received B1, we are guaranteed to have all of B0's siblings that are in C's history; let's call them S0, S1, .... We can then send a multi-request {[RecursiveSelect(S0 to C0), attach to QC], [RecursiveSelect(S1 to C0), attach to QC], ..., Cancel(Request C)}.

Since we already have the ability to cancel GraphSync requests this means the only new part of this scheme is the grouping of requests across multiple messages.
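
For concreteness, here is a sketch of the BFS replacement step described above, with illustrative request/cancel shapes (not a real GraphSync message format): once the parent B1 arrives, we issue recursive selects for the siblings we don't hold, attach them to the original query for GC purposes, and cancel the original request.

```go
package main

import "fmt"

// SubRequest is an illustrative selector request attached to an existing
// query so the responder keeps its sent-block list alive across them.
type SubRequest struct {
	Selector string
	AttachTo string
}

// MultiRequest groups the replacement sub-requests with a cancel of the
// original request, as in the scheme sketched above.
type MultiRequest struct {
	Requests []SubRequest
	Cancel   string // query ID to cancel
}

// replaceRequest builds the replacement multi-request once we know B1's
// children: siblings we already hold are skipped, the rest get their own
// recursive selects down to the stop node.
func replaceRequest(queryID, stopAt string, siblings []string, have map[string]bool) MultiRequest {
	mr := MultiRequest{Cancel: queryID}
	for _, s := range siblings {
		if have[s] {
			continue // already hold this subtree; no need to re-request it
		}
		mr.Requests = append(mr.Requests, SubRequest{
			Selector: fmt.Sprintf("RecursiveSelect(%s to %s)", s, stopAt),
			AttachTo: queryID,
		})
	}
	return mr
}

func main() {
	have := map[string]bool{"B0": true}
	// B1's children in C's history: B0 (already held) plus siblings S0, S1.
	mr := replaceRequest("QC", "C0", []string{"B0", "S0", "S1"}, have)
	fmt.Printf("%+v\n", mr)
}
```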

warpfork commented 5 years ago

There's more going on here than I can immediately mentally unpack and fully engage with, but there's a couple quick factual bits I can toss in...

Git does this with wantlists and havelists, and the reconciliation between them is not particularly free. Simply exchanging those messages is O(n) in both the wantlist length and the havelist length. Git does not know anything about branches and lineages when syncing. Here are implementations of this in a go-git project: wants, haves.
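
As a generic illustration of that cost (this is not the go-git code linked above), reconciling a wantlist against a havelist requires touching every entry of both lists at least once:

```go
package main

import "fmt"

// reconcile returns the wants the responder still needs to send. Every
// entry of both lists has to be transmitted and indexed before that
// decision can be made, so the cost is at least O(wants + haves).
func reconcile(wants, haves []string) []string {
	haveSet := make(map[string]bool, len(haves))
	for _, h := range haves { // O(len(haves))
		haveSet[h] = true
	}
	var toSend []string
	for _, w := range wants { // O(len(wants))
		if !haveSet[w] {
			toSend = append(toSend, w)
		}
	}
	return toSend
}

func main() {
	fmt.Println(reconcile([]string{"A", "B", "C"}, []string{"B"}))
}
```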

Git is great, but it's not necessarily correct to assume it's going to scale awesomely if you took all content in all repos ever and hucked them together in one big content-addressable store with no additional protocol improvements.

In general, these discussions need O's and n's attached to them or it's very hard to get anywhere.

vmx commented 5 years ago

Option 3 would be greatly helped by allowing data deduplication across queries/queries being grouped together.

That's kind of what I had in mind in the GraphSync (A) proposal: that GraphSync itself has enough state locally to resume queries.

aschmahmann commented 5 years ago

@vmx yes, and even with the existing GraphSync proposal there is enough information to be able to resume queries from some application level above GraphSync. Unfortunately, if we don't make some change to GraphSync (like one of the two mentioned above), the application-level resuming won't be able to be as efficient as it could be.

@warpfork I agree with your point about wanting more asymptotic quantification of what we're gaining/losing with particular changes and algorithms. Because most of these algorithms/strategies have performance highly dependent on DAG shape and pre-existing state (e.g. what data a GraphSync client already has), as well as networking factors like latency, I suspect anything we come up with is not going to look like O(n) but will instead be a function of many variables.

Let's look at a couple of cases where current GraphSync will get us our data, but will be more expensive than if it were tweaked with one of the mechanisms above:

Versioned Data/Resuming Interrupted Queries Example

If we take the graph from Resuming Interrupted Queries and assert that all h of the heads (e.g. A2, B0, C0, ...) point to a graph G with n nodes then, in the case of unlucky garbage collection, we end up receiving O(hn) nodes if we don't send any cancellations. Alternatively, if l nodes can be sent from the resolver to the requester during a single round-trip time, then the requester receives O(lh + h) nodes if we always send cancellations.

However, if we utilize either of the above proposals the requester ends up with only O(h) nodes.
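
Plugging illustrative numbers into the bounds stated above (the constants are made up; this just evaluates the formulas as written):

```go
package main

import "fmt"

// Evaluate the node counts stated above for h heads sharing a subgraph G of
// n nodes, with l blocks in flight per round trip. These mirror the bounds
// in the text; they are not a new analysis.
func main() {
	h, n, l := 5, 10000, 100

	noCancel := h * n     // O(hn): G is effectively re-sent once per head
	withCancel := l*h + h // O(lh + h): ~l blocks per head arrive before each cancel lands
	withDedup := h        // O(h): with either proposal, only the head nodes are new

	fmt.Println("no cancellations:    ~", noCancel, "nodes")
	fmt.Println("with cancellations:  ~", withCancel, "nodes")
	fmt.Println("with dedup/grouping: ~", withDedup, "nodes")
}
```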

Pathological highly overlapping graph

We have a DAG D where each node has b+1 children: b of the children point into a subgraph G, and the last child has b+1 children that follow the same pattern (unless it's the bottom of the DAG, in which case it has 0 children).

We initiated two GraphSyncs together, one for G and the other for D. We received G in its entirety before starting to receive anything from D.

Currently: The responder will not retain any references to G as it sends D, since they are garbage collected once the G request is finished. Therefore, as we use DFS to walk down D, every time the client receives the root of G it sends a cancel request followed by requests for all of that G root's remaining siblings under its parent node. This is b + Sum_(i=1..b) i = O(b^2) cancel requests + regular requests per node of D containing new information. For D with depth d (counting nodes not in G) this means O(d*b^2) requests are made. Additionally, if l nodes can be sent from the resolver to the requester during a single round-trip time, then an additional O(d*l) nodes from G that the requester already has are resent. If G has many more nodes than d(b^2+l), then while we've wasted resources it's still more efficient than if we hadn't sent any cancellations and new requests. However, if not, then our premature optimization has in fact made things worse.

Either of the above proposals: The responder keeps its references to G even as we're walking down D, since the two requests are bound together. This results in none of the excess O(d*[b^2+l]) messages from before.

What's missing: If instead of retrieving G and D from the same peer I ask for D without G (e.g. I already have it, I received it from an interrupted query for G + D, I'm asking for it from another peer, etc.), I still end up with the excess messages. This stems from our lack of a havelist and our reliance on cancel + re-request as a solution here.
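
For a feel of the numbers in the "Currently" case above, here are those bounds evaluated for some illustrative b, d, and l (made-up constants, just evaluating the formulas as written):

```go
package main

import "fmt"

// Evaluate the request and resend counts from the "Currently" case for a
// pathological DAG D of depth d, branching factor b+1, and l blocks in
// flight per round trip.
func main() {
	b, d, l := 10, 100, 50

	perNode := b + b*(b+1)/2 // b cancels + Sum_{i=1..b} i follow-up requests ~ O(b^2)
	requests := d * perNode  // O(d*b^2) requests over the whole walk of D
	resent := d * l          // O(d*l) already-held G nodes resent before cancels land

	fmt.Println("requests per D node:      ", perNode)
	fmt.Println("total extra requests:     ", requests)
	fmt.Println("already-held nodes resent:", resent)
}
```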

Conclusion

Having any way at all for the requester to inject information into the GraphSync process running on the resolver could help reduce extraneous messages. However, I'm a bit concerned about how deep the rabbit hole of using selectors for this could be. Therefore, I'm suggesting that maybe we can start by just using one or two simple mechanisms that can potentially save a lot of resources.