ipfs / go-graphsync

Initial Implementation Of GraphSync Wire Protocol
Other
100 stars 38 forks source link

Optimized selector traversal over graphsync #253

Open hannahhoward opened 2 years ago

hannahhoward commented 2 years ago

What

What we'd ultimately like to do is prevent duplicate selector traversals of the same block for certainly types of IPLD graph traversals. While the logic to detect "duplicates" is quite complicated and will evolve over time (our very first implementation is to special case just the traverse-all selector), we need a way to communicate about it over the network.

Rather than try to communicate about how to perform optimizations between peers, we're going to allow each side to optimize according to their OWN best interest and let the other know what they did. Most likely responders will almost always turn this on cause it minimizes their processing time. For client requests, we'll probably make it configurable when you initialize graphsync or perhaps at the request level as well. As we evolve our logic for detecting duplicate traversals, this will enable us to upgrade only one side and still gracefully finish a request.

How

currently, the schema for metadata about links traversed is as follows, encoded in the extension `graphsync/response-metadata:

type LinkMetadata struct {
  link Cid
  blockPresent Bool
}

type ResponseMetadata [LinkMetadata]

This ticket proposes a graphsync/metadata/1.1 extension of this format:

type BlockStatus enum {
  | Present  ("0")
  | Absent   ("1")
  | DuplicateTraversal ("2")
} representation int

type LinkMetadata struct {
  link Cid
  blockStatus BlockStatus
}

type ResponseMetadata [LinkMetadata]

When a responder sends DuplicateTraversal, it is telling the requestor that they their own traversal skipped this link, even though they have this block, because they believe it to be a duplicate. The requestor can expect no blocks or links to be sent for this part of the tree, whether or not the responder is telling the truth that it's a duplicate.

When a requestor receives this message, how it proceeds is up to the local configuration around duplicate traversals, and whether the traversal is indeed a duplicate according to local logic for detecting duplicate traversals. If it does proceed down the path, it can be assured that till it returns to that part of the traversal, all blocks loads will be local based on data previously received. If there is no local block, the requestor should not wait for a remote block.

Backwards Compatibility

It's safe to assume many requestors will support only the legacy metadata format for some time, but responders may still want to limit duplicate traversals for requestors supporting the new format.

With this in mind, I think we had better add a requestor side extension graphsync/response-metadata-version:

type Version int

And those that send the extension with a 1.1 are telling the responder ahead of time they support metadata 1.1

Alternatively, we know we don't want metadata to be an extension at all but rather part of the core protoocol, so maybe we should fold this into graphsync 1.1 encoded as CBOR (https://github.com/ipfs/go-graphsync/issues/88, https://github.com/ipld/specs/pull/354, https://github.com/ipld/specs/pull/355)

This is a meta-issue that will lead to implementation tickets once I have sign-off from @warpfork and others

hannahhoward commented 2 years ago

Note re labels -- I believe this will be intermediate once an implementation path is well defined, but for now it's blocked pending approval of the overall strategy

rvagg commented 2 years ago

This approach seems fine to me, although "DuplicateTraversal" may be too specific, unless you're considering adding further specific reasons to the enum. Some examples (pending work on selectors upstream):

These could all be separate reasons, and maybe that's OK and we're anticipating expanding the enum? But if we just want to send a "I have good reasons not to send you this block and it's not a failure", then the name should probably be changed.