ipld / go-ipld-prime

Golang interfaces for the IPLD Data Model, with core Codecs included, IPLD Schemas support, and some handy functional transforms tools.
MIT License
133 stars 50 forks source link

Controlled Traversals #303

Open hannahhoward opened 2 years ago

hannahhoward commented 2 years ago

Graphsync uses selectors but is primarily focused on the block layer -- all of it's communication is based on intercepting BlockReadOpener functions. Moreover, Graphsync wants to be able to stop a selector traversal in the middle and pick it up later. This is needed to support pausing and resuming requests.

Currently, the traversal API is a blocking call with a callback that only returns assembled IPLD nodes.

Inside of the go-graphsync code base, we have a large and complicated module that wraps a traversal in a go routine in order to provide much finer grained control on the block level: https://github.com/ipfs/go-graphsync/blob/main/ipldutil/traverser.go

It would be much more ideal if go-ipld-prime could provide a non-blocking call that returns the same controller interface:

// Traverser is an interface for performing a selector traversal that operates iteratively --
// it stops and waits for a manual load every time a block boundary is encountered
type Traverser interface {
    // IsComplete returns the completion state (boolean) and if so, the final error result from IPLD
    IsComplete() (bool, error)
    // Current request returns the current link waiting to be loaded
    CurrentRequest() (ipld.Link, ipld.LinkContext)
    // Advance advances the traversal successfully by supplying the given reader as the result of the next IPLD load
    Advance(reader io.Reader) error
    // Error errors the traversal by returning the given error as the result of the next IPLD load
    Error(err error)
    // Shutdown cancels the traversal
    Shutdown(ctx context.Context)
    // NBlocksTraversed returns the number of blocks successfully traversed
    NBlocksTraversed() int
}

// retuns immediately, advances only on calls to the traverser
func (prog progress) WalkControlled(n datamodel.Node, s selector.Selector, fn AdvVisitFn) (Traverser, error) {
  ...
}

A initial version could simply pull over the graphsync code. Perhaps later optimizations could remove the go-routine entirely.

I'm not sure if this is the right interface exactly for inside ipld-prime. The goal here is to illustrate the problems and needs of Graphsync and orient us towards building a better interface.

willscott commented 2 years ago

noting that Advance on a Traverser will get hard if there's an ADL involved

hannahhoward commented 2 years ago

the current graphsync traverser operates entirely on intercepting the BlockReadOpener, and based on your PR should not be affected. That said, I wonder if we actually want to think about this as a more low level utility:

// Controller is an interface that provides imperative control over block loading
type Controller interface {
    // Current request returns the current link waiting to be loaded
    CurrentRequest() (ipld.Link, ipld.LinkContext)
    // Advance advances the traversal successfully by supplying the given reader as the result of the next IPLD load
    Advance(reader io.Reader) error
    // Error errors the traversal by returning the given error as the result of the next IPLD load
    Error(err error)
    // Shutdown cancels the traversal
    Shutdown(ctx context.Context)
    // NBlocksLoaded returns the number of blocks successfully loaded
    NBlocksLoaded() int
}

func NewControlledBlockReadOpener() Controller, BlockReadOpener {
   ...
}
rvagg commented 2 years ago

^ indeed that

Because, the majority of our uses of selectors in the real world are simply about watching block loads. A fact I've been reflecting a lot on lately (currently looking at selector-like stuff in JS with a similar eye). We have really one practical case currently where it's about navigating to a node which is path-as-selector (https://github.com/ipld/go-ipld-selector-text-lite - but even that's really trying to get at the top of a single block, not an arbitrary node, and nothing fancy at all). The rest are, "here's a root, run an explore-all and let me watch what you load so I can do something with those blocks in that specific order". And graphsync is one of those.

So, while we might make theoretical cases for selectors-over-ADLs, we don't yet have real use-cases for them and we need to make sure we serve the majority case of "just let me watch your block loads" and in graphsync's case, "just let me watch your block loads and let me control the advancement from one block to the next". So in API terms, a developer needs to be able to view selectors in those terms, even if it's also possible to view it from the other side of "I have a selector that gets me to a node / some nodes in a graph that may or may not involve ADLs, I just want you to get me to that node / those nodes".

willscott commented 2 years ago

We do have a case for selectors over ADLs still, which is "give me the blocks that make up this hamt" or more importantly "give me the blocks that make up this directory" which is not something we can express without the unixfs ADL

mvdan commented 2 years ago

Let me see if I'm up to speed: right now, with BlockReadOpener, you have a callback for every block being read. This is enough to pause a traversal, which can last for a while if a separate goroutine is used.

Presumably, the callback also allows you to stop the traversal entirely, because you could return a sentinel error signaling that intent. If this doesn't exist right now, we could add it, similar to https://github.com/golang/go/issues/47209.

Can you help me understand why imperative methods like Advance, Error, and Shutdown would be better from the point of view of graphsync? I can imagine them being easier to use depending on what you need to do, but as long as blocking one goroutine isn't a problem, it seems like either method is equally powerful to me.

CurrentRequest and NBlocksLoaded seem to be about querying the current state of the traversal. Isn't that kind of thing what https://pkg.go.dev/github.com/ipld/go-ipld-prime/traversal#Progress is for? If we need extra information or query methods there, I imagine they could be added as well.

mvdan commented 2 years ago

Having said the above, I'm not opposed to an "imperative traversal" layer as a wrapper for ease of use, if that can be useful for some cases like graphsync. At the same time, I'd prefer not to duplicate API if we can avoid it - go-ipld-prime has enough APIs as it is :) Please let me know if I've misunderstood anything or lack part of the context here.

warpfork commented 2 years ago

This is roughly the same ask as https://github.com/ipld/go-ipld-prime/issues/213, no?

warpfork commented 2 years ago

+100 to Will's comment that anything we do should be ready to deal with ADLs. We want Selectors-over-ADLs now, because that's what's behind very practical, very real, user-facing stories like "use selectors to ask for all the blocks needed to get a file out of a unixfsv1 directory". (We even have this already, at least infrastructurally; it's mostly plumbing it all the way to users that may have missing pieces, afaict.)

mvdan commented 2 years ago

Worth noting that, with https://github.com/ipfs/go-graphsync/pull/300, the layer that graphsync has above an ipld-prime traversal is significantly easier to understand. In paticular, there are no longer multiple channels at play.

I'm going to unassign myself from here for now, because it doesn't seem like it's an urgent need when compared to some of the other roadmap tasks.

One mental note from a recent chat about this with @hannahhoward: this API would be particularly useful if the traversal state can be serialized and restored. For example, if a client disconnects in the middle of a traversal, one can imagine a server serializing the traversal state and storing it in a file or database. When the client reconnects, the server could load the state and continue where it left off.

If serializing and restoring states isn't something that ipld-prime can support, then it makes little sense for us to support imperative traversals directly - because one could just implement them on top of the current traversal API, which stores the state in the stack. And a stack is that, after all - a state that lives in memory and cannot be stored or restored.