Feature Request: Walk method that allows calling code to control block loading

ipld / go-ipld-prime

Golang interfaces for the IPLD Data Model, with core Codecs included, IPLD Schemas support, and some handy functional transforms tools.

MIT License

133 stars 50 forks source link

Feature Request: Walk method that allows calling code to control block loading #213

Open hannahhoward opened 3 years ago

hannahhoward commented 3 years ago

Hello!

For go graphsync, we need to be able to pause a selector traversal at a block boundary and come back to it later. Currently, we deal with this by severely messing with the loader to essentially have it hang until the request is unpaused. This requires an extra go-routine and quite a bit of finagling.

It would be really awesome to have a version of walking that had an interface like this:

type TraversalController interface {
 // determine if the traversal is done, and if so, what error state it ended up in
 IsComplete() (bool, error)
 // get data about the next link to be loaded
 CurrentRequest() (ipld.Link, ipld.LinkContext error)
 // supply the io.Reader as the block stream for the next link, and advance the traversal to the next block boundary
 Advance(io.Reader)
 // Supply an error for the next block resolution, either causing the traversal to terminate or skip going down
 // the path for this link if you supply traversal.SkipMe
 Error(err error) 
}

func WalkControlled(n ipld.Node, s selector.Selector, fn AdvVisitFn) TraversalController {
...
}

You can see how we hack around this with an extra go routine in go-graphsync here: https://github.com/ipfs/go-graphsync/blob/master/ipldutil/traverser.go

And you can see I sketched out beginnings of an implementation here: https://github.com/ipld/go-walker/blob/main/pkg/controlled/controlled.go

The biggest challenge is you essentially have to maintain a psuedo stack frame of nodes traversed.

Also, the interface for TraversalController is largely designed around the limitations of the current WalkAdv() implementation and it may be possible to make it much simpler -- I'm not wedded to it.

mvdan commented 3 years ago

Could you please expand a little on why the current mechanism with an extra goroutine is undesirable? I think it is the fairly standard way to stop I/O processes in Go. For example, if you supply an io.Reader and want it to stop for a minute, you'd essentially make the Read goroutine block for a minute before returning.

The biggest challenge is you essentially have to maintain a psuedo stack frame of nodes traversed.

This is what makes me think whether adding this sort of API is really a net benefit - you're essentially trying to mimic what a goroutine would already do for you, with its stack :)

I think what a goroutine really can't get you is pausing a walk, storing the state somewhere (e.g. disk), and resuming later. If the state is in a goroutine's stack, it's nearly impossible to save and restore it, but it's possible if you store the stack yourself.

hannahhoward commented 3 years ago

A couple reasons:

go routines invariably introduce complexity and race conditions. I'm basically moving to my own programming model of never using them except when the actual goal is to do something concurrently. For example, the traverser in graphsync is one I just discovered a new potential race condition in, despite working and maintaining it for almost a year. Maybe I'm just a terrible programmer, but I'm pretty sure go's concurrency stuff really isn't that effective for avoiding races when using a normal human sized brain. It's also by and large harder to test.
I would rather know what state I am maintaining than trust the go routine to manage the stack and just wonder what's inside. This is particularly important when effectively executing a remote query on someone's behalf, which is what go-graphsync does-- so when the query is paused, I would like not to pay the 2K penalty plus whatever other overhead there is from both the stack plus whatever heap references are maintained. (We still need to timeout these pauses at some point as well. Paused requests in go-graphsync are currently a hidden DOS vector that I'm somewhat worried about . Also, one way to mitigate would be to put stuff on disk -- which we can't do with a go routine as you said)

mvdan commented 3 years ago

1 - I agree that goroutines add complexity and potential races. That said, turning a relatively simple "Traverse" API into a "TraversalController" interface also adds considerable complexity of its own :) You perhaps reduce the chances of data races, but you add other kinds of footguns like not calling the methods in the right order, or using a controller outside its lifetime, or otherwise breaking the contract of its API.

I do agree with trying to not use extra goroutines in general, for what it's worth. But they are a fine tool if all you want to do is temporarily halt a blocking call. The alternative is to do what you'd do in NodeJS, with "stepping" APIs that make heavy use of callbacks. I don't particularly find that to have less footguns :) Unless what you want is to be in full control of the state and store/restore it, as mentioned before.

2 - Being in control of the traversal state, and being able to push it to disk, is what feels like a compelling argument to me. If that's the main driver behind the new API, I would agree with it.

That said, I don't think we should be prematurely worrying about the overhead of a goroutine. Separately storing the stack-like state in memory, or saving/restoring the state, would also have its own overhead. Unless you are keeping close to a million traversals blocked at a time, I don't think the goroutine stack's overhead will matter much.

mvdan commented 3 years ago

Maybe I'm just a terrible programmer, but I'm pretty sure go's concurrency stuff really isn't that effective for avoiding races when using a normal human sized brain. It's also by and large harder to test.

For the sake of completeness, I assume you're aware of the data race detector :) It's by far the most useful tool against concurrency mistakes when using multiple goroutines. It only covers memory data races, not logic races, but still. I seem to recall that Lotus might not be able to use the data race detector due to its hard-coded goroutine limit, but that's being worked on :)

hannahhoward commented 3 years ago

We're mostly worried about logic races, not data races. It is a source of constant bugs in deal making.

hannahhoward commented 3 years ago

@mvdan one additional clarification:

turning a relatively simple "Traverse" API into a "TraversalController" interface also adds considerable complexity of its own

I am not proposing removing the existing API -- I'm suggesting adding a second.

You perhaps reduce the chances of data races, but you add other kinds of footguns like not calling the methods in the right order, or using a controller outside its lifetime, or otherwise breaking the contract of its API.

yea, the above interface is just what I've implemented so far, based on not having access to go-ipld-prime internals. I believe you can make it significantly simpler, and reduce possibilities of using it in the wrong order.

one way to reduce foot guns would be to make it functional

type Traversal interface {
 Request() (ipld.Link, ipld.LinkContext, error)
 Advance(io.Reader) (isComplete bool, next Traversal, err error)
 Error(err error) (isComplete bool, next Traversal, err error)
}

I'm not clear you even need an Error function since you have controller of the execution, so you can just return it directly. Then you'd have a Skip() function which is way more natural than the current practice of returning traversal.SkipMe as an error from the loader.

mvdan commented 3 years ago

I am not proposing removing the existing API -- I'm suggesting adding a second.

I thought of that too, shortly after writing my replies above. I think that would be a good compromise. If someone wants the more advanced control - either to avoid extra goroutines, or to store the state to disk - they could drop down one level.

We're mostly worried about logic races, not data races. It is a source of constant bugs in deal making.

I feel your pain. I'm not entirely convinced that replacing one kind of footgun with others is worth it here, but I'm also not an expert with Lotus deals and IPLD traversals, so I'll ultimately defer that to you and Eric :)

warpfork commented 3 years ago

Guess I should weigh in briefly on this:

Am sharing the moderate skepticism that building one's "stack" in application level is gonna be better than using the natural stack. (I've... written a handful of libraries that take this tactic, and without getting into details, I'd say holistically, I usually regret it. While it's true that technically, one gains the ability to have more detailed control... it ends up being a lot of work to get up to as good as the natural stack. Easy to underestimate this, IME.)
But if it comes as a new API? Yeah, sure. Absolutely. Why not? We've got a whole package's worth of namespace for this; let's use it freely.
"PRs welcome" :) I'm unlikely to prioritize this personally in the immediate future.