goodboy / tractor

A distributed, structured concurrent runtime for Python (and friends)
GNU Affero General Public License v3.0

How to do (next gen, SC oriented) remote error serialization for cross host/process propagation? #5

Open goodboy opened 6 years ago

goodboy commented 6 years ago

It's like the sloppiest and laziest thing atm..

Doesn't rpyc have some fancy way of doing this? It seems to have a homegrown traceback serializer. Here's their theory of operation.

I specifically don't want to go down the proxy route (one of tractor's tenets) but I think for exceptions it's a special case.

goodboy commented 6 years ago

More hints from celery on exception pickling?

goodboy commented 4 years ago

Thanks to @njsmith for pointing out the traceback serializers in jinja2 and also @dhirschfeld for pointing out tblib which seems to be derived from it.
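For anyone skimming, here is a rough stdlib-only sketch of what "serializing a traceback" can boil down to: flatten the frames into plain records that survive JSON, then rebuild a readable trace on the other side. This is not how jinja2 or tblib do it; the names `serialize_traceback` and `format_remote_traceback` and the payload layout are made up for illustration.

```python
import json
import traceback


def serialize_traceback(exc: BaseException) -> str:
    """Flatten an exception's traceback into JSON-safe frame records."""
    frames = [
        {
            "filename": frame.filename,
            "lineno": frame.lineno,
            "name": frame.name,
            "line": frame.line,  # source text, empty if it cannot be found
        }
        for frame in traceback.extract_tb(exc.__traceback__)
    ]
    return json.dumps(
        {"type": type(exc).__name__, "message": str(exc), "frames": frames}
    )


def format_remote_traceback(payload: str) -> str:
    """Rebuild a readable traceback string on the receiving side."""
    data = json.loads(payload)
    # StackSummary.from_list accepts (filename, lineno, name, line) tuples.
    summary = traceback.StackSummary.from_list(
        [(f["filename"], f["lineno"], f["name"], f["line"]) for f in data["frames"]]
    )
    body = "".join(summary.format())
    return f"Remote traceback:\n{body}{data['type']}: {data['message']}"


def fail():
    raise ValueError("boom")


try:
    fail()
except ValueError as err:
    payload = serialize_traceback(err)

print(format_remote_traceback(payload))
```

Shipping the source line along with each frame means the receiver does not need the same files on disk, which matters once the two sides run different code.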

ryanhiebert commented 4 years ago

I think that we want to have error propagation that somehow includes explicit mention of host boundaries in a readable way. That's probably the relatively easy part of the issue, but I don't want it to be overlooked. The traceback should show when the host/process changes, which means that the code itself may be different. If there were some way to give a good representation of which version of the code it found on the other side as well, that seems really excellent.

Of course, if everything really is on the same version of the same code, then it'll be redundant. And we could, when things are happy, potentially unserialize the exceptions and raise them as more than just a wrapper exception with the traceback from the other server, which would be cool. That's not something we can rely on in all cases though, so we need a good fallback for when things don't match and the exception and traceback don't propagate well.
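To make the "rebuild when possible, fall back to a wrapper otherwise" idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the `RemoteError` wrapper, the `pack_error` / `unpack_error` helpers, the message layout, and the choice to only rebuild builtin exception types are not taken from tractor.

```python
import builtins
import os
import socket
import traceback


class RemoteError(Exception):
    """Hypothetical fallback wrapper for errors that cannot be rebuilt."""

    def __init__(self, type_name, message, origin, remote_tb):
        super().__init__(f"{type_name}: {message} (raised on {origin})")
        self.type_name = type_name
        self.origin = origin
        self.remote_tb = remote_tb


def pack_error(exc):
    """Sender side: reduce the exception to plain, picklable data."""
    return {
        "type_name": type(exc).__name__,
        "message": str(exc),
        "origin": f"{socket.gethostname()}:{os.getpid()}",
        "remote_tb": "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
    }


def unpack_error(msg):
    """Receiver side: rebuild a builtin type if we can, else wrap."""
    # Mark the host/process boundary explicitly in the carried traceback.
    boundary = f"\n--- boundary: raised on {msg['origin']} ---\n"
    wrapper = RemoteError(
        msg["type_name"], msg["message"], msg["origin"], boundary + msg["remote_tb"]
    )
    cls = getattr(builtins, msg["type_name"], None)
    if isinstance(cls, type) and issubclass(cls, Exception):
        try:
            rebuilt = cls(msg["message"])
        except Exception:
            return wrapper  # constructor needs arguments we do not have
        rebuilt.__cause__ = wrapper  # keep the remote details reachable
        return rebuilt
    return wrapper
```

With this shape a local `except ValueError:` still works for a remote `ValueError`, while an application-specific type that only exists on the far side degrades to `RemoteError` with the remote traceback text attached.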

This is all really interesting, and I don't know what I'm talking about, but it looks very neat.

goodboy commented 4 years ago

> error propagation that somehow includes explicit mention of host boundaries in a readable way

I think this should mostly be included in a mailbox / address in every message (the actor model way). Right now tractor is kind of doing this by having each portal aware of the far end address on either side. I'm trying to think of whether it matters if an actor is local to the host or remote - maybe just for certain network-comms related error handling? A lot of this will be delegated to lower layers in tractor (depending on IPC transport - TCP versus NNG etc.) so I guess relying on specific error types that might change across versions might pose a problem? I feel like a decently designed exception inheritance tree should mostly cover this?

> If there were some way to give a good representation of which version of the code it found on the other side as well, that seems really excellent.

In my mind the primary code that cares about remote errors is an actor's supervisor, and I wonder, should a super care about what version of the code is being run? Is this maybe the concern of something else? For example, if the application required that info couldn't the parent just immediately ask for a version from its child just after spawning? When will it be useful to a super in the general case to know about its child's code version? Maybe in a system where there is hot code swapping like in Erlang? I'm still not even sure if such a feature should be built into the core of tractor - might be better oriented as a small "native app" on top?

I think the main question is how much a remote super needs to know about a child's error types / internal code. To me, too much coupling here would mean the super is more part of the app than part of the distributed computing system - which maybe is fine in some cases but then won't the super need to have special consideration for details of the child anyway? At face value it would seem to me a super needs to know as much about a child's remote errors as a try/except block needs to know about code it calls (that may change in future revisions). The except: blocks here can be many, specific, and as nested as desired?

If a super is supposed to fulfill its conventional role then I think some set of error "classes" might be necessary to help (custom) supervisor authors determine what types of failure recovery (or cancellation) logic is available. Having a set of contracts for what errors should be raised in which situation is something that can be iterated on over time if designed right - but still there will be a foreseeable compatibility problem between super handlers and error types when multiple versions run in the same cluster(s).
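A tiny sketch of what such error "classes" could look like from the supervisor author's side. The class names, the `Action` enum and `decide()` are all hypothetical, not a tractor API; the point is only that the super matches on broad base classes, the same way a try/except block would.

```python
from enum import Enum


class RemoteFailure(Exception):
    """Hypothetical root of the remote error contract."""


class TransportFailure(RemoteFailure):
    """The IPC layer broke; the child itself may be fine."""


class ChildCrashed(RemoteFailure):
    """The child's task raised an unhandled error."""


class ChildCancelled(RemoteFailure):
    """The child was cancelled on purpose."""


class Action(Enum):
    RECONNECT = "reconnect"
    RESTART = "restart"
    IGNORE = "ignore"
    ESCALATE = "escalate"


def decide(exc: BaseException) -> Action:
    """Map an error to a recovery action using only the base classes."""
    try:
        raise exc
    except TransportFailure:
        return Action.RECONNECT
    except ChildCrashed:
        return Action.RESTART
    except ChildCancelled:
        return Action.IGNORE
    except Exception:
        # Unknown error type: do not guess, hand it to the parent.
        return Action.ESCALATE


# A subclass added by a newer child version still lands in a known bucket.
class DiskFull(ChildCrashed):
    pass


print(decide(DiskFull("no space left")))
```

As long as newer child versions only add subclasses under the agreed base classes, an older super keeps making a sensible choice, which is one way to soften the multi-version compatibility problem.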

Anyway, too many new questions :smirk_cat:!

The short assertion is that we already do pack task info in the exception msg and announce / pack the actor uid in the RemoteActorError on the receiving side.

I don't at the moment see any problem with requiring all such remote errors to include the address / actor uid / task uid info in every error. It's probably just going to make logging system integration that much easier and more useful. I also don't see a problem with reconstructing remote errors into local objects other than performance.
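Roughly what "required in every error" might look like, as a sketch only. The `ErrorMsg` dataclass, its field names, and the `(name, uuid)` shape of the actor uid are assumptions made for this example, not tractor's actual wire format.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ErrorMsg:
    """Hypothetical error message that always carries its origin."""

    actor_uid: tuple  # assumed shape: (name, uuid)
    task_uid: str
    address: tuple    # assumed shape: (host, port)
    type_name: str
    message: str

    @classmethod
    def from_wire(cls, raw: dict) -> "ErrorMsg":
        """Reject any error message missing an identity field."""
        missing = [f.name for f in fields(cls) if f.name not in raw]
        if missing:
            raise ValueError(f"error msg missing required fields: {missing}")
        return cls(
            actor_uid=tuple(raw["actor_uid"]),
            task_uid=str(raw["task_uid"]),
            address=tuple(raw["address"]),
            type_name=raw["type_name"],
            message=raw["message"],
        )

    def describe(self) -> str:
        """One-line form that is easy to feed to a logging system."""
        name, uid = self.actor_uid
        host, port = self.address
        return (
            f"[{name}:{uid} task={self.task_uid} @ {host}:{port}] "
            f"{self.type_name}: {self.message}"
        )


msg = ErrorMsg.from_wire(
    {
        "actor_uid": ["worker", "abc123"],
        "task_uid": "t1",
        "address": ["127.0.0.1", 1616],
        "type_name": "ValueError",
        "message": "boom",
    }
)
print(msg.describe())
```

Validating at the receiving edge means every handler and log line downstream can rely on the uid and address being present.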

goodboy commented 11 months ago

Heh, so we're already kinda requiring the whole uid-in-error-as-msg bit as part of the soon-to-land #357, and we might as well use the new multi-address support we're experimenting with in #367.

Addresses in every error seem like a handy thing for unwinding complex inter-actor-tree service failures, especially if we ever get to multi-host supervision APIs down the road ..