pikers / piker

(e2e) foss trading for non-tinas
GNU Affero General Public License v3.0
103 stars 17 forks source link

`tractor` bug: serialzation error isn't propagated?! #244

Open goodboy opened 2 years ago

goodboy commented 2 years ago

Guilty line of code when using msgpack in tractor.

Super odd, and pretty sure @iamzoltan found this before but, basically a serialization error due to using msgpack (instead of msgspec) and sending a set causes a crash in tractor.Context.started() but this error isn't propagated upward to the calling actor (emsd in this case).

Instead we get the msg error but the contents of that message aren't an error and instead contain a return?

Dec 14 11:39:42 (emsd, 8459, _emsd_main)) [PDB] tractor._debug _debug.py:529
Attaching to pdb in crashed actor: ('emsd', 'e371c0eb-bf2f-4e79-942f-a5ed5a5eff9a')

Traceback (most recent call last):
  File "/home/goodboy/repos/tractor/tractor/trionics/_mngrs.py", line 172, in maybe_open_context
    value = cache.values[key]
KeyError: ('binance', 'ethusdt')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/goodboy/repos/tractor/tractor/_actor.py", line 186, in _invoke
    await chan.send({'return': await coro, 'cid': cid})
  File "/home/goodboy/repos/piker/piker/clearing/_ems.py", line 1023, in _emsd_main
    async with (
  File "/usr/lib/python3.9/contextlib.py", line 181, in __aenter__
    return await self.gen.__anext__()
  File "/home/goodboy/repos/piker/piker/clearing/_ems.py", line 339, in maybe_open_brokerd_trades_dialogue
    relay = await self.nursery.start(
  File "/home/goodboy/repos/piker/39/lib/python3.9/site-packages/trio/_core/_run.py", line 1041, in start
    task._eventual_parent_nursery = self
  File "/home/goodboy/repos/piker/39/lib/python3.9/site-packages/trio/_core/_run.py", line 815, in __aexit__
    raise combined_error_from_nursery
  File "/home/goodboy/repos/piker/piker/clearing/_ems.py", line 430, in open_brokerd_trades_dialogue
    async with (
  File "/usr/lib/python3.9/contextlib.py", line 181, in __aenter__
    return await self.gen.__anext__()
  File "/home/goodboy/repos/piker/piker/clearing/_paper_engine.py", line 519, in open_paperboi
    async with portal.open_context(
  File "/home/goodboy/repos/piker/39/lib/python3.9/site-packages/async_generator/_util.py", line 34, in __aenter__
    return await self._agen.asend(None)
  File "/home/goodboy/repos/tractor/tractor/_portal.py", line 428, in open_context
    first = msg['started']
KeyError: 'started'

And here is the msg contents:

(Pdb++) msg
{'return': None, 'cid': '19c78c86-4ded-4de0-9cb5-9905ffd74325'}

Not sure what the heck is going on but simple tests introduced to tractor don't seem to expose this problem which makes me think it's something to do with a race in ContextCancelled being raised in the target task and/or the wrong error being capture and set as Context._errror? it's that or some kinda wacky task-context id mixup is happening.

Needs some deep debugging inside _invoke() likely...

goodboy commented 2 years ago

Boom got it.

It's because we call await ctx.cancel() inside a finally:...from the containing open_feed(): block...

Removing this makes the serialization error propagate. Need to make a test for this in tractor..

Still no idea how to reproduce with unit test 😂

It seems to be some kind of race where the .started() call errors but the error is rewrapped and discarded during cancellation processing triggered by the calling actor? I'm lost how this is possible atm..