tomerfiliba-org / rpyc

RPyC (Remote Python Call) - A transparent and symmetric RPC library for python
http://rpyc.readthedocs.org

Help with debugging a RPyC-related issue: KeyError in colls.py - how to debug this further? #526

Open DavidAntliff opened 1 year ago

DavidAntliff commented 1 year ago

I'm looking for some help debugging this, please.

I'm using RPyC 5.3.0 in an Ubuntu 22.04 Docker container with Python 3.9, and rarely (but too frequently) I see the following failure when trying to read a value from a remote memoryview (the variable mv here, of type <netref class 'rpyc.core.netref.type'>). This is running under pytest:

memory_access.py:59: in read
    return mv[offset]
/home/docker/.local/lib/python3.9/site-packages/rpyc/core/netref.py:261: in method
    return syncreq(_self, consts.HANDLE_CALLATTR, name, args, kwargs)
/home/docker/.local/lib/python3.9/site-packages/rpyc/core/netref.py:63: in syncreq
    return conn.sync_request(handler, proxy, *args)
/home/docker/.local/lib/python3.9/site-packages/rpyc/core/protocol.py:718: in sync_request
    return _async_res.value
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
self = <AsyncResult object (ready) at 0x7f08413a79f0>
    @property
    def value(self):
        """Returns the result of the operation. If the result has not yet
        arrived, accessing this property will wait for it. If the result does
        not arrive before the expiry time elapses, :class:`AsyncResultTimeout`
        is raised. If the returned result is an exception, it will be raised
        here. Otherwise, the result is returned directly.
        """
        self.wait()
        if self._is_exc:
>           raise self._obj
E           _get_exception_class.<locals>.Derived: ('subprocess.CompletedProcess', 187650931238928, 281473569049664)
E           
E           ========= Remote Traceback (1) =========
E           Traceback (most recent call last):
E             File "/usr/lib/python3.8/site-packages/rpyc/core/protocol.py", line 355, in _dispatch_request
E               args = self._unbox(args)
E             File "/usr/lib/python3.8/site-packages/rpyc/core/protocol.py", line 321, in _unbox
E               return tuple(self._unbox(item) for item in value)
E             File "/usr/lib/python3.8/site-packages/rpyc/core/protocol.py", line 321, in <genexpr>
E               return tuple(self._unbox(item) for item in value)
E             File "/usr/lib/python3.8/site-packages/rpyc/core/protocol.py", line 323, in _unbox
E               return self._local_objects[value]
E             File "/usr/lib/python3.8/site-packages/rpyc/lib/colls.py", line 110, in __getitem__
E               return self._dict[key][0]
E           KeyError: ('subprocess.CompletedProcess', 187650931238928, 281473569049664)
/home/docker/.local/lib/python3.9/site-packages/rpyc/core/async_.py:108: KeyError

My code performs many, many such read/write calls over a netref'd memoryview, and it's always one particular read that fails, if it fails at all. The access offset is correct.
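For context, the access pattern is roughly the following (a hedged sketch only; the host, port, and get_buffer() service method are placeholders rather than my actual code):

```python
import rpyc

# Hypothetical sketch of the access pattern: the server exposes a memoryview,
# and the client reads single bytes through the resulting netref.
# Every index access is a synchronous remote call.
conn = rpyc.connect("server-host", 18861)   # placeholder host/port
mv = conn.root.get_buffer()                 # netref to a memoryview on the server
value = mv[0x59]                            # occasionally fails with the KeyError above
```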

I'm trying to gather clues, so I have a few questions, please:

  1. How should I interpret this trace? The client expects to read an int from the memoryview (return mv[offset]), but the server raises this exception instead. What does that indicate went wrong, and what could be causing it?
  2. Is the key subprocess.CompletedProcess coming from an internal RPyC mechanism, or from my own code (or perhaps third-party code I call)? None of my memoryview-related code touches subprocess, but there are calls to subprocess elsewhere in my program (on the server side).

Until recently I was using RPyC 5.1.0, and I was seeing something similar: the same server exception was raised (as a debug-level log entry), but the client was oblivious to it and saw no exception; the correct value would still be returned. Occasionally when this happened, though, the client would get None back from return mv[offset], which is of course incorrect and caused downstream code to fail. Updating to 5.3.0 seems to make the exception propagate to the client rather than returning None.


EDIT: I've put a try/except/breakpoint around line 322 in _unbox, to catch the error when label == consts.LABEL_LOCAL_REF, and inspected self._local_objects. It's very large (589 entries) and has six entries with subprocess.CompletedProcess in the key. All six match the first number of the failing key's tuple, but none match the second number (and that number in the failing key is significantly larger than any of the numbers in the _local_objects collection).

So it looks like this subprocess.CompletedProcess value comes from my own code, but the entry is missing when RPyC unboxes it. What could lead to this happening? It seems to occur slightly more often if I load the server's CPU up to 100% (to slow it down a bit).
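For reference, the instrumentation was roughly equivalent to the following monkeypatch (a sketch only; Connection._unbox and Connection._local_objects are private rpyc 5.x internals and may differ between versions):

```python
from rpyc.core import protocol

_orig_unbox = protocol.Connection._unbox

def _debug_unbox(self, package):
    try:
        return _orig_unbox(self, package)
    except KeyError as exc:
        # Drop into the debugger so self._local_objects can be inspected and
        # the missing id pack compared against the surviving entries.
        print("missing id pack:", exc.args[0])
        breakpoint()
        raise

protocol.Connection._unbox = _debug_unbox
```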

comrumino commented 1 year ago

This is difficult to triage without a test case to reproduce locally, but here are my first thoughts on the topic:

  1. ('subprocess.CompletedProcess', 187650931238928, 281473569049664) is the id pack for an instance of subprocess.CompletedProcess. See rpyc.lib.__init__.get_id_pack; this is basically how RPyC uniquely identifies objects (there's a short sketch after this list).
  2. It's trying to unbox a netref that it has not seen before, which seems problematic to say the least. If it happens more often as load increases, it is most likely a race condition. Are you able to reproduce this issue on the latest version?
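A minimal sketch of what an id pack looks like (get_id_pack is an internal helper rather than public API, so its exact return shape may vary between releases):

```python
import subprocess
from rpyc.lib import get_id_pack   # internal helper, not public API

# Inspect the id pack of a locally created CompletedProcess instance:
# roughly (qualified class name, id of the class/object, id of the instance).
proc = subprocess.CompletedProcess(args=["true"], returncode=0)
print(get_id_pack(proc))
# e.g. ('subprocess.CompletedProcess', 187650931238928, 281473569049664)
```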
DavidAntliff commented 1 year ago

Thanks for getting back to me. I will try with 5.3.1 as time allows.

This does seem like quite a rare thing: it only occurs within our fairly intensive, long-running integration test system, and rarely at the level of any individual test, but we see it often because we run so many tests. We've discovered that skipping specific tests seems to ensure it does not occur, so it appears to be related to certain tests and therefore perhaps to certain kinds of workload. I'm afraid it would be impractical to share this code as it's company-owned, and I haven't been successful at producing a slimmed-down minimal reproducible example either.

In our basic "does it work" tests where we thrash connections and messages, RPyC seems 100% robust, so there's an unusual situation involved somewhere, I think.

I'll do some more testing when I get time and get back to this thread when I have more info. Thanks again.

comrumino commented 1 year ago

@DavidAntliff fair enough. You may want to try the new bind_threads setting if you use the threading module or a threaded server/service. The benchmarks seem to show comparable performance, and it helps with some of RPyC's less intuitive behaviors.
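A hedged sketch of turning it on (I believe the config key is bind_threads in recent 5.x releases; the service, host, and port below are placeholders):

```python
import rpyc
from rpyc.utils.server import ThreadedServer

class EchoService(rpyc.Service):           # placeholder service
    def exposed_echo(self, value):
        return value

# Server side: enable the option via protocol_config.
server = ThreadedServer(EchoService, port=18861,
                        protocol_config={"bind_threads": True})
# server.start()                           # blocks; run it in its own process/thread

# Client side: pass the same option in the connection config.
conn = rpyc.connect("localhost", 18861, config={"bind_threads": True})
```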

comrumino commented 1 year ago

> RPyC seems 100% robust

Things get more sketchy when threading is involved. You may want to look into where __del__ is called on the netref, as that decrements the local object's reference count. I would venture a guess that it is decremented to zero and the object is dropped from the local cache before desired. I do think bind_threads might solve this issue for you. There are also some fixes on master that aren't released yet.
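If you want to confirm that, something like this (a sketch, assuming the rpyc 5.x internals BaseNetref.__del__ and ____id_pack__, which are private and may differ in your version) would log every netref finalization:

```python
import logging
from rpyc.core import netref

log = logging.getLogger("netref.del")
_orig_del = netref.BaseNetref.__del__

def _logged_del(self):
    try:
        # Record which proxy is being finalized (and hence which remote
        # object will have its server-side refcount decremented).
        log.debug("netref __del__ for id pack %r", self.____id_pack__)
    except Exception:
        pass  # never let logging break finalization or interpreter shutdown
    _orig_del(self)

netref.BaseNetref.__del__ = _logged_del
```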

rlehfeld commented 4 months ago

This looks very much like problem number 3 that I reported in #558, and #559 provides a fix. As this is a race with the garbage collector, it is more or less impossible to provide test code for it, but looking at the code reveals that the race exists.

Maybe you can try the fork I'm trying to upstream with #559, to see whether my assumption is correct.