Open DavidAntliff opened 1 year ago
This is difficult to triage without a test case to reproduce locally. But, here are my first thoughts on the topic
('subprocess.CompletedProcess', 187650931238928, 281473569049664)
is the id pack for an initialized class of subprocess.CompletedProcess
. See rpyc.lib.__init__.get_id_pack
. This basically is how we uniquely identify objects w/ RPyCThanks for getting back to me. I will try with 5.3.1 as time allows.
This does seem like quite a rare thing - it only occurs within our fairly intensive and long-running integration test system, and rarely at that at an individual test level, but we see it often since we run so many tests. We've discovered that skipping specific tests seems to ensure it does not occur, so it seems to be related to certain tests and therefore perhaps certain kinds of workloads. It would be impractical to share this code as it's company-owned, I'm afraid, and I haven't been successful with a slimmed down minimum reproducible example either.
In our basic "does it work" tests where we thrash connections and messages, RPyC seems 100% robust, so there's an unusual situation involved somewhere, I think.
I'll do some more testing when I get time and get back to this thread when I have more info. Thanks again.
@DavidAntliff fair enough. You may want to try the new bind_threads
setting if you all use the threading module or service. The benchmarks seem to show comparable performance, and it helps with some of RPyC's less intuitive behaviors.
RPyC seems 100% robust
Things get more sketchy when threading is involved. You may want to look into where __del__
is called on the netref as it would decrement the local object ref count. I would venture a guess that it is decremented to zero and dropped from local cache before desired. I do think bind_threads might solve this issue for you. There are some fixes on master that aren't released yet as well.
I looks very much like Problem Number 3 I reported in #558 and in #559 I provide a fix. As this is a races with the garbage collector, it is more or less impossible to provide test code for this but looking at the code reveals that this is a race.
Maybe you can try my fork that I try to upstream with #559, if I am correct with my assumption.
I'm looking for some help debugging this, please.
I'm using RPyC 5.3.0 in a Ubuntu 22.04 Docker container, with Python 3.9, and rarely (but too frequently) I see the following occur, when trying to read a value from a remote
memoryview
(the variablemv
here, of type<netref class 'rpyc.core.netref.type'>
). This is running under pytest:My code performs many, many such read/write calls over a netref'd memoryview, and it's always one particular read that fails, if it fails at all. The access offset is correct.
I'm trying to gather clues so I have a few questions, please:
int
from the memoryview (return mv[offset]
), but the server throws this exception instead. What does that indicate went wrong? What could be happening to cause this?subprocess.CompletedProcess
coming from an internal, RPyC mechanism, or is that from my own code (or perhaps third-party code I call)? There is nothing to do withsubprocess
in any of my code to do with the memoryview, but there are calls to subprocess elsewhere in my program (on the server side).Recently, I was using RPyC 5.1.0 instead, and I was seeing something similar - the same server exception was being raised (as a debug-level log entry), but the client was oblivious to this and did not see an exception. The correct value would be returned to the client. But occasionally when this happens, the client would get a
None
value back fromreturn mv[offset]
, which is, of course, incorrect, and caused downstream code to fail. Updating to 5.3.0 seems to result in the exception propagating to the client, rather than returningNone
.EDIT: I've put a try/except/breakpoint around line 322 in
_unbox
, to catch the error whenlabel == consts.LABEL_LOCAL_REF
- and inspectedself._local_objects
. It's very large (589 entries) and has six that havesubprocess.CompletedProcess
in the key. All six of these have a match with the first number in the key's tuple, but none of them match with the second number (and the number in the key is significantly larger than any of the numbers in the_local_objects
collection).So it looks like this
subprocess.CompletedProcess
value is from my own code, but there seems to be a missing entry when RPyC unboxes it. What could lead to this happening? It seems to occur slightly more often if I load the server's CPU up to 100% (to slow it down a bit).