Open puddly opened 5 years ago
That does seem very bizarre. That line “function(self, user_data)” in _wrap_notify
is line 4811 in the current source, not 4725, and the enclosing scope contains no assignment to the function
argument. And the line “done.set_result(self.steal_reply())” is line 4877, not 4790. Are you sure you are using the latest version?
Do the dbussy_examples scripts run OK?
@ldo Installing it with pip install git+https://github.com/ldo/dbussy does change the line number to 4811, but the same errors still occur.
As for the example scripts, signal_listener, stats_server_ravelled, bus_monitor, and a few others work just fine. I wrote a lengthy script to interface with a Bluetooth device over BlueZ's DBus interface and it worked perfectly fine in isolation over hundreds of runs but immediately segfaulted my larger, pure-Python application when introduced. The above test case is the simplest that I could come up with that still reliably segfaulted.
Let me know if you need any more information.
I just ran your example script 100 times in a row on my Debian Unstable system, and it worked fine. I added some debug messages, just to guard against silent aborts, thus:
--- test/puddly_example-prev 2018-12-16 12:17:25.270107857 +1300
+++ test/puddly_example 2018-12-16 12:16:16.531169045 +1300
@@ -19,6 +19,8 @@
     # Any large module works (e.g. aiohttp)
     import numpy
+    print("end main") # debug
 if __name__ == '__main__':
     asyncio.run(main())
+print("end file") # debug
And they all appeared OK.
That's interesting. I at least get an asyncio.base_futures.InvalidStateError: invalid state error, but adding a small delay after the import numpy line does fix it. Could it be a timing issue? I'm not using high-end hardware.
I'm able to replicate it again on an armv6 Raspberry Pi 2 B (also running Arch Linux ARM). I'll let you know in a few days when I narrow it down some more.
I used the following test loop to run your modified script:
for i in $(seq 1 100); do PYTHONPATH=. python3.7 test/puddly_example ; done
and when I stick a “| wc -l” on the end, I count exactly 200 lines of output, as expected.
My main machine is a Core i7-3770, which is about 6 years old now.
Does it work OK without that import line? Is there something odd about your Numpy installation, then?
By the way, as of this moment, the SHA-256 sums for the .py files in this repo are:
ldo@theon:dbussy> sha256sum *.py
ac6a41f9afd595077e244328137735b89eefb3fe6ff856cc318e993c9e56a025 dbussy.py
a17badbc9001c82dd1191184453538e7ce62f9dc61ac46f3be69a5bb7358da78 ravel.py
6203741c25d033b3f501d610ea4d1a46df85693cb44a542704f5d22e2387431d setup.py
Can you check you get the same answers?
Commenting out import numpy still causes a segfault on the Raspberry Pi 2, but causes the script to exit successfully with no console output on the other device. I was originally using both asyncio and numpy (both simply installed via pip install ...), but either of the two individually produces a segfault. I suspect any other large module will work.
As for file integrity, I get the same SHA-256 sums:
$ sha256sum venv_test/lib/python3.7/site-packages/*.py | grep -v easy
ac6a41f9afd595077e244328137735b89eefb3fe6ff856cc318e993c9e56a025 venv_test/lib/python3.7/site-packages/dbussy.py
a17badbc9001c82dd1191184453538e7ce62f9dc61ac46f3be69a5bb7358da78 venv_test/lib/python3.7/site-packages/ravel.py
Hardware-wise, I was originally running the code on an ODROID XU4, but I tested again on a Raspberry Pi 2 Model B with a fresh installation of Python 3.7.1. Both are armv6/7 and running Arch Linux ARM. The Raspberry Pi 2 is infuriatingly slow to use, so I suspect that may have something to do with it. The only feature shared by both physical devices and my VM is slow single-core performance.
Just to test, I spun up a brand-new 512 MB Ubuntu 18.04 virtual machine through Vultr, installed python3.7 git python3-pip python3-wheel python3-setuptools, and then ran python3.7 -m pip install numpy git+https://github.com/ldo/dbussy. The test script produces an asyncio.base_futures.InvalidStateError: invalid state error.
According to the docs, InvalidStateError on a Future.set_result() call means the Future has already had a result set. That Future is created internal to the PendingCall.await_reply() method, and is accessed only by the inner pending_done routine. I have previously found it could be called spuriously more than once before actual completion, which is why I put in the completed check before setting the result. But as far as I know it is never called again after actual completion. Unless it is in this case ...
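(For context, the double-completion failure mode and the completed-style guard described here can be reproduced with plain asyncio Futures, independent of dbussy; this is just a minimal sketch of the mechanism:)

```python
import asyncio

async def main():
    loop = asyncio.get_running_loop()
    done = loop.create_future()
    done.set_result("reply")         # first completion succeeds
    try:
        done.set_result("reply")     # a spurious second completion raises
    except asyncio.InvalidStateError:
        pass
    # the guard pattern: only complete the Future if it is not already done
    if not done.done():
        done.set_result("never reached")
    return done.result()

assert asyncio.run(main()) == "reply"
```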
Is there an issue with doing a heavy import inside a coroutine, perhaps? Would it be better moved outside?
Oddly enough, replacing the import with a time.sleep() call that takes exactly the same amount of time doesn't reliably reproduce the problem for me in the environments where the import is required to make something break.
On the Raspberry Pi 2, no import is necessary. Removing the entire line or even replacing it with await asyncio.sleep(10) produces the following output (the last few lines are from re-running it with gdb):
Task was destroyed but it is pending!
task: <Task pending coro=<def_proxy_interface.<locals>.def_method.<locals>.call_method() done, defined at /home/pi/venv_test/lib/python3.7/site-packages/ravel.py:3154> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x75aa3390>()]>>
Program received signal SIGSEGV, Segmentation fault.
0x76e4952c in _PyEval_EvalCodeWithName () from /usr/lib/libpython3.7m.so.1.0
(gdb) backtrace
#0 0x76e4952c in _PyEval_EvalCodeWithName () from /usr/lib/libpython3.7m.so.1.0
#1 0x76d706b4 in _PyFunction_FastCallDict () from /usr/lib/libpython3.7m.so.1.0
#2 0x7614a628 in ?? () from /home/pi/venv_test/lib/python3.7/lib-dynload/_ctypes.cpython-37m-arm-linux-gnueabihf.so
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
By the way, the docs for asyncio.run() say that it creates a new event loop! Not sure whether it sets it as the default event loop; remember that, if you don’t tell DBussy what event loop to use, it will use whatever is returned from asyncio.get_event_loop(). If this is a different event loop, this will lead to confusion.
The docs also say that asyncio.run() is a provisional API at this stage. Maybe avoid it for now?
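(The loop mismatch described here is easy to demonstrate with a short, dbussy-free snippet; the setup below just stands in for any code that cached the default loop before asyncio.run() was called:)

```python
import asyncio

# explicitly install a default event loop, standing in for code that ran
# before asyncio.run() and cached asyncio.get_event_loop()'s answer
default_loop = asyncio.new_event_loop()
asyncio.set_event_loop(default_loop)

async def which_loop():
    return asyncio.get_running_loop()

# asyncio.run() ignores the previously installed default loop and creates a
# fresh one, so anything attached to default_loop never runs inside it
loop_inside_run = asyncio.run(which_loop())
assert loop_inside_run is not default_loop
default_loop.close()
```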
You're right, it appears that asyncio.create_task actually causes the segfault, not asyncio.run. More strangeness on the armv6 platform:
Replacing asyncio.create_task with asyncio.ensure_future causes the following error to appear in place of a segfault 50% of the time:
Traceback (most recent call last):
File "_ctypes/callbacks.c", line 232, in 'calling callback function'
TypeError: '' object is not callable
Creating an event loop at the top of the script and passing it to asyncio.ensure_future and bus.attach_asyncio causes a segfault to occur every time.
I'm not discounting the possibility of a Python or dbus packaging issue in Arch Linux ARM, so I will check that out later. I was hoping that setting the Future object's result twice might have something to do with this issue...
asyncio.create_task() should work. But beware: you should store the returned Task object somewhere. Otherwise you may get those intermittent “task was destroyed but is pending” errors, because asyncio’s _all_tasks list only keeps weak references to tasks.
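(A minimal illustration of the keep-a-reference rule; the background_tasks set and the worker coroutine are just illustrative names, not part of dbussy:)

```python
import asyncio

background_tasks = set()   # strong references keep tasks from being collected

async def worker():
    await asyncio.sleep(0)
    return "done"

async def main():
    task = asyncio.create_task(worker())
    background_tasks.add(task)                        # hold a strong reference
    task.add_done_callback(background_tasks.discard)  # drop it once finished
    return await task

result = asyncio.run(main())
assert result == "done"
```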
Yeah, worth trying some other distro on that same Raspberry π, maybe.
I think the root cause of the segfault and the other bizarre errors is libdbus never being notified that you're no longer interested in receiving a reply. It still tries to run a callback function that may no longer exist, either because the event loop has shut down or because asyncio.wait_for was used with a timeout.
Properly handling the done future being cancelled fixes the crash and all sporadic asyncio.base_futures.InvalidStateError: invalid state messages for me:
class PendingCall:
    async def await_reply(self):
        ...
        try:
            return await done
        except asyncio.CancelledError:
            self.cancel()
            raise
I'll let it run for a few hours and see if it crashes again.
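(The cancellation-propagation pattern in that fix can be exercised without libdbus at all; MockPendingCall below is a hypothetical stand-in for a pending call, used only to show that cancelling the awaiting task reaches the lower layer before the CancelledError propagates:)

```python
import asyncio

class MockPendingCall:
    "stand-in for a libdbus pending call (hypothetical, for illustration only)"

    def __init__(self, loop):
        self.done = loop.create_future()
        self.lower_layer_cancelled = False

    def cancel(self):
        # tell the lower layer it must not deliver its completion callback
        self.lower_layer_cancelled = True

    async def await_reply(self):
        try:
            return await self.done
        except asyncio.CancelledError:
            self.cancel()   # notify the lower layer before propagating
            raise

async def main():
    call = MockPendingCall(asyncio.get_running_loop())
    waiter = asyncio.ensure_future(call.await_reply())
    await asyncio.sleep(0)   # let the waiter start and block on the Future
    waiter.cancel()          # e.g. what asyncio.wait_for does on timeout
    try:
        await waiter
    except asyncio.CancelledError:
        pass
    return call.lower_layer_cancelled

assert asyncio.run(main()) is True
```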
The only place where the done Future can be cancelled is if you call PendingCall.cancel(). In that situation, I would expect any pending await_reply() call to propagate the exception back to the caller — isn’t that how it works? The actual timeout on the reply is implemented by libdbus itself, not by me, and it is supposed to return an error Message in that case.
By the way, I see a reference circularity between self and the pending_done callback, which I need to fix.
Here is a script to test the timeout mechanism with send_await_reply():
import sys
import asyncio
import getopt
import dbussy
from dbussy import \
    DBUS

timeout = dbussy.DBUSX.DEFAULT_TIMEOUT
opts, args = getopt.getopt \
  (
    sys.argv[1:],
    "",
    ["timeout="]
  )
for keyword, value in opts :
    if keyword == "--timeout" :
        timeout = float(value)
        if timeout < 0 :
            raise getopt.GetoptError("--timeout value must be non-negative")
        #end if
    #end if
#end for
if len(args) != 1 :
    raise getopt.GetoptError("expecting one arg, the limit to count primes up to")
#end if
limit = int(args[0])
if limit < 1 :
    raise getopt.GetoptError("limit arg must be positive")
#end if

loop = asyncio.get_event_loop()

async def mainline() :
    bus = await dbussy.Connection.bus_get_async \
      (
        type = DBUS.BUS_SESSION,
        private = False
      )
    reply = await bus.send_await_reply \
      (
        message =
            dbussy.Message.new_method_call \
              (
                destination = "com.example.slow_server",
                path = "/",
                iface = "com.example.slow_server",
                method = "count_primes"
              ).append_objects("u", limit),
        timeout = timeout
      )
    if reply.type == DBUS.MESSAGE_TYPE_METHOD_RETURN :
        sys.stdout.write("nr primes up to %d = %d\n" % (limit, reply.all_objects[0]))
    elif reply.type == DBUS.MESSAGE_TYPE_ERROR :
        sys.stdout.write \
          (
            "got error reply %s -- %s\n" % (reply.error_name, reply.expect_objects("s")[0])
          )
    else :
        sys.stdout.write("got reply type %d\n" % reply.type) # debug
    #end if
#end mainline

loop.run_until_complete(mainline())
This is meant to run against the slow_dbus_server. For example, on my machine, it can count the primes up to 1 million within the default timeout:
puddly_example_2 1000000
nr primes up to 1000000 = 78498
whereas making the timeout too short produces the expected error return from libdbus:
puddly_example_2 --timeout=0.1 1000000
got error reply org.freedesktop.DBus.Error.NoReply -- Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
Is your code different in some way from this?
Not particularly, no. I get identical outputs on my problem device when using slow_dbus_server and your above two invocations of the script.
My previous fix only "worked" by coincidence, as far as I can understand, since a restart yielded a crash 30 hours later.
I recompiled dbus with debugging symbols and narrowed the segfault down to the complete_pending_call_and_unlock function, specifically where it calls the DBusPendingCall struct's function object. That memory address is written to exactly once, when the DBUS.PendingCallNotifyFunction object is created and assigned to PendingCall._wrap_notify, and since you specifically store a reference to it, it should continue pointing to a valid callback function. Storing a reference to the Python _wrap_notify function "helps" again, but I'm not sure what is causing this problem to begin with.
I'm working on getting a reproducible virtual machine started so you can actually see this problem occur yourself.
Hmmm ... are you keeping a reference to the PendingCall object? Losing that could trigger a dangling-reference problem, I imagine.
Another possibility is that this is an architecture-dependent bug in ctypes. I imagine it has to pull some tricks involving generating code at runtime to deal properly with closures, and this code would have to be quite different on ARM versus x86.
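(Worth noting: the ctypes documentation itself warns that a CFUNCTYPE callback object must stay referenced from Python for as long as C code may invoke it; if it is garbage-collected first, the C side calls into freed memory, matching the segfault symptoms described in this thread. A minimal, non-dbussy demonstration of the rule using libc's qsort; the libc lookup is an assumption about a POSIX system:)

```python
import ctypes
import ctypes.util

# locate the C library; on POSIX, CDLL(None) falls back to the main program,
# which links libc, so qsort is still resolvable
libc = ctypes.CDLL(ctypes.util.find_library("c") or None)

CMPFUNC = ctypes.CFUNCTYPE(ctypes.c_int,
                           ctypes.POINTER(ctypes.c_int),
                           ctypes.POINTER(ctypes.c_int))

def py_cmp(a, b):
    return a[0] - b[0]

cmp_callback = CMPFUNC(py_cmp)   # MUST stay referenced while C may call it

arr = (ctypes.c_int * 5)(5, 1, 4, 2, 3)
libc.qsort(arr, len(arr), ctypes.sizeof(ctypes.c_int), cmp_callback)
assert list(arr) == [1, 2, 3, 4, 5]
```

If cmp_callback were an anonymous temporary (e.g. passing CMPFUNC(py_cmp) inline and letting it be collected before C used it), the call would land in a freed trampoline, which is exactly the kind of intermittent crash being debugged here.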
We've had similar issues in our systems (also armhf), which I've tried debugging with valgrind:
Task exception was never retrieved
future: <Task finished coro=<dbus_connect_ex() done, defined at /usr/lib/python3/dist-packages/plugindbus.py:471> exception=RuntimeError('cannot reuse already awaited coroutine',)>
Traceback (most recent call last):
File "/usr/lib/python3.5/asyncio/tasks.py", line 239, in _step
result = coro.send(None)
RuntimeError: cannot reuse already awaited coroutine
The SegFault is seen when trying to introspect a service that is not present on the system bus (yet). This retry mechanism is performed every 50ms.
Moving down the line, the SegFault occurs here:
self.set_notify(pending_done, weak_ref(self))
  # avoid reference circularity self → pending_done → self
reply = await done
return \
    reply
The plugindbus acts as a wrapper which uses dbussy, written for our testing server.
What helped was adding a small delay here:
await asyncio.sleep(0.01)
reply = await done
Before this change we would get a segfault more or less every 5 minutes; after this change it's still running, 72 hours later.
Maybe it helps someone.
Hi!
I've also got a segmentation fault when running a dbussy-based application. I've used 1.2.0 & after approximately 13 minutes of runtime, the app crashed. I haven't been able to reproduce it, though.
I've bumped to 1.2.1 & now I've got this error:
Traceback (most recent call last):
File "_ctypes/callbacks.c", line 234, in 'calling callback function'
TypeError: _is_wrapper() takes 1 positional argument but 2 were given
I'll debug this more & come back with any insights.
I also had some segmentation faults, but they were all gone as soon as I was careful enough to keep a reference to the tasks I created. The snippet of code in the issue description is not keeping a reference to the task. This also matters when cancelling a task: either wait for it or keep a reference long enough, since the task is only actually cancelled on the next event-loop cycle.
task.cancel()
try:
    await task
except asyncio.CancelledError:
    pass
I had an intermittent segfault on my first call via dbussy (on armhf). It appears to be mitigated by a delay before that call, per https://github.com/ldo/dbussy/issues/15#issuecomment-478924212.
I've run into a strange bug (that may be related to #13) sometimes reproducible by the following test case:
It specifically needs Python 3.7.1 for asyncio.create_task and asyncio.run, and triggers about 50% of the time on my armv7l server. On other platforms only the following occurs (again with Python 3.7.1):
Similar code is part of a much larger application and it doesn't segfault about 30% of the time, but with occasional startup errors like:
And:
I've just somewhat figured out how to use the ravel module from the source code and the examples repository, so is this just me misusing it?