ldo / dbussy

Python binding for D-Bus using asyncio

Strange segfault with ctypes #15

Open puddly opened 5 years ago

puddly commented 5 years ago

I've run into a strange bug (that may be related to #13) sometimes reproducible by the following test case:

import ravel
import asyncio

async def main():
    async def func():
        bus = ravel.system_bus()
        bus.attach_asyncio()

        manager = await bus['org.bluez']['/'].get_async_interface('org.freedesktop.DBus.ObjectManager')
        asyncio.create_task(manager.GetManagedObjects())

    # Not creating func causes only an "asyncio.base_futures.InvalidStateError: invalid state"
    await func()

    # Any large module works (e.g. aiohttp)
    import numpy

if __name__ == '__main__':
    asyncio.run(main())

It specifically needs Python 3.7.1 for asyncio.create_task and asyncio.run, and triggers about 50% of the time on my armv7l server:

Program received signal SIGSEGV, Segmentation fault.
0xb6d45b4c in _PyEval_EvalFrameDefault () from /usr/lib/libpython3.7m.so.1.0
(gdb) bt
#0  0xb6d45b4c in _PyEval_EvalFrameDefault () from /usr/lib/libpython3.7m.so.1.0
#1  0xb6e43ab8 in _PyEval_EvalCodeWithName () from /usr/lib/libpython3.7m.so.1.0
#2  0xb6d6c470 in _PyFunction_FastCallDict () from /usr/lib/libpython3.7m.so.1.0
#3  0xb5edb62c in ?? () from /home/test/venv3/lib/python3.7/lib-dynload/_ctypes.cpython-37m-arm-linux-gnueabihf.so
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

On other platforms only the following occurs (again with Python 3.7.1):

Traceback (most recent call last):
  File "_ctypes/callbacks.c", line 232, in 'calling callback function'
  File "/home/test/venv3/lib/python3.7/site-packages/dbussy.py", line 4725, in _wrap_notify
    function(self, user_data)
  File "/home/test/venv3/lib/python3.7/site-packages/dbussy.py", line 4790, in pending_done
    done.set_result(self.steal_reply())
asyncio.base_futures.InvalidStateError: invalid state

Similar code is part of a much larger application; it avoids the segfault about 30% of the time, but then shows occasional startup errors like:

Traceback (most recent call last):
  File "_ctypes/callbacks.c", line 232, in 'calling callback function'
  File "/home/test/venv3/lib/python3.7/site-packages/dbussy.py", line 4725, in _wrap_notify
    function(self, user_data)
NameError: free variable 'function' referenced before assignment in enclosing scope

And:

Traceback (most recent call last):
  File "_ctypes/callbacks.c", line 232, in 'calling callback function'
TypeError: _seg_22() takes 0 positional arguments but 2 were given

I've only just figured out how to use the ravel module from the source code and the examples repository, so is this just me misusing it?

ldo commented 5 years ago

That does seem very bizarre. That line “function(self, user_data)” in _wrap_notify is line 4811 in the current source, not 4725, and the enclosing scope contains no assignment to the function argument. And the line “done.set_result(self.steal_reply())” is line 4877, not 4790. Are you sure you are using the latest version?

Do the dbussy_examples scripts run OK?

puddly commented 5 years ago

@ldo Installing it with pip install git+https://github.com/ldo/dbussy does change the line number to 4811 but the same errors still occur.

As for the example scripts, signal_listener, stats_server_ravelled, bus_monitor, and a few others work just fine. I wrote a lengthy script to interface with a Bluetooth device over BlueZ's DBus interface and it worked perfectly fine in isolation over hundreds of runs but immediately segfaulted my larger, pure-Python application when introduced. The above test case is the simplest that I could come up with that still reliably segfaulted.

Let me know if you need any more information.

ldo commented 5 years ago

I just ran your example script 100 times in a row on my Debian Unstable system, and it worked fine. I added some debug messages, just to guard against silent aborts, thus:

--- test/puddly_example-prev    2018-12-16 12:17:25.270107857 +1300
+++ test/puddly_example 2018-12-16 12:16:16.531169045 +1300
@@ -19,6 +19,8 @@

     # Any large module works (e.g. aiohttp)
     import numpy
+    print("end main") # debug

 if __name__ == '__main__':
     asyncio.run(main())
+print("end file") # debug

And they all appeared OK.

puddly commented 5 years ago

That's interesting. I still get at least an asyncio.base_futures.InvalidStateError: invalid state error, but adding a small delay after the import numpy line does fix it. Could it be a timing issue? I'm not using high-end hardware.

I'm able to replicate it again on an armv6 Raspberry Pi 2 B (also running Arch Linux ARM). I'll let you know in a few days when I narrow it down some more.

ldo commented 5 years ago

I used the following test loop to run your modified script:

for i in $(seq 1 100); do PYTHONPATH=. python3.7 test/puddly_example ; done

and when I stick a “| wc -l” on the end, I count exactly 200 lines of output, as expected.

My main machine is a Core i7-3770, which is about 6 years old now.

Does it work OK without that import line? Is there something odd about your Numpy installation, then?

ldo commented 5 years ago

By the way, as of this moment, the SHA-256 sums for the .py files in this repo are:

ldo@theon:dbussy> sha256sum *.py
ac6a41f9afd595077e244328137735b89eefb3fe6ff856cc318e993c9e56a025  dbussy.py
a17badbc9001c82dd1191184453538e7ce62f9dc61ac46f3be69a5bb7358da78  ravel.py
6203741c25d033b3f501d610ea4d1a46df85693cb44a542704f5d22e2387431d  setup.py

Can you check you get the same answers?

puddly commented 5 years ago

Commenting out import numpy still causes a segfault on the Raspberry Pi 2, but causes the script to exit successfully with no console output on the other device. I was originally using both aiohttp and numpy (both simply installed via pip install ...), but either one of the two individually produces a segfault. I suspect any other large module will work.

As for file integrity, I get the same SHA-256 sums:

$ sha256sum venv_test/lib/python3.7/site-packages/*.py | grep -v easy
ac6a41f9afd595077e244328137735b89eefb3fe6ff856cc318e993c9e56a025  venv_test/lib/python3.7/site-packages/dbussy.py
a17badbc9001c82dd1191184453538e7ce62f9dc61ac46f3be69a5bb7358da78  venv_test/lib/python3.7/site-packages/ravel.py

Hardware-wise, I was originally running the code on an ODROID XU4, but I tested again on a Raspberry Pi 2 Model B with a fresh installation of Python 3.7.1. Both are armv6/7 and running Arch Linux ARM. The Raspberry Pi 2 is infuriatingly slow to use, so I suspect that may have something to do with it. The only feature shared by both physical devices and my VM is slow single-core performance.

Just to test, I spun up a brand new 512MB Ubuntu 18.04 virtual machine through Vultr, installed python3.7 git python3-pip python3-wheel python3-setuptools, and then ran python3.7 -m pip install numpy git+https://github.com/ldo/dbussy. The test script produces an asyncio.base_futures.InvalidStateError: invalid state error.

ldo commented 5 years ago

According to the docs, InvalidStateError on a Future.set_result() call means the Future has already had a result set. That Future is created internal to the PendingCall.await_reply() method, and is accessed only by the inner pending_done routine. I have previously found it could be called spuriously more than once before actual completion, which is why I put in the completed check before setting the result. But as far as I know it is never called again after actual completion. Unless it is in this case ...
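The guard described here can be sketched in isolation (illustrative names only, not dbussy's actual code): check whether the Future is already completed before setting a result, so a spurious repeat invocation of the done-callback becomes a no-op instead of raising InvalidStateError.

```python
import asyncio

# Minimal sketch (hypothetical `pending_done`, not dbussy's real code):
# guard set_result() so a spurious second invocation of the callback
# does not raise InvalidStateError on an already-completed Future.
def pending_done(done, reply):
    if not done.done():  # the "completed check" mentioned above
        done.set_result(reply)
    # otherwise: the Future already has a result (or was cancelled); ignore
```
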

Is there an issue with doing a heavy import inside a coroutine, perhaps? Would it be better moved outside?

puddly commented 5 years ago

Oddly enough, replacing the import with a time.sleep() call that takes exactly the same amount of time doesn't reliably reproduce the problem for me in the environments where the import is required to make something break.

On the Raspberry Pi 2, no import is necessary. Removing the entire line or even replacing it with await asyncio.sleep(10) produces the following output (the last few lines are from re-running it with gdb):

Task was destroyed but it is pending!
task: <Task pending coro=<def_proxy_interface.<locals>.def_method.<locals>.call_method() done, defined at /home/pi/venv_test/lib/python3.7/site-packages/ravel.py:3154> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x75aa3390>()]>>

Program received signal SIGSEGV, Segmentation fault.
0x76e4952c in _PyEval_EvalCodeWithName () from /usr/lib/libpython3.7m.so.1.0
(gdb) backtrace
#0  0x76e4952c in _PyEval_EvalCodeWithName () from /usr/lib/libpython3.7m.so.1.0
#1  0x76d706b4 in _PyFunction_FastCallDict () from /usr/lib/libpython3.7m.so.1.0
#2  0x7614a628 in ?? () from /home/pi/venv_test/lib/python3.7/lib-dynload/_ctypes.cpython-37m-arm-linux-gnueabihf.so
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

ldo commented 5 years ago

By the way, the docs for asyncio.run() say that it creates a new event loop! Not sure whether it sets it as the default event loop; remember that, if you don’t tell DBussy what event loop to use, it will use whatever is returned from asyncio.get_event_loop(). If this is a different event loop, this will lead to confusion.

The docs also say that asyncio.run() is a provisional API at this stage. Maybe avoid it for now?
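The point about asyncio.run() can be demonstrated directly (the variable names here are illustrative): the loop it runs a coroutine on is a brand-new object, distinct from any loop obtained earlier, which is exactly the mismatch that would confuse a library defaulting to asyncio.get_event_loop().

```python
import asyncio

# Demonstrates that asyncio.run() creates a brand-new event loop, so a
# loop obtained earlier (as DBussy would via asyncio.get_event_loop()
# when none is passed) is a different object from the running loop.
earlier_loop = asyncio.new_event_loop()  # stand-in for a loop grabbed up front

async def report_running_loop():
    return asyncio.get_running_loop()

loop_used_by_run = asyncio.run(report_running_loop())
assert loop_used_by_run is not earlier_loop  # two distinct loops
earlier_loop.close()
```
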

puddly commented 5 years ago

You're right, it appears that asyncio.create_task actually causes the segfault, not asyncio.run. More strangeness on the armv6 platform:

I'm not discounting the possibility of a Python or dbus packaging issue in Arch Linux ARM, so I will check that out later. I was wondering whether setting the Future object's result twice might have something to do with this issue...

ldo commented 5 years ago

asyncio.create_task() should work. But beware: you should store the returned Task object somewhere. Otherwise you may get those intermittent “task was destroyed but is pending” errors, because asyncio’s _all_tasks list only keeps weak references to tasks.
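The advice about storing the returned Task is commonly implemented with a module-level set plus a done-callback (a general asyncio pattern, not anything specific to dbussy; `spawn` and `background_tasks` are illustrative names):

```python
import asyncio

# Keep a strong reference to each created Task so it cannot be
# garbage-collected while still pending (asyncio itself only holds
# weak references to running tasks).
background_tasks = set()

def spawn(coro):
    task = asyncio.ensure_future(coro)
    background_tasks.add(task)                        # strong reference
    task.add_done_callback(background_tasks.discard)  # released when done
    return task
```
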

Yeah, worth trying some other distro on that same Raspberry π, maybe.

puddly commented 5 years ago

I think the root cause of the segfault and the other bizarre errors is libdbus never being notified that you're no longer interested in receiving a reply. It still tries to run a callback function that may no longer exist either because the event loop has shut down or because asyncio.wait_for was used with a timeout.

Properly handling the done future being canceled fixes the crash and all sporadic asyncio.base_futures.InvalidStateError: invalid state messages for me:

class PendingCall:
    async def await_reply(self):
        ...

        try:
            return await done
        except asyncio.CancelledError:
            self.cancel()
            raise
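The failure mode described above can be reproduced with pure asyncio (hypothetical stand-ins only, not dbussy's real internals): asyncio.wait_for() cancels the inner await on timeout, and the except-CancelledError branch is the only chance to tell the C side that the reply is no longer wanted.

```python
import asyncio

# Hypothetical sketch: wait_for() cancels the inner await on timeout;
# the CancelledError handler is where the real code would call
# PendingCall.cancel() to tell libdbus to drop its callback.
async def await_reply_sketch():
    try:
        await asyncio.sleep(10)  # stand-in for `await done`
    except asyncio.CancelledError:
        # here the real fix calls self.cancel() before re-raising
        raise

async def caller():
    try:
        await asyncio.wait_for(await_reply_sketch(), timeout=0.01)
    except asyncio.TimeoutError:
        return "timed out"
```
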

I'll let it run for a few hours and see if it crashes again.

ldo commented 5 years ago

The only place where the done Future can be cancelled is if you call PendingCall.cancel(). In that situation, I would expect any pending await_reply() call to propagate the exception back to the caller — isn’t that how it works? The actual timeout on the reply is implemented by libdbus itself, not by me, and it is supposed to return an error Message in that case.

By the way, I see a reference circularity between self and the pending_done callback, which I need to fix.

ldo commented 5 years ago

Here is a script to test the timeout mechanism with send_await_reply():

import sys
import asyncio
import getopt
import dbussy
from dbussy import \
    DBUS

timeout = dbussy.DBUSX.DEFAULT_TIMEOUT
opts, args = getopt.getopt \
  (
    sys.argv[1:],
    "",
    ["timeout="]
  )
for keyword, value in opts :
    if keyword == "--timeout" :
        timeout = float(value)
        if timeout < 0 :
            raise getopt.GetoptError("--timeout value must be non-negative")
        #end if
    #end if
#end for
if len(args) != 1 :
    raise getopt.GetoptError("expecting one arg, the limit to count primes up to")
#end if
limit = int(args[0])
if limit < 1 :
    raise getopt.GetoptError("limit arg must be positive")
#end if

loop = asyncio.get_event_loop()

async def mainline() :
    bus = await dbussy.Connection.bus_get_async \
      (
        type = DBUS.BUS_SESSION,
        private = False
      )
    reply = await bus.send_await_reply \
      (
        message =
            dbussy.Message.new_method_call \
              (
                destination = "com.example.slow_server",
                path = "/",
                iface = "com.example.slow_server",
                method = "count_primes"
              ).append_objects("u", limit),
        timeout = timeout
      )
    if reply.type == DBUS.MESSAGE_TYPE_METHOD_RETURN :
        sys.stdout.write("nr primes up to %d = %d\n" % (limit, reply.all_objects[0]))
    elif reply.type == DBUS.MESSAGE_TYPE_ERROR :
        sys.stdout.write \
          (
            "got error reply %s -- %s\n" % (reply.error_name, reply.expect_objects("s")[0])
          )
    else :
        sys.stdout.write("got reply type %d\n" % reply.type) # debug
    #end if
#end mainline

loop.run_until_complete(mainline())

This is meant to run against the slow_dbus_server. For example, on my machine, it can count the primes up to 1 million within the default timeout:

puddly_example_2 1000000
nr primes up to 1000000 = 78498

whereas making the timeout too short produces the expected error return from libdbus:

puddly_example_2 --timeout=0.1 1000000
got error reply org.freedesktop.DBus.Error.NoReply -- Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.

Is your code different in some way from this?

puddly commented 5 years ago

Not particularly, no. I get identical outputs on my problem device when using slow_dbus_server and your above two invocations of the script.

My previous fix only "worked" by coincidence, as far as I can tell, since a restart yielded a crash 30 hours later.

I recompiled dbus with debugging symbols and narrowed the segfault down to the complete_pending_call_and_unlock function, specifically where it calls the DBusPendingCall struct's function object. That memory address is written to exactly once when the DBUS.PendingCallNotifyFunction object is created and assigned to PendingCall._wrap_notify, and since you specifically store a reference to it, it should continue pointing to a valid callback function. Storing a reference to the Python _wrap_notify function "helps" again, but I'm not sure what is causing this problem to begin with.
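The ctypes hazard being probed here can be shown in isolation (illustrative names only, not dbussy's code): a CFUNCTYPE wrapper stays valid only while some Python object references it, so pinning it on `self`, as dbussy does with _wrap_notify, is the standard defense against C code calling through a freed thunk.

```python
import ctypes

# Illustration: the CFUNCTYPE thunk must be kept alive from Python for
# as long as C code may invoke the pointer; storing it on an attribute
# ties its lifetime to the owning object.
NOTIFY_FUNC = ctypes.CFUNCTYPE(None, ctypes.c_void_p, ctypes.c_void_p)

class PendingCallSketch:
    def __init__(self, func):
        # the attribute reference keeps the thunk alive for our lifetime
        self._wrap_notify = NOTIFY_FUNC(func)

    def fire(self):
        # stand-in for libdbus invoking the stored function pointer
        self._wrap_notify(None, None)
```
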

I'm working on getting a reproducible virtual machine started so you can actually see this problem occur yourself.

ldo commented 5 years ago

Hmmm ... are you keeping a reference to the PendingCall object? Losing that could trigger a dangling-reference problem, I imagine.

Another possibility is that this is an architecture-dependent bug in ctypes. I imagine it has to pull some tricks involving generating code at runtime to deal properly with closures, and this code would have to be quite different on ARM versus x86.

alex-crisan commented 5 years ago

We've had similar issues in our systems (also armhf) which I've tried debugging with valgrind:

Task exception was never retrieved
future: <Task finished coro=<dbus_connect_ex() done, defined at /usr/lib/python3/dist-packages/plugindbus.py:471> exception=RuntimeError('cannot reuse already awaited coroutine',)>
Traceback (most recent call last):
  File "/usr/lib/python3.5/asyncio/tasks.py", line 239, in _step
    result = coro.send(None)
RuntimeError: cannot reuse already awaited coroutine

The segfault is seen when trying to introspect a service that is not present on the system bus (yet). This retry mechanism runs every 50 ms.

Moving down the line, the segfault occurs here:

        self.set_notify(pending_done, weak_ref(self))
          # avoid reference circularity self → pending_done → self
        reply = await done
        return \
            reply

The plugindbus acts as a wrapper around dbussy, written for our testing server.

What helped was adding a small delay here:

        await asyncio.sleep(0.01)
        reply = await done

Before this change we would get a segfault more or less every 5 mins, after this change it's still running, 72 hrs later.

Maybe it helps someone.

lukaszdudek-silvair commented 5 years ago

Hi!

I've also got a segmentation fault when running a dbussy-based application. I've used 1.2.0 & after approximately 13 minutes of runtime, the app crashed. I haven't been able to reproduce it, though.

I've bumped to 1.2.1 & now I've got this error:

Traceback (most recent call last):
  File "_ctypes/callbacks.c", line 234, in 'calling callback function'
TypeError: _is_wrapper() takes 1 positional argument but 2 were given

I'll debug this more & come back with any insights.

vincentbernat commented 3 years ago

I also had some segmentation faults, but they were all gone as soon as I was careful enough to keep a reference to the tasks I created. The snippet of code in the issue description is not keeping a ref to the task. The same applies when cancelling a task: either wait for it or keep a reference long enough, since the task will only actually be cancelled on the next event loop cycle.

task.cancel()
try:
    await task
except asyncio.CancelledError:
    pass

davesteele commented 3 years ago

I had an intermittent segfault on my first call via dbussy (on armhf). It appears to be mitigated by a delay before that call, per https://github.com/ldo/dbussy/issues/15#issuecomment-478924212.