DbusNoReplyError: Connection terminated and Exception in callback SdBus.drive()

bernhardkaindl commented 2 years ago

Hi, am working on a daemon which is running on a remote device.

It uses sdbus_async.networkmanager to connect to the NetworkManager of that device.

The Networkmanager running on it is NetworkManager-1.28 (The root filesystem of that device is built by an embedded build system and it would take some effort to rebuild it)
I use sdbus 0.8.5 and 0.9.0 (both show the problem)

The behavior I see is the following:

When the daemon, after connecting to the NetworkManager, exectues sdbus-calls to it immediately, the calls succeed.

When the damon instead waits for more than a few seconds before sending a request, this happens:

sdbus.dbus_exceptions.DbusNoReplyError: Connection terminated

When the daemon waits even longer, this happens in a endless loop:

Exception in callback SdBus.drive()  base_events.py:1707
handle: <Handle SdBus.drive()>                                                                                
Traceback (most recent call last):                                                                            
   File "asyncio/events.py", line 81, in _run                                                                  
sd_bus_internals.SdBusLibraryError: Line: 270. return_value in function SdBus_drive                           
returned error: 110

Is there a way to recover from these conditions?

I saw a similar error in https://github.com/python-sdbus/python-sdbus/issues/1 - Maybe it helps to find the issue.

igo95862 commented 2 years ago

Are you using blocking or asyncio?

bernhardkaindl commented 2 years ago

asyncio

igo95862 commented 2 years ago

instead waits for more than a few seconds

What do you use to implement that?

igo95862 commented 2 years ago

If you use time.sleep or any other blocking function the D-Bus processing will stop and your connection will reset because the D-Bus daemon will think that you are a dead process.

You want to use asyncio.sleep in this case.

Also if you launch python with the -X dev argument it will show if any function blocked for too long.

bernhardkaindl commented 2 years ago

Many Thanks, the daemon is a responder to RPC calls.

It uses an asyncio-based framework and I added no blocking waits: https://pjrpc.readthedocs.io/en/stable/
On my development machine, this does not show,

I guess I might be a version or configuration difference on the embedded device.

When starting up, it does this:

    sdbus.set_default_bus(sdbus.sd_bus_open_system())
    network_manager = NetworkManager()

Then the server waits on a RabbitMQ queue (impemented using aio-pika) to read RPC requests.

Later, when a RPC request to return the states of devices comes in, this gets called (thru the RPC method below):

    async def get_device_states(**kwargs: Any) -> DeviceStates:
        """Closure using network_manager for returning connection states"""
        dev_states: DeviceStates = []
        kwargs = kwargs.get("kwargs", kwargs)  # workaround kwargs["kwargs"]
        device_paths = await network_manager.get_devices()
        for device_path in device_paths:
            if not await (device := Device(device_path)).is_hw():
                continue
            dev_states.append(await get_device_jsondict(device, **kwargs))
        return dev_states

    @server.registry.add  # type: ignore
    async def device_states(**kwargs: Any) -> DeviceStates:
        """RPC method, calls connectionstates() to return connection statues"""
        connection_state_list = await get_device_states()
        await trigger_connectivity_check()
        server.logger.info(server.pformat(connection_state_list))
        return connection_state_list

No problem was shown during development on my notebook using Ubuntu 21.04. dbus-daemon versions: Ubuntu 21.04(development machine) 1.12.20 Embedded Device: 1.12.16 Is there a dbus-daemon build or config option which might affect this in different ways?

igo95862 commented 2 years ago

Is there some special python version on embedded device?

bernhardkaindl commented 2 years ago

The daemon runs inside a docker container based on Ubuntu 20.04-armhf

I run it also directly outside of the container, it is a pyinstaller archive, so it runs as long as the glibc is new enlough, the issue happens in both cases, inside of the container and outside, directly on the host os:

The python running the daemon is packaged with the daemon code using pyinstaller --onefile inside a Ubuntu 20.04 armhf container created for building the pyinstaller archive. The python used to run the pyinstaller which is packaged in it is the python of 20.04-armhf

@armhf-focal:~ $ python3 -V
Python 3.8.10

igo95862 commented 2 years ago

I would try running python with -X dev flag. (or at least set the event loop to the debug mode)

What is weird is that error code 110 is ETIMEDOUT on my machine but the sdbus manual does not say that ETIMEDOUT can be returned by sd_bus_process.

However, it is possible that error codes differ on different architectures.

igo95862 commented 2 years ago

I would also try to get logs from the D-Bus daemon. (maybe it can reveal something)

igo95862 commented 2 years ago

ETIMEDOUT might be returned from the socket itself. Not sure. Definitely looks like something is wrong with connection. Do the blocking methods have same issues?

bernhardkaindl commented 2 years ago

Could I extend python-sdbus so I can call the sdbus call sd_bus_close to close the dbus connection?

I'd then close the connection after I've finished the batch of dbus calls to the networkmanager and reopen a new one when I going to send new requests? Thisng

igo95862 commented 2 years ago

Definitely possible.

It should already close if all the references to the bus had dropped. The issue is that all created interfaces keep the references to the bus. Also if you set the default bus the reference is kept around.

igo95862 commented 2 years ago

By the way there is an option to connect over SSH in sdbus.

igo95862 commented 2 years ago

I added SdBus.close(): https://github.com/python-sdbus/python-sdbus/commit/4b72d0f70bb0d1b5b174025a99ad1c8fa71bf473

I have plans to expose entire sdbus API as low-level API but only after high level API is finished.

bernhardkaindl commented 2 years ago

thanks! I use sdbus.set_default_bus() (with sdbus.sd_bus_open_system()):

So I also need to clear the default bus to call sdbus.set_default_bus(sdbus.sd_bus_open_system() again. Looking for it, but in case you are faster, it would be great!

igo95862 commented 2 years ago

I am thinking that sd_bus_start() can reconnect the existing bus. I will add that call.

igo95862 commented 2 years ago

I added it: https://github.com/python-sdbus/python-sdbus/commit/86f87e4367d870314765fb71f77d59b1ba4db22d

I am not entirely sure how it will work. It might need sd_bus_set_address calls before but those values might be set by sd_bus_open_system before hand.

igo95862 commented 2 years ago

thanks! I use sdbus.set_default_bus() (with sdbus.sd_bus_open_system()):

So I also need to clear the default bus to call sdbus.set_default_bus(sdbus.sd_bus_open_system() again. Looking for it, but in case you are faster, it would be great!

You would need to recreate all interface objects after switching the default bus. Currently the bus of the interface is just kept as an attribute. (might be able to manually change it, it is located at _attached_bus)

bernhardkaindl commented 2 years ago

@igo95862 Many thanks for all the great tips and the commits! Here are the resulting insights:

Getting logs from the D-Bus daemon reveals the cause:

dbus-daemon: [system] Connection has not authenticated soon enough, closing it (auth_timeout=30000ms, elapsed: 30003ms)

The 30s are hard-coded: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=818362#10 The same messager were caused by an agent using sdbus also this March: https://community.dynatrace.com/t5/Dynatrace-Open-Q-A/WARNING-OneAgent-1-232-1-233-can-cause-a-slow-memory-leak-on/m-p/182536 "issue is caused by gathering data via sdbus in OneAgent. Hotfix is already available" This http://doc.mawan.de/dbus/README.Debian says that the auth_timeout was an (unsuccessful) counter-measure against to local denial-of-service attacks thru the dbus-daemon by opening lots of connections to it.

Apparently, newer dbus-daemons implemented a bitter fix where it looks which user-id tries to connect to the daemon (from the 1st link):

In general we can't tell whose a connection is until it has authenticated,
but on Linux with the default system bus configuration we can, so in newer
upstream versions we might be able to mitigate this sort of thing by
making uid 0 immune to these limits. Would that solve this for you?

After adding an initial dbus-networkmanager query (in another async task, in fact I anyway needed add do periodic calls), I now alway have one initial dbus-call to NM and I repeat it every 60 seconds (I call NM.check_connectivity()), the timeout disappeared.

Observations from trying your commits:

Whenever SdBus.close() was used, reading an asnyc property from NetworkDeviceGeneric always hung the task, despit my efforts to open a new bus or restart it, even without having set_default_bus() in the program.
sd_bus_start() always threw an exception: According to the manual it can only be used once on the connection.

As I need the periodic task anyway, I am keeping it and use this as solution to prevent getting disconnected due to no authenticatin for 30 seconds.

bernhardkaindl commented 2 years ago

As far as I can tell, the start call for sd_bus_start() can be safely reverted and the bus.close() caused a hang in the await of instance of (NetworkDeviceGeneric) always, so you can backout/undo these commits.

Many Thanks and best Regards! Bernhard

igo95862 commented 2 years ago

Hmm. Any idea why it only happens on embedded device? Is it because of docker container?

There are some calls related to negotiation in sdbus. Maybe they are required?

Whenever SdBus.close() was used, reading an asnyc property from NetworkDeviceGeneric always hung the task, despit my efforts to open a new bus or restart it, even without having set_default_bus() in the program.

Is it just properties that hang up? What about methods or signals? Could be that the cancellation is not implemented properly for properties.

As far as I can tell, the start call for sd_bus_start() can be safely reverted and the bus.close() caused a hang in the await of instance of (NetworkDeviceGeneric) always, so you can backout/undo these commits.

I was planing on providing the entire sdbus API anyway. Maybe someone will use these function.

bernhardkaindl commented 2 years ago

Hmm. Any idea why it only happens on embedded device? Is it because of docker container?

No, the dbus-daemon does not run in a container, but it's the old dbus-daemon 1.12.16, which might still have the useless 30 seconds auth_timeout despite as root, it can check the user and do a far better rate-limiting based on the user id of the process.

There are some calls related to negotiation in sdbus. Maybe they are required?

I have no time right now, but here is the traceback:

Exception in callback SdBus.drive()                                                                             base_events.py:1707
handle: <Handle SdBus.drive()>                                                                                                     
Traceback (most recent call last):                                                                                                 
  File "asyncio/events.py", line 81, in _run                                                                                       
sd_bus_internals.SdBusLibraryError: Line: 270. return_value in function SdBus_drive returned error: 110                            
Exception in callback SdBus.drive()

It is printed in a loop, sometimes very fast, even consuming more than 20% of CPU.

What is really needed is to raise some higher-level exception so that the process can choose to terminate itself in order.

Whenever SdBus.close() was used, reading an asnyc property from NetworkDeviceGeneric always hung the task, despit my efforts to open a new bus or restart it, even without having set_default_bus() in the program.

Is it just properties that hang up? What about methods or signals? Could be that the cancellation is not implemented properly for properties.

Probably, the call to NetworkDeviceGeneric(dbus_path) succeeds, but that does not await and does apparently not talk to D-Bus.

The most important thing would be to raise the timeout error code to the application level, so it can terminate (it fact, it should unless it a recovery is tested) should it ever occur again.

igo95862 commented 2 years ago

I will look in to better terminating connections and propagating the errors.

Probably, the call to NetworkDeviceGeneric(dbus_path) succeeds, but that does not await and does apparently not talk to D-Bus.

Yes. It only initiates the python class without talking to D-Bus.

The most important thing would be to raise the timeout error code to the application level, so it can terminate (it fact, it should unless it a recovery is tested) should it ever occur again.

The current code around the event loop and polling connection was basically a minimum viable product.

igo95862 commented 2 years ago

I added test for closing the bus while waiting on method but it seems to work correctly straight away: https://github.com/python-sdbus/python-sdbus/commit/a27b8dd45810b43d831e2575dcda40879c7ce6f2

igo95862 commented 2 years ago

I pushed commit https://github.com/python-sdbus/python-sdbus/commit/9fea8cc2170ba853a9c3d809d884063f86782d62 to improve error handling if it occurred during processing sdbus.

Previously the 20% CPU usage was probably caused by file descriptor watcher not being removed if any error raised by sd_bus_process was not ECONNRESET. Now any error will cause the reader to be removed.

Now the bus is pretty much unusable if an error occurred in processing because the reader is going to be unregistered but it is not going to be re-registrered. I am not sure if this is the correct behavior but given that sd_bus_start cannot be called to restart the bus I believe it is.

The question for me is should the sd_bus_close be called to signal any methods waiting to be cancelled. I tried the method of killing dbus daemon but it also seems to be cancelling methods correctly.

igo95862 commented 2 years ago

@bernhardkaindl Any updates on the issue? I wonder how will my latest fixes for sdbus interact with it.

bernhardkaindl commented 2 years ago

Works great! When I let the server run into the timeout to get disconnected, I get:

method unhandled exception DbusNoReplyError('Connection terminated')
Traceback:
...
    ... = await network_manager.get_devices()                                                                                                      
  File ".local/lib/python3.9/site-packages/sdbus/dbus_proxy_async.py", line 113, in _call_dbus_async                                          
    reply_message = await interface._attached_bus.call_async(                                                                                               
sdbus.dbus_exceptions.DbusNoReplyError: Connection terminated

Thanks!

python-sdbus / python-sdbus-networkmanager

DbusNoReplyError: Connection terminated and Exception in callback SdBus.drive() #17