Are you using blocking or asyncio?
asyncio
> When the daemon instead waits for more than a few seconds before sending a request, this happens:
What do you use to implement that? If you use `time.sleep` or any other blocking function, the D-Bus processing will stop and your connection will reset because the D-Bus daemon will think that you are a dead process. You want to use `asyncio.sleep` in this case.
Also, if you launch python with the `-X dev` argument, it will show if any function blocked for too long.
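For illustration, a minimal sketch (the periodic task and the 5-second interval are made up, not taken from your daemon):

```python
import asyncio


async def periodic_work() -> None:
    """Hypothetical background task inside the daemon."""
    while True:
        # time.sleep(5) here would block the whole event loop, so sdbus could not
        # service the D-Bus socket and dbus-daemon would consider the process dead.
        await asyncio.sleep(5)  # yields control; D-Bus messages keep being processed


asyncio.run(periodic_work())
```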
Many thanks, the daemon is a responder to RPC calls.
I guess it might be a version or configuration difference on the embedded device.
When starting up, it does this:
```python
sdbus.set_default_bus(sdbus.sd_bus_open_system())
network_manager = NetworkManager()
```
Then the server waits on a RabbitMQ queue (implemented using aio-pika) to read RPC requests.
Later, when an RPC request to return the states of devices comes in, this gets called (through the RPC method below):
```python
async def get_device_states(**kwargs: Any) -> DeviceStates:
    """Closure using network_manager for returning connection states"""
    dev_states: DeviceStates = []
    kwargs = kwargs.get("kwargs", kwargs)  # workaround for kwargs["kwargs"]
    device_paths = await network_manager.get_devices()
    for device_path in device_paths:
        if not await (device := Device(device_path)).is_hw():
            continue
        dev_states.append(await get_device_jsondict(device, **kwargs))
    return dev_states


@server.registry.add  # type: ignore
async def device_states(**kwargs: Any) -> DeviceStates:
    """RPC method, calls get_device_states() to return connection states"""
    connection_state_list = await get_device_states()
    await trigger_connectivity_check()
    server.logger.info(server.pformat(connection_state_list))
    return connection_state_list
```
No problem was shown during development on my notebook using Ubuntu 21.04.

dbus-daemon versions:
- Ubuntu 21.04 (development machine): 1.12.20
- Embedded device: 1.12.16

Is there a dbus-daemon build or config option which might affect this in different ways?
Is there some special python version on the embedded device?
The daemon runs inside a docker container based on Ubuntu 20.04-armhf. I also run it directly outside of the container: it is a pyinstaller archive, so it runs as long as the glibc is new enough. The issue happens in both cases, inside the container and outside, directly on the host OS.

The python running the daemon is packaged with the daemon code using `pyinstaller --onefile` inside an Ubuntu 20.04 armhf container created for building the pyinstaller archive. The python that runs pyinstaller, and which gets packaged into the archive, is the python of 20.04-armhf:
```
@armhf-focal:~ $ python3 -V
Python 3.8.10
```
I would try running python with the `-X dev` flag (or at least set the event loop to debug mode).
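For example, a sketch assuming the daemon has a `main()` entry coroutine:

```python
import asyncio


async def main() -> None:
    ...  # daemon startup and RPC loop would go here


# Equivalent to launching with `python -X dev`: in asyncio debug mode, callbacks
# that block the event loop for longer than 100 ms are logged.
asyncio.run(main(), debug=True)
```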
What is weird is that error code 110 is ETIMEDOUT on my machine, but the sdbus manual does not say that ETIMEDOUT can be returned by `sd_bus_process`.
However, it is possible that error codes differ on different architectures.
I would also try to get logs from the D-Bus daemon. (maybe it can reveal something)
ETIMEDOUT might be returned from the socket itself. Not sure. Definitely looks like something is wrong with the connection. Do the blocking methods have the same issues?
Could I extend python-sdbus so I can call the sdbus function `sd_bus_close` to close the D-Bus connection?
I'd then close the connection after I've finished the batch of D-Bus calls to the NetworkManager, and reopen a new one when I'm going to send new requests.
Definitely possible.
It should already close if all the references to the bus have been dropped. The issue is that all created interfaces keep references to the bus. Also, if you set the default bus, the reference is kept around.
By the way there is an option to connect over SSH in sdbus.
I added SdBus.close(): https://github.com/python-sdbus/python-sdbus/commit/4b72d0f70bb0d1b5b174025a99ad1c8fa71bf473
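Rough usage sketch (assuming you keep your own reference to the bus object; `close()` is the method added in the commit above):

```python
import sdbus

bus = sdbus.sd_bus_open_system()   # dedicated system bus connection
sdbus.set_default_bus(bus)

# ... run the batch of NetworkManager calls here ...

bus.close()  # drop the connection explicitly once the batch is done
```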
I have plans to expose the entire sdbus API as a low-level API, but only after the high-level API is finished.
thanks! I use `sdbus.set_default_bus()` (with `sdbus.sd_bus_open_system()`). So I also need to clear the default bus to call `sdbus.set_default_bus(sdbus.sd_bus_open_system())` again. Looking for it, but in case you are faster, it would be great!
I am thinking that `sd_bus_start()` can reconnect the existing bus. I will add that call.

I added it: https://github.com/python-sdbus/python-sdbus/commit/86f87e4367d870314765fb71f77d59b1ba4db22d

I am not entirely sure how it will work. It might need `sd_bus_set_address` calls before, but those values might be set by `sd_bus_open_system` beforehand.
> thanks! I use `sdbus.set_default_bus()` (with `sdbus.sd_bus_open_system()`). So I also need to clear the default bus to call `sdbus.set_default_bus(sdbus.sd_bus_open_system())` again. Looking for it, but in case you are faster, it would be great!
You would need to recreate all interface objects after switching the default bus. Currently the bus of the interface is just kept as an attribute. (You might be able to manually change it; it is located at `_attached_bus`.)
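Something along these lines (untested sketch; recreating the proxy is the safe variant, patching `_attached_bus` is the manual change I mentioned):

```python
import sdbus
from sdbus_async.networkmanager import NetworkManager

new_bus = sdbus.sd_bus_open_system()   # open a fresh connection
sdbus.set_default_bus(new_bus)         # interfaces created from now on use it

network_manager = NetworkManager()     # safe: recreate the proxy on the new bus

# manual alternative: point an existing proxy at the new bus via the private attribute
# network_manager._attached_bus = new_bus
```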
@igo95862 Many thanks for all the great tips and the commits! Here are the resulting insights:
Getting logs from the D-Bus daemon reveals the cause:
```
dbus-daemon: [system] Connection has not authenticated soon enough, closing it (auth_timeout=30000ms, elapsed: 30003ms)
```
The 30s are hard-coded: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=818362#10
The same messages were caused by an agent using sdbus this March as well: https://community.dynatrace.com/t5/Dynatrace-Open-Q-A/WARNING-OneAgent-1-232-1-233-can-cause-a-slow-memory-leak-on/m-p/182536 ("issue is caused by gathering data via sdbus in OneAgent. Hotfix is already available").
This http://doc.mawan.de/dbus/README.Debian says that the `auth_timeout` was an (unsuccessful) counter-measure against local denial-of-service attacks through the dbus-daemon by opening lots of connections to it.
Apparently, newer dbus-daemons implemented a better fix where they look at which user id tries to connect to the daemon (from the 1st link):
> In general we can't tell whose a connection is until it has authenticated, but on Linux with the default system bus configuration we can, so in newer upstream versions we might be able to mitigate this sort of thing by making uid 0 immune to these limits. Would that solve this for you?
- When the daemon runs a periodic task that makes a D-Bus call (e.g. `NM.check_connectivity()`), the timeout disappeared.

Observations from trying your commits:

- Whenever `SdBus.close()` was used, reading an async property from `NetworkDeviceGeneric` always hung the task, despite my efforts to open a new bus or restart it, even without having `set_default_bus()` in the program.

As I need the periodic task anyway, I am keeping it and using it as the solution to prevent getting disconnected after 30 seconds without authentication (see the sketch below).
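The keep-alive task is essentially this (minimal sketch; the 10-second interval is arbitrary, it just has to stay well below the 30-second `auth_timeout`):

```python
import asyncio

from sdbus_async.networkmanager import NetworkManager


async def keep_dbus_alive(nm: NetworkManager, interval: float = 10.0) -> None:
    """Issue a cheap D-Bus call periodically so the connection never sits idle."""
    while True:
        await nm.check_connectivity()
        await asyncio.sleep(interval)


# started once at daemon startup:
# asyncio.create_task(keep_dbus_alive(network_manager))
```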
As far as I can tell, the start call for `sd_bus_start()` can be safely reverted, and `bus.close()` always caused a hang in the await on the `NetworkDeviceGeneric` instance, so you can back out/undo these commits.
Many Thanks and best Regards! Bernhard
Hmm. Any idea why it only happens on embedded device? Is it because of docker container?
There are some calls related to negotiation in sdbus. Maybe they are required?
> - Whenever `SdBus.close()` was used, reading an async property from `NetworkDeviceGeneric` always hung the task, despite my efforts to open a new bus or restart it, even without having `set_default_bus()` in the program.
Is it just properties that hang up? What about methods or signals? Could be that the cancellation is not implemented properly for properties.
> As far as I can tell, the start call for `sd_bus_start()` can be safely reverted, and `bus.close()` always caused a hang in the await on the `NetworkDeviceGeneric` instance, so you can back out/undo these commits.
I was planning on providing the entire sdbus API anyway. Maybe someone will use these functions.
> Hmm. Any idea why it only happens on embedded device? Is it because of docker container?
No, the dbus-daemon does not run in a container, but it is the old dbus-daemon 1.12.16, which might still have the useless 30-second `auth_timeout`, even though, running as root, it could check the user and do far better rate-limiting based on the user id of the connecting process.
> There are some calls related to negotiation in sdbus. Maybe they are required?
I have no time right now, but here is the traceback:
```
Exception in callback SdBus.drive() base_events.py:1707
handle: <Handle SdBus.drive()>
Traceback (most recent call last):
  File "asyncio/events.py", line 81, in _run
sd_bus_internals.SdBusLibraryError: Line: 270. return_value in function SdBus_drive returned error: 110
Exception in callback SdBus.drive()
```
It is printed in a loop, sometimes very fast, even consuming more than 20% of CPU.
What is really needed is to raise some higher-level exception so that the process can choose to terminate itself in an orderly way.
> - Whenever `SdBus.close()` was used, reading an async property from `NetworkDeviceGeneric` always hung the task, despite my efforts to open a new bus or restart it, even without having `set_default_bus()` in the program.
>
> Is it just properties that hang up? What about methods or signals? Could be that the cancellation is not implemented properly for properties.
Probably, the call to `NetworkDeviceGeneric(dbus_path)` succeeds, but that does not await and apparently does not talk to D-Bus.
The most important thing would be to raise the timeout error code to the application level, so it can terminate (in fact, it should, unless a recovery is tested) should it ever occur again.
I will look in to better terminating connections and propagating the errors.
> Probably, the call to `NetworkDeviceGeneric(dbus_path)` succeeds, but that does not await and apparently does not talk to D-Bus.
Yes. It only initiates the python class without talking to D-Bus.
> The most important thing would be to raise the timeout error code to the application level, so it can terminate (in fact, it should, unless a recovery is tested) should it ever occur again.
The current code around the event loop and polling connection was basically a minimum viable product.
I added a test for closing the bus while waiting on a method, and it seems to work correctly straight away: https://github.com/python-sdbus/python-sdbus/commit/a27b8dd45810b43d831e2575dcda40879c7ce6f2
I pushed commit https://github.com/python-sdbus/python-sdbus/commit/9fea8cc2170ba853a9c3d809d884063f86782d62 to improve the error handling when an error occurs during sdbus processing.
Previously the 20% CPU usage was probably caused by the file descriptor watcher not being removed when the error raised by `sd_bus_process` was not ECONNRESET. Now any error will cause the reader to be removed.
Now the bus is pretty much unusable if an error occurred in processing, because the reader is going to be unregistered but not re-registered. I am not sure if this is the correct behavior, but given that `sd_bus_start` cannot be called to restart the bus, I believe it is.
The question for me is whether `sd_bus_close` should be called to signal any waiting methods to be cancelled. I tried killing the dbus daemon and that also seems to cancel waiting methods correctly.
@bernhardkaindl Any updates on the issue? I wonder how will my latest fixes for sdbus interact with it.
Works great! When I let the server run into the timeout to get disconnected, I get:
```
method unhandled exception DbusNoReplyError('Connection terminated')
Traceback:
...
    ... = await network_manager.get_devices()
  File ".local/lib/python3.9/site-packages/sdbus/dbus_proxy_async.py", line 113, in _call_dbus_async
    reply_message = await interface._attached_bus.call_async(
sdbus.dbus_exceptions.DbusNoReplyError: Connection terminated
```
Thanks!
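For anyone else hitting this, a minimal sketch of how the daemon can react to the error (`DbusNoReplyError` is the exception from the traceback above; `device_states` and `server` are from the snippet earlier in this thread; exiting so the supervisor restarts the daemon is my choice, not something sdbus mandates):

```python
import os

from sdbus.dbus_exceptions import DbusNoReplyError


async def device_states_safe(**kwargs):
    """Wrapper around the RPC handler: if the bus connection died, log and exit
    so the supervisor (docker/systemd) restarts the daemon with a fresh bus."""
    try:
        return await device_states(**kwargs)
    except DbusNoReplyError:
        server.logger.error("D-Bus connection terminated, exiting for restart")
        os._exit(1)  # blunt but reliable; a graceful loop shutdown would also work
```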
Hi, I am working on a daemon which is running on a remote device.
It uses `sdbus_async.networkmanager` to connect to the NetworkManager of that device. `sdbus` 0.8.5 and 0.9.0 both show the problem. The behavior I see is the following:
When the daemon, after connecting to the NetworkManager, executes sdbus calls to it immediately, the calls succeed.
When the daemon instead waits for more than a few seconds before sending a request, this happens:
When the daemon waits even longer, this happens in an endless loop:
Is there a way to recover from these conditions?
I saw a similar error in https://github.com/python-sdbus/python-sdbus/issues/1 - Maybe it helps to find the issue.