openbmc / phosphor-state-manager

Apache License 2.0
11 stars 21 forks source link

How to handle the exception thrown by bus.process_discard()? #25

Closed powerfulkelly closed 8 months ago

powerfulkelly commented 8 months ago

In the main function of bmc_state_manager_main.cpp, how does it handle the exception thrown by bus.process_discard()? If bus.process_discard() throws an exception, it seems that there is no "try-catch" the exception in the main function of bmc_state_manager_main.cpp. Does this qualify as a bug?

williamspatrick commented 8 months ago

Generally if an exception escapes a process_discard the state of the sd_bus object is destroyed anyhow because most of the sd_bus interactions are at a C-level and are not able to handle exceptions. It's probably best that the application crashes in that case, we collect a core-dump, and debug the issue. Attempting to catch / log the exception and continue on likely results in an un-determined application state.

Those are the two general approaches for unforeseen exceptions:

We've generally chosen the std::terminate approach because a log+continue approach doesn't usually get noticed as quickly. core-dumps are suppose to be gathered by phosphor-dump-collector automatically and can raise awareness of the problem much quicker.

powerfulkelly commented 8 months ago

Thanks for your reply, I fully agree with your point of view. We did encounter this issue where calling bus.process_discard() then returned r = -104 and resulted in a core dump, but we were unable to see the status of the sd_bus at that time. From the core-dump file, we only saw that bus.process_discard() returned r = -104 and thrown an exception, and then the process exited. We didn't capture the journal log, which increases the difficulty for us to further investigate the root cause. Do you have any good ideas to continue debugging this issue?

williamspatrick commented 8 months ago

From errno.h

#define ECONNRESET  104 /* Connection reset by peer */

From manpage on sd_bus_process:

       -ECONNRESET
           The bus connection has been terminated just now.

The systemd code for sd_bus_process: https://github.com/systemd/systemd/blob/04651f7564d48f86d79569b455f96c3b9b79562e/src/libsystemd/sd-bus/sd-bus.c#L3253

Seems to indicate you called "process" on a bus that was already closed.