mavlink / mavros

MAVLink to ROS gateway with proxy for Ground Control Station
Other
847 stars 983 forks source link

Persistent network disconnection causes a crash #1918

Open JTomasi opened 8 months ago

JTomasi commented 8 months ago

Issue details

When running mavros via UDP and a network/radio disconnection event occurs, an intended retry in the libmavconn udp.cpp file can cause unintended, potentially dangerous behaviour.

Overview:

In the libmavconn/src/udp.cpp send_message() function, MAVROS is using boost's IO Service (specifically uses post() on lines 249 & 275) to make an async call of do_sendto(). The do_sento() function uses async_send_to() from the boost ip::udp::socket class to read from a message queue and send the data to an endpoint. During a prolonged network disconnection, the system can crash.

Specifics

When there is a network/radio disconnection during message transmission, async_send_to() returns an asio::error::network_unreachable error. This error is caught on line 324 where the issue is logged. For this particular type of error, the function does not return. Rather, the function continues to run and there is a comment in the code indicating that this is intended:

    if (error == asio::error::network_unreachable) {
        CONSOLE_BRIDGE_logWarn(PFXd "sendto: %s, retrying", sthis->conn_id, error.message().c_str());
        // do not return, try to resend
    } else if (error) {
        CONSOLE_BRIDGE_logError(PFXd "sendto: %s", sthis->conn_id, error.message().c_str());
        sthis->close();
        return;
    }

Since the message queue is not empty, execution continues. Since the network is disconnected, no bytes are transferred and the position in the message buffer is not modified. Execution continues to line 348, in which do_sendto() is recursively called. If the network does not reestablish, this causes a loop in which async_send_to() continues to error out with asio::error::network_unreachable and without any data being transmitted. This recalling of do_sendto() will loop continuously and basically hang. Eventually, the recursive mutex will reach the maximum number of locks that can be applied which will cause a crash, or, if we continue to call send_message(), other threads will wait until the recursive mutex is unlocked which will never happen. This will cause all threads to hang until eventually we run out of available threads.

If there is any misunderstanding in the behaviour of this function on my end, please let me know.

Questions

Is this an intentional design choice for Mavros?

Is there a better/safer way to handle this particular issue when the radio fails without continually waiting and crashing?

MAVROS version and platform

Mavros: 1.17.0 ROS: Noetic Ubuntu: 22.04