ros / ros_comm

ROS communications-related packages, including core client libraries (roscpp, rospy, roslisp) and graph introspection tools (rostopic, rosnode, rosservice, rosparam).
http://wiki.ros.org/ros_comm
752 stars 911 forks source link

`rosgraph.is_master_alive()` has unpredicable delay with certain URIs #2194

Open rr-tom-noble opened 2 years ago

rr-tom-noble commented 2 years ago

Issue

I've been having issues getting a remote node to detect when master is down (detailed here) which I've pinpointed down to the rosgraph.is_master_alive() call I'm using to check the connection. In short, certain IPs cause the method to take an awful long time compared to others.

Code

Below is the simplest example I've been able to create to reproduce the issue. Master is running at 127.0.1.1.

if __name__ == "__main__":
    rospy.init_node("test_node")
    uri = "http://192.168.0.1:11311"
    status = "Online!" if rosgraph.is_master_online(uri) else "Offline!"
    print(status)

Which yields a result of

rosrun test_package test.py
(2m 10s later...)
Offline!

uri = "http://127.0.0.1:11311" yields an instantaneous result of Online! uri = "http://1.2.3.4:11311" yields an instantaneous result of Offline!

After playing around with different IPs it seems that the issue occurs with IPs of the form 192.168.x.x, though I'm unsure if that range of IPs is specific to my machine.

Potential Solution

The workaround I'm currently using is to wrap the method call in a timeout.

@contextmanager
def timeout(time):
    signal.signal(signal.SIGALRM, raise_timeout)
    signal.alarm(time)
    try:
        yield
    except TimeoutError:
        pass
    finally:
        signal.signal(signal.SIGALRM, signal.SIG_IGN)

def raise_timeout(signum, frame):
    raise TimeoutError

def is_master_alive(uri, time):
    with timeout(time):
        return rosgraph.is_master_online(uri)

if __name__ == "__main__":
    rospy.init_node("test_node")
    uri = "http://192.168.0.1:11311"
    status = "Online!" if is_master_alive(uri, time=2) else "Offline!"
    print(status)

This feels like a lot of unnecessary code, so I think it'd be worthwhile to add an optional timeout parameter to rosgraph.is_master_alive(). The fact that the method returns at all suggests to me that there's some level of timeout already happening, but I think it'd be nice to allow it to be user-defined.

I'm happy to raise a PR if someone could point me in the right direction.

rr-tom-noble commented 2 years ago

After adding some prints, I've managed to narrow it down to this line

rosrun test_package test.py
Building Master object
Building master uri
Calling Master._reinit()
Parsing http host and port
Creating ServerProxy object
Created ServerProcy object
Making caller id
Built Master object
Getting PID