ros-industrial / ros_canopen

CANopen driver framework for ROS (http://wiki.ros.org/ros_canopen)
GNU Lesser General Public License v3.0
328 stars 267 forks source link

NMT BootUp state transition is not noticed #435

Open PaulVerhoeckx opened 3 years ago

PaulVerhoeckx commented 3 years ago

On initialization the reset communication NMT command is send (node.cpp#L47) after which it will wait until the NMT state BootUp (=0) is reached. Occasionally this state is missed, due to the fast succession of the state PreOperational. Resulting in the following stack trace:

[ INFO] [1622561315.623773805]: Using fixed control period: 0.010000000
[ INFO] [1622561317.481039265]: Initializing...
[ INFO] [1622561317.482258080]: Current state: 1 device error: system:0 internal_error: 0 (OK)
[ INFO] [1622561317.483331782]: Current state: 2 device error: system:0 internal_error: 0 (OK)
[ INFO] [1622561327.493862171]: Current state: 2 device error: system:125 internal_error: 0 (OK)
[ INFO] [1622561327.494414179]: Current state: 2 device error: system:0 internal_error: 0 (OK)
[ INFO] [1622561327.495005929]: Current state: 0 device error: system:0 internal_error: 0 (OK)
[ INFO] [1622561327.495577834]: Current state: 0 device error: system:0 internal_error: 0 (OK)
[ERROR] [1622561327.496872067]: CAN not ready
[ERROR] [1622561327.497535175]: Initializing failed: could not reset node '1'

Using print statements I found the problem. A switch to NMT state 0 was made here, causing wait_until() to return. However, before the while condition was evaluated (here), state_ had already transitioned to PreOperational. This prevents breaking the while loop, again triggering wait_until(), which now times out.

As a workaround I made reset_com() wait on the state PreOperational, but this does not guarantee a reset of the node. If someone knows a good solution, I am willing to make a PR.

mathias-luedtke commented 3 years ago

Using print statements I found the problem. [...]

Add print statement changes the timing, I am not sure if this is the root cause.

Resulting in the following stack trace:

[ERROR] [1622561327.496872067]: CAN not ready
[ERROR] [1622561327.497535175]: Initializing failed: could not reset node '1'

It looks like your CAN bus stops working in between. (?)

As a workaround I made reset_com() wait on the state PreOperational

If I recall it correctly not all nodes send PreOperational (it depends on the heartbeat interval?) That's why the state get's set to PreOperational right after the BootUp message without further checks.

I think this might not happen with our devices, because the heartbeat interval is 0 per default and gets activated after the reset.

If someone knows a good solution, I am willing to make a PR.

I would add a special variable to track the BootUp message and check this instead of state_ in a wait_for_bootup function.