ros-industrial / ros_canopen

CANopen driver framework for ROS (http://wiki.ros.org/ros_canopen)
GNU Lesser General Public License v3.0
336 stars 271 forks source link

init after EMCY error #337

Closed mistoll closed 4 years ago

mistoll commented 5 years ago

Our system has a emergency switch which cuts the power to the motors. After switching the power off and on we want to be able to get the driver up again. Therefore we call shutdown and init. However that only works sometimes, but I can't tell what's influencing it.

Example:

  1. Start canopen_motor_node
  2. rosservice call /driver/init
  3. switch power off
  4. switch power on
  5. rosservice call /driver/shutdown
  6. rosservice call /driver/init

Services:

$ rosservice call /driver/init
success: True
message: ''
$ rosservice call /driver/shutdown 
success: True
message: ''
$ rosservice call /driver/init
success: False
message: "/tmp/binarydeb/ros-melodic-canopen-master-0.8.0/src/sdo.cpp(429): Throw in function\
  \ void canopen::SDOClient::transmitAndWait(const canopen::ObjectDict::Entry&, const\
  \ canopen::String&, canopen::String*)\nDynamic exception type: boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<canopen::TimeoutException>\
  \ >\nstd::exception::what: SDO\n[canopen::tag_objectdict_key*] = 1800sub1\n"

Output:

[ INFO] [1560942776.528241122]: Using fixed control period: 0.160000000
[ INFO] [1560942779.848377433]: Initializing XXX
[ INFO] [1560942779.851680698]: Current state: 1 device error: system:0 internal_error: 0 (OK)
[ INFO] [1560942779.858183693]: Current state: 2 device error: system:0 internal_error: 0 (OK)
EMCY: 81#00310403000001F2
EMCY: 82#00310403000001F2
[ERROR] [1560942788.029041839]: Node has emergency error; Node has emergency error
error: 4
[ INFO] [1560942789.150166731]: Current state: 2 device error: system:0 internal_error: 4 (controller problems;)
[ INFO] [1560942789.152948146]: Current state: 1 device error: system:0 internal_error: 4 (controller problems;)
ID: 4
error: 4
ID: 4
Did not receive a response message
Did not receive a response message
Did not receive a response message
Did not receive a response message
[ INFO] [1560942793.292043838]: Shuting down XXX
[ERROR] [1560942798.107129103]: CAN not ready; heartbeat problem; heartbeat problem; not operational; not operational; CAN not ready
Did not receive a response message
Did not receive a response message
[ INFO] [1560942805.402526752]: Current state: 0 device error: system:0 internal_error: 4 (controller problems;)
[ INFO] [1560942805.404899560]: Current state: 0 device error: system:0 internal_error: 4 (controller problems;)
[ INFO] [1560942807.438436810]: Initializing XXX
[ INFO] [1560942807.440522840]: Current state: 0 device error: system:0 internal_error: 0 (OK)
[ INFO] [1560942807.442754363]: Current state: 1 device error: system:0 internal_error: 0 (OK)
[ INFO] [1560942807.447204410]: Current state: 2 device error: system:0 internal_error: 0 (OK)
[ INFO] [1560942807.452537958]: Current state: 2 device error: system:125 internal_error: 0 (OK)
[ INFO] [1560942807.455290748]: Current state: 1 device error: system:125 internal_error: 0 (OK)
[ERROR] [1560942808.251477911]: CAN not ready; CAN not ready
Did not receive a response message
Did not receive a response message
Did not receive a response message
[ INFO] [1560942810.485750699]: Current state: 1 device error: system:0 internal_error: 0 (OK)
[ INFO] [1560942810.487097638]: Current state: 0 device error: system:0 internal_error: 0 (OK)
[ INFO] [1560942810.489969340]: Current state: 0 device error: system:0 internal_error: 0 (OK)
^CSegmentation fault

candump.txt

mathias-luedtke commented 5 years ago

Our system has a emergency switch which cuts the power to the motors.

Do you power off the logic as well? Do you restart the CAN device or do you specify restart-ms?

The "CAN not ready" messages after init look suspicious.

mistoll commented 5 years ago

Do you power off the logic as well?

Yes

Do you restart the CAN device or do you specify restart-ms?

Neither. But I'll try.

The "CAN not ready" messages after init look suspicious.

I tried do debug that but without much success. The error starts at https://github.com/ros-industrial/ros_canopen/blob/67172327d83e6308172b09d21916282cb23b99f1/socketcan_interface/include/socketcan_interface/asio_base.h#L71 and 125 is OPERATION_CANCELED. That's all I found out.

mistoll commented 5 years ago

Do you restart the CAN device or do you specify restart-ms?

Neither. But I'll try.

Tried ifup/down and restart-ms with no change in result at all.

mathias-luedtke commented 5 years ago

Please check the error counters as well:

ip -details -statistics link show can0

I guess that the heartbeat (or the SYNC) gets sent to the offline device and fills up the CAN buffer. However, I could not spot this in your candump. To log everthing, you should use:

candump any,0:0,#FFFFFFFF
mistoll commented 5 years ago
$: ip -details -statistics link show can0
2: can0: <NOARP,UP,LOWER_UP,ECHO> mtu 16 qdisc fq_codel state UNKNOWN mode DEFAULT group default qlen 10
    link/can  promiscuity 3068960324 
    can state ERROR-ACTIVE (berr-counter tx 126 rx 0) restart-ms 1000 
      bitrate 1000000 sample-point 0.750 
      tq 83 prop-seg 4 phase-seg1 4 phase-seg2 3 sjw 1
      c_can: tseg1 2..16 tseg2 1..8 sjw 1..4 brp 1..1024 brp-inc 1
      clock 24000000
      re-started bus-errors arbit-lost error-warn error-pass bus-off
      0          0          0          4          4          0         numtxqueues 3068960324 numrxqueues 3068960324 gso_max_size 3068960324 gso_max_segs 3068960324 
    RX: bytes  packets  errors  dropped overrun mcast   
    12839      3055     0       0       0       0       
    TX: bytes  packets  errors  dropped carrier collsns 
    5302       985      0       0       0       0      

It's also worth mentioning that if I restart the node after it failed the first init works. So I assume it's some state in the node causing that error.

I used

candump any,0:0,#FFFFFFFF

to record two candumps. candump.txt reproduces the error, while in candump_working.txt I try to reproduce the error but it works as expected. I can't tell, what's the difference, it's somehow indeterministic. Sometimes it even works over and over again and only fails after rebooting the system. But on other days it won't work a single time.

mistoll commented 4 years ago

Closing this issue, because the problem didn't happen for a long time and the there's no reliable reproduction.