cob4-5 arm right in quick stop

floweisshardt commented 7 years ago

after resetting the arm reference position with the Schunk MTS tool, we experience strange behaviour of the arm (not sure if it was there before too).

I sometimes can init, but then the modules show quick stop or emergency error. So it is impossible to move the arm. Recovering says "True" and in diagnostics you can very shortly see OK, but the quick stop or emergency stop is active again. see candump https://gist.github.com/ipa-fmw/d2355a5ad93af345ccf23b35637adb63

Most of the time I cannot init. Either it says:

robot@cob4-5-b1:~$ rosservice call /arm_right/driver/init
success: False
message: Internal limit active

or

robot@cob4-5-b1:~/git/setup_cob4/cob-pcs$ rosservice call /arm_right/driver/init
success: False
message: /u/robot/git/care-o-bot/src/ros_canopen/canopen_master/src/sdo.cpp(429): Throw in function void canopen::SDOClient::transmitAndWait(const canopen::ObjectDict::Entry&, const canopen::String&, canopen::String*)
Dynamic exception type: boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<canopen::TimeoutException> >
std::exception::what: SDO
[canopen::tag_objectdict_key*] = 1a02sub0

robot@cob4-5-b1:~/git/setup_cob4/cob-pcs$ rosservice call /arm_right/driver/init
success: False
message: /u/robot/git/care-o-bot/src/ros_canopen/canopen_master/src/sdo.cpp(429): Throw in function void canopen::SDOClient::transmitAndWait(const canopen::ObjectDict::Entry&, const canopen::String&, canopen::String*)
Dynamic exception type: boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<canopen::TimeoutException> >
std::exception::what: SDO
[canopen::tag_objectdict_key*] = 606bsub0

see candump https://gist.github.com/ipa-fmw/13ec36be17b7af4bebf7f9862bf394d3

arm config is:

rosparam get /arm_right/
arm_right_1_joint_position_controller: {joint: arm_right_1_joint, required_drive_mode: 1,
  type: position_controllers/JointPositionController}
arm_right_1_joint_velocity_controller: {joint: arm_right_1_joint, required_drive_mode: 2,
  type: velocity_controllers/JointVelocityController}
arm_right_2_joint_position_controller: {joint: arm_right_2_joint, required_drive_mode: 1,
  type: position_controllers/JointPositionController}
arm_right_2_joint_velocity_controller: {joint: arm_right_2_joint, required_drive_mode: 2,
  type: velocity_controllers/JointVelocityController}
arm_right_3_joint_position_controller: {joint: arm_right_3_joint, required_drive_mode: 1,
  type: position_controllers/JointPositionController}
arm_right_3_joint_velocity_controller: {joint: arm_right_3_joint, required_drive_mode: 2,
  type: velocity_controllers/JointVelocityController}
arm_right_4_joint_position_controller: {joint: arm_right_4_joint, required_drive_mode: 1,
  type: position_controllers/JointPositionController}
arm_right_4_joint_velocity_controller: {joint: arm_right_4_joint, required_drive_mode: 2,
  type: velocity_controllers/JointVelocityController}
arm_right_5_joint_position_controller: {joint: arm_right_5_joint, required_drive_mode: 1,
  type: position_controllers/JointPositionController}
arm_right_5_joint_velocity_controller: {joint: arm_right_5_joint, required_drive_mode: 2,
  type: velocity_controllers/JointVelocityController}
arm_right_6_joint_position_controller: {joint: arm_right_6_joint, required_drive_mode: 1,
  type: position_controllers/JointPositionController}
arm_right_6_joint_velocity_controller: {joint: arm_right_6_joint, required_drive_mode: 2,
  type: velocity_controllers/JointVelocityController}
arm_right_7_joint_position_controller: {joint: arm_right_7_joint, required_drive_mode: 1,
  type: position_controllers/JointPositionController}
arm_right_7_joint_velocity_controller: {joint: arm_right_7_joint, required_drive_mode: 2,
  type: velocity_controllers/JointVelocityController}
driver:
  bus: {device: can2}
  defaults:
    dcf_overlay: {604Csub1: '1', 604Csub2: '24000'}
    eds_file: common/Schunk_0_63.dcf
    eds_pkg: cob_hardware_config
    vel_to_device: rint(rad2deg(vel)*250)
  nodes:
  - {id: 51, name: arm_right_1_joint}
  - {id: 52, name: arm_right_2_joint}
  - {id: 53, name: arm_right_3_joint}
  - {id: 54, name: arm_right_4_joint}
  - {id: 55, name: arm_right_5_joint}
  - {id: 56, name: arm_right_6_joint}
  - {id: 57, name: arm_right_7_joint}
  sync: {interval_ms: 10, overflow: 0}
joint_group_position_controller:
  joints: [arm_right_1_joint, arm_right_2_joint, arm_right_3_joint, arm_right_4_joint,
    arm_right_5_joint, arm_right_6_joint, arm_right_7_joint]
  required_drive_mode: 1
  type: position_controllers/JointGroupPositionController
joint_group_velocity_controller:
  joints: [arm_right_1_joint, arm_right_2_joint, arm_right_3_joint, arm_right_4_joint,
    arm_right_5_joint, arm_right_6_joint, arm_right_7_joint]
  required_drive_mode: 2
  type: velocity_controllers/JointGroupVelocityController
joint_names: [arm_right_1_joint, arm_right_2_joint, arm_right_3_joint, arm_right_4_joint,
  arm_right_5_joint, arm_right_6_joint, arm_right_7_joint]
joint_state_controller: {publish_rate: 50, type: joint_state_controller/JointStateController}
joint_trajectory_controller:
  action_monitor_rate: 10
  constraints:
    arm_right_1_joint: {goal: 0.1, trajectory: 0.1}
    arm_right_2_joint: {goal: 0.1, trajectory: 0.1}
    arm_right_3_joint: {goal: 0.1, trajectory: 0.1}
    arm_right_4_joint: {goal: 0.1, trajectory: 0.1}
    arm_right_5_joint: {goal: 0.1, trajectory: 0.1}
    arm_right_6_joint: {goal: 0.1, trajectory: 0.1}
    arm_right_7_joint: {goal: 0.1, trajectory: 0.1}
    goal_time: 0.6
    stopped_velocity_tolerance: 0.05
  joints: [arm_right_1_joint, arm_right_2_joint, arm_right_3_joint, arm_right_4_joint,
    arm_right_5_joint, arm_right_6_joint, arm_right_7_joint]
  required_drive_mode: 7
  state_publish_rate: 25
  stop_trajectory_duration: 0.5
  type: position_controllers/JointTrajectoryController
max_command_silence: 0.5

diagnostics screenshot arm_right_quick_stop

floweisshardt commented 7 years ago

@ipa-mdl: any idea?

floweisshardt commented 7 years ago

@ipa-uhr: FYI

mathias-luedtke commented 7 years ago

Quickstop means that the chain halted because of a failure with one module (or as a result of a service call to halt).

message: Internal limit active

Module reached an internal (end) limit: if the hardware was aligned correctly then the (MTS) config is broken.

Not sure about the SDO errors, do you use the latest overlay? Update: Both relate to velocity_demand_value

floweisshardt commented 7 years ago

when updating the reference position we did not modify anything expect the baudrate and canID as it was reset by the MTS tool. Can I simply copy the config from the left arm?

Can you see which joint throws the internal limit? We only resetted module 1 (ID: 51)

floweisshardt commented 7 years ago

here's another series of calls to init

test@cob4-5-b1:~$ rosservice call /arm_right/driver/init
success: False
message: /u/robot/git/care-o-bot/src/ros_canopen/canopen_master/src/sdo.cpp(429): Throw in function void canopen::SDOClient::transmitAndWait(const canopen::ObjectDict::Entry&, const canopen::String&, canopen::String*)
Dynamic exception type: boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<canopen::TimeoutException> >
std::exception::what: SDO
[canopen::tag_objectdict_key*] = 1600sub1

test@cob4-5-b1:~$ rosservice call /arm_right/driver/init
success: False
message: /u/robot/git/care-o-bot/src/ros_canopen/canopen_master/src/sdo.cpp(429): Throw in function void canopen::SDOClient::transmitAndWait(const canopen::ObjectDict::Entry&, const canopen::String&, canopen::String*)
Dynamic exception type: boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<canopen::TimeoutException> >
std::exception::what: SDO
[canopen::tag_objectdict_key*] = 6061sub0

test@cob4-5-b1:~$ rosservice call /arm_right/driver/init
success: False
message: /u/robot/git/care-o-bot/src/ros_canopen/canopen_master/src/sdo.cpp(429): Throw in function void canopen::SDOClient::transmitAndWait(const canopen::ObjectDict::Entry&, const canopen::String&, canopen::String*)
Dynamic exception type: boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<canopen::TimeoutException> >
std::exception::what: SDO
[canopen::tag_objectdict_key*] = 607asub0

test@cob4-5-b1:~$ rosservice call /arm_right/driver/init
success: True
message: ''

and recover says:

test@cob4-5-b1:~$ rosservice call /arm_right/driver/recover 
success: False
message: Could not set transition; Could not enable motor
test@cob4-5-b1:~$ rosservice call /arm_right/driver/recover 
success: True
message: ''

mathias-luedtke commented 7 years ago

Can I simply copy the config from the left arm?

I don't know, I have never configured the arms, ask @ipa-tif

Can you see which joint throws the internal limit?

Node 53. But it is just a single message which looks like corrupt data. I believe it did not stop because of the internal limit flag (it is ignored during initialization), but because of a sudden "mode switch" to 255/-1 (instead of 0)

Can you provide me a dump of a SDO error at initialization?

floweisshardt commented 7 years ago

@ipa-mdl: it seems as I cannot lock it down to only one error message (quick stop, emergency stop, transition, enable motor, internal limit). also the SDO erros are quite numerous (607asub0, 6061sub0, 1600sub1, 606bsub0, 1a02sub0, ...). I think I need a good hint how to lock down the error or what module might have the wrong configuration or what module is broken...

floweisshardt commented 7 years ago

candump with SDO error at init is here: https://gist.github.com/ipa-fmw/13ec36be17b7af4bebf7f9862bf394d3

mathias-luedtke commented 7 years ago

Is this one single candump or multiple files? I cannot open the website properly..

floweisshardt commented 7 years ago

is is only for can2 (arm_right) on cob4-5-t1

mathias-luedtke commented 7 years ago

is is only for can2 (arm_right) on cob4-5-t1

I meant: did you upload multiple files for each failure or just one. I managed to download it as zip file and it is just one file.

floweisshardt commented 7 years ago

this is just one file per gist link.

mathias-luedtke commented 7 years ago

this is just one file per gist link.

I missed the second link ;)

mathias-luedtke commented 7 years ago

it seems as I cannot lock it down to only one error message

Node 53 is sending an unexpected response. I am not sure which part of it forces the driver to quick-stop - either internal limit or the "mode switch". This might happen while the driver expects a response (in another thread) from another node, but was interrupted by the quick-stop. Do not hunt the SDO errors, they are just a symptom (=last error message, they do not accumulate).

It look like a bug in the schunk firmware, it looks like the actual position was sent as status word.

 (1473066397.980450)  can2  3B5   [8]  61 CD FE FF 00 00 00 00
 (1473066397.981448)  can2  1B5   [3]  61 CD FE

['(1473066397.980450)', 'can2', 'TPDO3', 53, 'actual_position', 'FFFECD61', 'actual_velocity', '00000000']
['(1473066397.981448)', 'can2', 'TPDO1', 53, 'status_word', 'CD61', 'op_mode_display', 'FE']

floweisshardt commented 7 years ago

ok, so what do you recomend? Can we do anything on the ROS/Canopen level? or should I reset the firmware?

are you sure it is module 54 which causes all the trouble?

mathias-luedtke commented 7 years ago

are you sure it is module 54 which causes all the trouble?

No! Node 53.

ok, so what do you recomend? Can we do anything on the ROS/Canopen level? or should I reset the firmware?

Ask Schunk if this is a known issue with the old firmware (or whatever version the arm is running)

floweisshardt commented 7 years ago

I contacted Schunk and send the candumps to F. Friedrichs. But unfortunately their firewall prevents them from viewing github pages!

SCHUNK-Osswald commented 7 years ago

Hello, this is Dirk Osswald from SCHUNK (some people here are allowed to access github...)

I'm going to analyze your logfiles, but that will take some time (preparing the files for CANoe, finding the interesting parts,...). Right now I cannot relieably guess how long that will take, but I fear it will be more than an hour. In the meantime you could provide me with some general information:

Which CANopen Node-IDs are used and which devices are that? (53/54 are not standard SCHUNK LWA4 IDs, so I need to know which is PRL+ (shoulder) and which is ERB145 / ERB 115)
What PDO mapping do you use? The default one or do you change the mapping? In the latter case which objects are mapped where?
other infos: Baudrate, ...

floweisshardt commented 7 years ago

This is a Care-O-bot 4 arm. So we're using a PRL+100 (SerialNo 0031072, CanID 51) as first module and then three ERB (CanIDs: 52/53, 54/55, 56/57).
Baudrate is 500k
you can find our dcf file at https://github.com/ipa320/cob_robots/blob/indigo_dev/cob_hardware_config/common/Schunk_0_63.dcf

floweisshardt commented 7 years ago

@ipa-mdl: do you have any more CANOpen related infos for Dirk?

mathias-luedtke commented 7 years ago

Readable PDO mapping: https://github.com/ipa320/canopen_test_utils/blob/indigo-devel/scripts/schunk_mapping.py

Readable candumps: https://gist.github.com/ipa-mdl/fa95d202dd67eca9a4d32a446109eda1

@SCHUNK-Osswald: please note we use as velocity scaling of 1/250 deg/s to fit into the 16 bit command

SCHUNK-Osswald commented 7 years ago

Just a quick query: I got 2 separate log files from Frank Friedrichs: arm_right_quick_stop and arm_right_quick_stop2

Whats the difference / story / symptomatic behind? Is one file the "ok-case" and the other the "failure" case?

floweisshardt commented 7 years ago

I just saw that I get the following error while beeing connected through USB to the PRL in the MTS Software:

Modul 10 wird initalisiert
Modul 10 antwortet GET_CONFIG -> Modulinformation
    -> Modultyp: PRL+ 100
    -> Bestellnummer: 306922
    -> Software Version: 0.63
    -> Protokoll Version: 3
    -> Platine Version: 6.11
    -> 2013-07-04 09:53:54 PTAo
Modul 10 antwortet GET_CONFIG -> Konfiguration: Falsche Daten CRC!

Is this ok or might that be a hint for the error?

floweisshardt commented 7 years ago

@SCHUNK-Osswald: the two files are related to the output shown in the original post of this ticket (links to gist). In both cases it was not possible to move the modules.

mathias-luedtke commented 7 years ago

@SCHUNK-Osswald: just different tries, the SDO error are just consequential, the major problem is:

(1473066397.980450)  can2  3B5   [8]  61 CD FE FF 00 00 00 00 # actual positon and velocity
(1473066397.981448)  can2  1B5   [3]  61 CD FE # status word and operation mode

SCHUNK-Osswald commented 7 years ago

OK, thx - I will eventually have to read this longish thread then ;-)

Your question regarding the wrong Data CRC is most likely not a good hint at the error. But I will take that into cosideration anyway.

mathias-luedtke commented 7 years ago

@ipa-fmw: You could test the arm without node 53, just to make sure

floweisshardt commented 7 years ago

running without 53 doesn't solve the problem. here's the candump with one time "init" and one time "recover". Both calls succeeded but after approximately 1sec all modules report an error again

Could not read error error_register

floweisshardt commented 7 years ago

if I randomly uncomment just one module, I cannot initialize the chain. If I randomly uncomment 3-4 modules (that means 2-4 are active) the chances are better to be able to initialize. Even though the chain is principally working, it is most often not longer than 10sec working without hearing the brakes switch in and recovering is needed.

floweisshardt commented 7 years ago

already midnight in Korea, so I'll give up for today. Thanks to @ipa-mdl and @SCHUNK-Osswald to help tracking down this issue.

As this is not resolved, we unfortunatelly have to solve this remotely.

mathias-luedtke commented 7 years ago

(https://gist.github.com/ipa-fmw/c6494f033767df99a19aa6822fef56a9)

running without 53 doesn't solve the problem

All nodes report EMCY 8250 "RPDO Timeout", and the candump confirms that no data was sent out. Did you see a high CPU/IO load from other ROS nodes?

All these candumps look a little bit suspicious as all incoming messages have the same timestamp. Not sure what's going on. I would test it on another PC with a different CAN dongle.

floweisshardt commented 7 years ago

CPU load is <5% on all 4 cores. Arm left is running on same pc, same can card, same ROS-configuration and is working without problems.

SCHUNK-Osswald commented 7 years ago

I have some observations / further queries to make:

1 candumps do look suspicious to me as well:

(1473066364.289796) can2 3B3 [8] D9 D6 00 00 00 00 00 00 (1473066364.289796) can2 3B4 [8] E3 53 00 00 00 00 00 00 (1473066364.289796) can2 1B4 [3] 60 92 00 (1473066364.289797) can2 3B5 [8] 61 CD FE FF 00 00 00 00 (1473066364.298044) can2 080 0 can2 1B4 [3] 60 92 00 (1473066364.299549) can2 1B3 [3] 60 92 00 (1473066364.299549) can2 3B3 [8] D9 D6 00 00 00 00 00 00 (1473066364.299549) can2 3B4 [8] E3 53 00 00 00 00 00 00 (1473066364.299550) can2 3B5 [8] 61 CD FE FF 00 00 00 00 (1473066364.300527) can2 1B5 [3] 60 92 00 (1473066364.307671) can2 5B6 [8] 60 17 10 00 00 00 00 00 (1473066364.307773) can2 636 [8] 40 00 18 01 00 00 00 00 (1473066364.307912) can2 080 0 can2 5B6 [8] 43 00 18 01 B6 01 00 00 (1473066364.308899) can2 636 [8] 23 00 18 01 B6 01 00 80 (1473066364.309796) can2 1B3 [3] 60 92 00 (1473066364.309796) can2 3B3 [8] 1C DC D9 D6 00 00 00 00 (1473066364.309796) can2 3B4 [8] E3 53 00 00 00 00 00 00`

The parts marked in bold are the actual_position of the PRL100 module. This is at 55001 (0x0000d6d9) all the time, except for single spikes 3604601884 (0xd6d9dc1c). The hex bytes seem to indicate that the low-word of the bytes of the position got shifted into the high-word and the low word filled with random bytes / garbage.

2 The statusword 0x6041/0 and mode of operation display 0x6061/0 have weired outliers as well:

(1473066417.489351) can2 1B3 [3] 60 92 00 (1473066417.489474) can2 633 [8] 40 01 10 00 00 00 00 00 (1473066417.490497) can2 1B4 [3] 07 92 00 (1473066417.490497) can2 3B3 [8] D9 D6 00 00 00 00 00 00 (1473066417.490497) can2 3B4 [8] E3 53 00 00 00 00 00 00 (1473066417.490498) can2 3B5 [8] 61 CD FE FF 00 00 00 00 (1473066417.491496) can2 1B5 [3] 27 92 00 (1473066417.491496) can2 3B6 [8] E0 8D FE FF 00 00 00 00 (1473066417.491497) can2 1B6 [3] 07 82 00 (1473066417.491497) can2 3B7 [8] 33 15 FF FF 00 00 00 00 (1473066417.491497) can2 1B7 [3] 07 82 00 (1473066417.492476) can2 3B8 [8] 16 64 FF FF 00 00 00 00 (1473066417.492477) can2 1B8 [3] 07 82 00 (1473066417.492477) can2 3B9 [8] 78 EF 00 00 00 00 00 00 (1473066417.492477) can2 1B9 [3] 07 82 00 (1473066417.493350) can2 5B3 [8] 4F 01 10 00 00 00 00 00 (1473066417.493496) can2 634 [8] 40 01 10 00 00 00 00 00 (1473066417.494492) can2 5B4 [8] 4F 01 10 00 00 00 00 00 (1473066417.494612) can2 635 [8] 40 01 10 00 00 00 00 00 (1473066417.495496) can2 5B5 [8] 4F 01 10 00 00 00 00 00 (1473066417.495598) can2 636 [8] 40 01 10 00 00 00 00 00 (1473066417.496496) can2 5B6 [8] 4F 01 10 00 00 00 00 00 (1473066417.496611) can2 637 [8] 40 01 10 00 00 00 00 00 (1473066417.497495) can2 5B7 [8] 4F 01 10 00 00 00 00 00 (1473066417.497611) can2 638 [8] 40 01 10 00 00 00 00 00 (1473066417.498495) can2 5B8 [8] 4F 01 10 00 00 00 00 00 (1473066417.498611) can2 639 [8] 40 01 10 00 00 00 00 00 (1473066417.498847) can2 080 0 can2 5B9 [8] 4F 01 10 00 00 00 00 00 (1473066417.500496) can2 1B3 [3] 4F 01 10

The obviously wrong data from the TPD1 of node 51 in the last line is a copy of the SDO request (!) sent to all modules just before.

3 Where is the mode of operation set for the modules? I see that you get the current mode of operation via tpdo1 (mode of operation display 0x6061/0), but I cannot find the setting of the mode of operation 0x6061/0 (e.g. via SDO). Is this just not contained in the logs due to the error?

mathias-luedtke commented 7 years ago

Where is the mode of operation set for the modules?

The driver starts in "no mode" state per default. The mode will be set later, but the driver did not get that far. (For newer firmware version the mode can be set beforehand)

I don't think that this behaviour results from a genuine bug in the firmware or the ROS driver. In addition more than one module if affected.

@ipa-fmw: did you touch the end resistor dip switches?

SCHUNK-Osswald commented 7 years ago

I just analyzed the log files a little more and there are numerous of these "outlier" data for all nodes. Data of Node 53 e.g. somtimes has the position bytes in the place of the velocity bytes while the position bytes are filled with garbage.

The attached image shows the reported positions over time of all nodes from the arm_right_quick_stop2 log as a 2D-plot. position_outliers But actually I do not really trust this log data very much. Since if our firmware did produce such invalid CAN messages with such a high frequency it would not be useable in any way.

So at the moment I do not think that I can be of much help any more - unless of course you have further questions.

floweisshardt commented 7 years ago

in another test today we figured out that

switching can interfaces makes no difference (left arm works on all can devices, right arm does not work on all can devices)
switching cables between can card and first module of the arm does not make a difference
it is still possible to move all joints individually and in small groups (e.g. 3 modules together). But the more joints we're using it is more unlikely to be able to init or recover and move. Although it was possible to move the while arm (7DOF) for a short time, it's still very unreliable.

fmessmer commented 7 years ago

I guess we can close this as the arm was broken anyway and is currently being repaired...@ipa-fmw @ipa-nhg @ipa-tif

floweisshardt commented 7 years ago

let's hope that repairing helps with the issue above. I'll reopen if it does not fix the issue.

ipa320 / cob_robots

cob4-5 arm right in quick stop #533