ipa320 / schunk_robots

www.schunk-modular-robotics.com
29 stars 50 forks source link

Errors 'CAN not ready' when init Schunk lwa4p in Indigo #59

Closed Zunchao closed 6 years ago

Zunchao commented 8 years ago

Hi, I am trying to move the schunk lwa4p arm using the package of indigo version. But errors like 'CAN not ready' 'message: cannot reset node ' '' frequently appear when I call the service /arm/driver/init.

The configuration status is:

-Ubuntu 14.04 64-bit -ROS-Indigo (the most recent update) -SocketCAN(peak_usb) -The schunk_robots and ros_canopen repositories are the most recent ones.

The program running order:

rosrun canopen_test_utils prepare.sh; roslaunch schunk_lwa4p robot.launch; rosservice call /arm/driver/init(OR roslaunch schunk_lwa4p dashboard.launch).

I considered it as an issue of time sequence, as I found that the Sync, SDO and PDO running simutaneously like chaos. So I delayed the sync starting time after the SDO initialization finished(not completely, except the last node), by adding a conditional flag in the function of switchState() in the node.cpp in the package of canopen_master, as follows:

...... int delay_flag=0; ...... case Operational: delay_flag++; if (delay_flag==6) { delayflag=0; if(changed && sync) sync_->addNode(this);} break; ......

The current status is: as long as I have tested, it fails at the first one or two tries(relaunch the robot node) every time I restart the arm, but after that it works stably(no errors and can move the arm).

Errors in the first one or two tries:

...... core service [/rosout] found process[arm/robot_state_publisher-1]: started with pid [14627] process[arm/driver-2]: started with pid [14652] process[arm/arm_controller_spawner-3]: started with pid [14687] process[arm/cob_control_mode_adapter_node-4]: started with pid [14700] [ INFO] [1445943599.536288009]: waitForService: Service [/arm/controller_manager/load_controller] has not been advertised, waiting... process[arm/joint_states_relay-5]: started with pid [14722] [ INFO] [1445943604.536369237]: waitForService: Service [/arm/controller_manager/load_controller] has not been advertised, waiting... [ INFO] [1445943604.598089341]: Initializing XXX [ INFO] [1445943604.598304076]: Current state: 1 device error: system:0 internal_error: 0 (OK) [ INFO] [1445943604.598407122]: Current state: 2 device error: system:0 internal_error: 0 (OK) [ INFO] [1445943609.557082529]: waitForService: Service [/arm/controller_manager/load_controller] has not been advertised, waiting... [ INFO] [1445943614.558829524]: waitForService: Service [/arm/controller_manager/load_controller] has not been advertised, waiting... [ INFO] [1445943617.524258383]: waitForService: Service [/arm/controller_manager/load_controller] is now available. [ INFO] [1445943617.609265224]: joint_trajectory_controller loaded [ INFO] [1445943617.659285657]: joint_group_position_controller loaded [ INFO] [1445943617.669156610]: joint_group_interpol_position_controller loaded [ INFO] [1445943617.719205008]: joint_group_velocity_controller loaded Loaded joint_state_controller [ INFO] [1445943617.779181358]: Switched Controllers. From no_stop_controller_defined to joint_trajectory_controller Started ['joint_state_controller'] successfully [ INFO] [1445943618.011741385]: advertised as /joint_states

[arm/arm_controller_spawner-3] process has finished cleanly log file: /home/zheng/.ros/log/68412552-7c99-11e5-955b-901b0e5c82f2/arm-arm_controller_spawner-3*.log error: 4 [ INFO] [1445943618.682338510]: Current state: 2 device error: system:0 internal_error: 4 (controller problems;) [ INFO] [1445943618.682366799]: Current state: 1 device error: system:0 internal_error: 4 (controller problems;) ID: 4 [ERROR] [1445943618.689081715]: CAN not ready; CAN not ready Did not receivce a response message Did not receivce a response message ......

The question here is: I am not sure whether what I have understood and changed is reasonable or not. The fact is it truly increases the probability of successful initialization. Could anyone give me some instructions about it? Or any other method to solve it? Thanks very much.

fmessmer commented 8 years ago

I guess you are using debians, right? As far as I know quite some fixes have been applied to the ros_canopen source.

It might also be related to https://github.com/ros-industrial/ros_canopen/issues/102 @ipa-mdl

mathias-luedtke commented 8 years ago

I considered it as an issue of time sequence, as I found that the Sync, SDO and PDO running simutaneously like chaos

What do you mean by "chaos"? The time sequence is well defined per CANopen node. Sync starts after the first CANopen node is operational. (It could be even started by another process..)

The currently released version has some bugs.. Please use the source version. I guess you experience https://github.com/ros-industrial/ros_canopen/pull/108.

I will release a new version after https://github.com/ros-industrial/ros_canopen/pull/150 was tested and merged.

mathias-luedtke commented 8 years ago

And @ipa-fxm is right with https://github.com/ros-industrial/ros_canopen/issues/102. Please increase the queue length.

Zunchao commented 8 years ago

@ipa-mdl Thank you for your reply. I am sorry I did not explain clearly. What I mean is,

I traced the communication using PCAN View in Windows, the data are in the tracedata1.xls, where 0908(sheet 1) shows trace by original package, 0924(sheet 2) is the trace by my modified package.

According to the canopen protocol and the means of IDs, in sheet 0908, we can see the configuration for node 3 starts firstly (the yellow 17th raw), and the sync(the green 108 raw) begins immediately after the start of configuration of next node 4(the yellow 107th raw), then the PDO for node 3 and SDO for node 4 are running at the same time(this is what I mean chaos). Actually I don't think it is a necessary time to start the sync(why cannot sync starts after all nodes are operational), so what I did is postpone the running time of sync, like sheet 0928, the sync starts after finishing configuration of all the 6 nodes(green 606th raw).

One more question is, as we can see form the light blue parts in sheet 0908, the nodes which have been configured are reset again from the first node(node 3 in this example). Why all the former nodes are reset repeatedly again and again when configuring the current node. It is just an doubt, as I am not very familiar to the canopen mechanism, I am still knowing nothing where the sync defined in the program, I am learning. Could you give me some advice if You get an answer.

Thanks.

NOTE: I donnot know to attach thetracedata1.xlsx here, so I send it to you by email, please check it.

Zunchao commented 8 years ago

@ipa-fxm Thank you for your reply.

1, I am not sure about what do you mean by 'using debians'. I am using Ubuntu, and the indigo-dev-packages ros_canopen and schunk_robots are built in catkin workspace by git clone the github links, so I can modify it.

2, I have changed the value of queue length to 15, 20, and other numbers, but it seems no obvious differences as the results of default value 10. The thing is, how do I know which value for the txqueuelen is the most suitable one.

3, I also tested in the lowlatency kernel of Ubuntu, it is the same errors and similar performance.

Thanks.

mathias-luedtke commented 8 years ago

ros_canopen and schunk_robots are built in catkin workspace by git clone the github links

Then you are using the source instead of the "debians" (sudo apt-get install ros-indigo-ros-canopen). That's fine if you use the latest versions.

The thing is, how do I know which value for the txqueuelen is the most suitable one.

It should be greater than the number of sent packages per SYNC plus some buffer for SDOs. 15-20 should be suitable for a LWA4P.

then the PDO for node 3 and SDO for node 4 are running at the same time(this is what I mean chaos)

This is the intended behaviour. What's wrong with it?

Actually I don't think it is a necessary time to start the sync

For the initialization of the 402 layer some data needs to be transferred via PDOs. It is not necessary, but see next answer as well.

why cannot sync starts after all nodes are operational

  1. There is no need to
  2. It complicates the start-up logic since the nodes have to wait for each other
  3. The sync could be provided by another process and already be running

One more question is, as we can see form the light blue parts in sheet 0908, the nodes which have been configured are reset again from the first node(node 3 in this example).

Instead of the colour can you list the timestamp or line number, please? It is a long document..

I am still knowing nothing where the sync defined in the program, I am learning. Could you give me some advice if You get an answer.

What do you mean by "defined"? The sync rate is specified in the sync/interval_ms parameter. And the behaviour is implemented in the sync layer (http://wiki.ros.org/canopen_master#Master_plug-ins)

mathias-luedtke commented 8 years ago

Back to your init problem: The nodes were initialized properly. The init service should have responded with success=True. However, the CAN layer stopped right after the drive mode was switched.

A txqueuelen of 20 should work for you. Please double check the value with ifconfig or ip link.

The CAN trace looks fine so far. Did it really end at the last message? Can you create a CAN dump on your ROS PC (candump -l can0)?

Zunchao commented 8 years ago

@ipa-mdl Thanks for your quick reply, give me some time I'll check and study the answers one by one.

Zunchao commented 8 years ago

@ipa-mdl According to Your advice, I make the queue length to 20, and use the latest indigo-version package of ros_canopen. The test results seems no obvious differences compared with that of txqueuelen 10. As before, the first try fails after restart the arm, the init status shows:

...
Loaded joint_state_controller
[ INFO] [1446625861.564122127]: joint_group_position_controller loaded
Started ['joint_state_controller'] successfully
[ INFO] [1446625861.584114192]: joint_group_interpol_position_controller loaded
[ INFO] [1446625861.634363844]: joint_group_velocity_controller loaded
[ INFO] [1446625861.704202641]: Switched Controllers. From no_stop_controller_defined to joint_trajectory_controller
[arm/arm_controller_spawner-3] process has finished cleanly
log file: /home/zheng/.ros/log/1c70c6ac-8237-11e5-9ff9-901b0e5c82f2 arm-arm_controller_spawner-3*.log
[ INFO] [1446625861.726525174]: advertised as /joint_states
error: 4
[ INFO] [1446625861.936730458]: Current state: 2 device error: system:0 internal_error: 4 (controller problems;)
[ INFO] [1446625861.936758030]: Current state: 1 device error: system:0 internal_error: 4 (controller problems;)
ID: 4
[ERROR] [1446625861.943820703]: CAN not ready; CAN not ready
Did not receivce a response message
Did not receivce a response message
...

and the init service responds:

success: True message: ''

It seems a break after success init, the candump file is candump-2015-11-04_092135.log. The thing is when you relaunch the robot node, it works well. But if you restart the arm, same results: first init fails, after that no errors.

I am still not clearly understanding the mechanism of the initialization. For a clearer view, I postponed(deleted) the sync and PDO parts, data as in candump-2015-11-04_093045.log. For example, from Line 475 to Line 499, it's a switch state process, the former node (node 7) goes to 'operational', then a two-stage follows (stage one: Line 476 ~ Line 483; stage two: Line 484 ~ Line 499), the first stage finishes the former five operational nodes? The second stage seems to start the bootup of the next node and reset the former five nodes? Is that right? If it is, can we bootup the new node directly, without reset the former nodes? Because it seems the messages in the 1st stage duplicated again in the 2nd stage.

-----------The two log files emailed.---------

as suggested, I attached both the files in gist.github candump-2015-11-04_092135.log shows data with error. (https://gist.github.com/Zunchao/8d03cb1a1a1e2f6fbdfe#file-candump-2015-11-04_092135) candump-2015-11-04_093045.log shows data with deleted parts of sync and PDO. (https://gist.github.com/Zunchao/b501bbda4af8facdaeda#file-candump-2015-11-04_093045)

Zunchao commented 8 years ago

@ipa-mdl one more question: when you run _rosrun canopen_testutils prepare.sh, the status of can0 is:

31: can0: < NOARP,UP,LOWER_UP,ECHO > mtu 16 qdisc pfifo_fast state UNKNOWN mode DEFAULT group default qlen 20

why is the state UNKNOWN but not UP?

mathias-luedtke commented 8 years ago

why is the state UNKNOWN but not UP?

This relates to the kernel driver. UNKNOWN seems to be okay..

The thing is when you relaunch the robot node, it works well. But if you restart the arm, same results: first init fails, after that no errors.

Can explain this a little more? Do you restart the arm while the driver is running?

I am still not clearly understanding the mechanism of the initialization. For a clearer view, I postponed(deleted) the sync and PDO parts, data as in candump-2015-11-04_093045.log. For example, from Line 475 to Line 499, it's a switch state process, the former node (node 7) goes to 'operational', then a two-stage follows (stage one: Line 476 ~ Line 483; stage two: Line 484 ~ Line 499), the first stage finishes the former five operational nodes? The second stage seems to start the bootup of the next node and reset the former five nodes? Is that right? If it is, can we bootup the new node directly, without reset the former nodes? Because it seems the messages in the 1st stage duplicated again in the 2nd stage.

I guess you mix some things up. Lines 475 too 499are fine. (1446625856.848025) can0 000#0107 starts node 7

(1446625857.765490) can0 603#4001100000000000
(1446625857.766446) can0 583#4F01100000000000

Does not belong to the initialization. This is part of the asynchonous diagnostics job that polls the error register of all initialized nodes.

(1446625858.848137) can0 607#2F03100000000000 # SDO write
(1446625858.848290) can0 603#4001100000000000 
(1446625858.849317) can0 087#0000000000000000 # error reset
(1446625858.849593) can0 583#4F01100000000000
(1446625858.849765) can0 604#4001100000000000
(1446625858.849836) can0 587#6003100000000000 # SDO ack

Resets the error counter for node 7.

mathias-luedtke commented 8 years ago

And please stop sending attachment via e-mail, use https://gist.github.com/ instead. That way other can keep track of the conversation as well.

Zunchao commented 8 years ago

@ipa-mdl Thanks for your patient reply.

The thing is when you relaunch the robot node, it works well. But if you restart the arm, same results: first init fails, after that no errors.

Can explain this a little more? Do you restart the arm while the driver is running?

yes, I restart the arm while the driver is running: what I did is, i turned off all the power switches of the manipulator and then turned on, leaving the pcan usb plugged into the computer, then carried on the next test(run prepare.sh and launch robot.launch). Usually the first time you launch the robot node, errors like “CAN not ready”. Then stop the robot node, run the prepare.sh and relaunch the robot node agian, it will work well without errors. After that, you stop and start the robot node repeatedly, all works well, until you restart the manipulator again.

mathias-luedtke commented 8 years ago

Please try to run prepare.sh before you power on the robot. The robot sends boot-up messages at start-up and it might lead to problems if these are not read from the bus.

It addition I recommend the network device in the config (http://wiki.ros.org/socketcan_interface#Initialize_NIC) and to run prepare.sh only if you really need to restart the kernel driver for the device.

Zunchao commented 8 years ago

@ipa-mdl I tested it with your suggestions, still the same performance. Actually it is stable enough as long as initializing success, but only the first running... My interfaces config file is:

interfaces(5) file used by ifup(8) and ifdown(8)

auto lo iface lo inet loopback

allow-hotplug can0 iface can0 can static

 bitrate 500000
 up ip link set can0 txqueuelen 20

Furthermore, could you tell me what does the 'controller problems' in the responds mean? Is it a problem of hardware? Or?

...
error: 4
[ INFO] [1446716523.397990181]: Current state: 2 device error: system:0 internal_error: 4 (controller problems;)
[ INFO] [1446716523.398019481]: Current state: 1 device error: system:0 internal_error: 4 (controller problems;)
ID: 4
...
mathias-luedtke commented 8 years ago

I tested it with your suggestions, still the same performance

I will give it a try on our LWA4D as soon as I find some time.

Furthermore, could you tell me what does the 'controller problems' in the responds mean? Is it a problem of hardware? Or?

It means that socketcan_interface received an error frame.. The full message is not printed, but you can read it with candump. https://github.com/torvalds/linux/blob/master/include/uapi/linux/can/error.h#L52

mathias-luedtke commented 8 years ago

I was not able to reproduce your error. I have run tests on a LWA4D and and single powerball.

My test procedure was:

  1. robot is off
  2. connect all CAN cables
  3. plug CAN dongle in
  4. start CAN netdev (runs automatically with the networks config above)
  5. power on robot
  6. roslaunch schunk_lwa4p robot.launch
  7. rosservice call /arm/driver/init

(4. and 5. could be switched.)

Zunchao commented 8 years ago

Thanks very much for your time, I have learn more and will test more on it.

mathias-luedtke commented 8 years ago

I faced the same difficulties with our arm now.. Somebody has changed the current limit on our power supply.

Please make sure that you have a power supply that outputs up to 12A (according to LWA4P manual/datasheet). Or use two separate supplies for logic and motors.

Zunchao commented 8 years ago

Sorry to feedback late, the output of our power supply is DC24V/20A, I think it satisfies completely.

ammarnajjar commented 8 years ago

Hello, I followed the steps mentioned in ipa-mdl comment, but I still could not be able to initialize my schunk lwa4p powerball robot arm, for I always get the error:

error: 2
[ INFO] [1464604147.694003751]: Current state: 2 device error: system:0 internal_error: 2 (lost arbitration;)
[ INFO] [1464604147.694023701]: Current state: 1 device error: system:0 internal_error: 2 (lost arbitration;)
ID: 2
error: 2
ID: 2
[ERROR] [1464604147.703153803]: CAN not ready
Did not receive a response message

Here is the script that I use to prepare ros-indigo under ubuntu 14.04 and can0 configs. Here are the logs. Any hint or advice toward solving the issue would be appreciated. @ipa-mdl

mathias-luedtke commented 8 years ago

@ammarnajjar: I don't think that your error is related to this issue.

internal_error: 2 (lost arbitration;)

Lost arbitration should almost never happen. Please double check your cabling and your end resistors.

Your scipt and logs look fine until 606b is read from module 3. You might get more detail if you run candump -l can0,0:0,#FFFFFFFF. This will log all error messages and thier payload data.

Do you know which firmware your LWA4P is running on?

PS: +1 for all the well-prepared logs and the great script :)

changzhengwu commented 6 years ago

Dear sir My manipultor is schunk lwa4d, and my can device is pcan-usb x6, My steps are:

  1. power on
  2. connect can device and start the can driver
  3. roslaunch schunk_lwa4d robot .launch
  4. roslaunch schunjk_lwa4d dashbord.launch

the error like this

waitForService:Service [/arm/controller_manager/load_controller] has not been advertised, waiting... Initializing XXX

[ INFO] [1507948971.101843868]: Current state: 1 device error: system:0 internal_error: 0 (OK)

[ INFO] [1507948971.102038180]: Current state: 2 device error: system:0 internal_error: 0 (OK) [ INFO] [1507948971.102229033]: Current state: 2 device error: system:100 internal_error: 0 (OK) [ INFO] [1507948971.102358100]: Current state: 1 device error: system:100 internal_error: 0 (OK) [ERROR] [1507948971.112391534]: CAN not ready

what should i do to handle this error? @Zunchao @ipa-fxm @ipa-mdl @ammarnajjar @ipa-fmw

fmessmer commented 6 years ago

this issue is being warmed up over and over... there are sufficient answers for the original question all new problems/questions should go into a new issue