robotology / icub-tech-support

Virtual repository that provides support requests for individual robots
GNU General Public License v2.0

iCubWaterloo01 S/N:044 – Problem with joints control board, Left_arm Cartesian interface stops working #1761

Closed paliasgh closed 1 month ago

paliasgh commented 4 months ago

Robot Name 🤖

iCubWaterloo01 S/N:044

Request/Failure description

Hello,

As suggested, I am moving the discussion about the Cartesian solver failures here.

As originally mentioned, the robot often freezes at random and we need to restart the Cartesian interfaces. Errors like these appear in the iKinCartesianSolver of left_arm:

77.315728   WARNING         cartesianSolver/left_arm: timeout detected on part left_arm
77.335835   WARNING         cartesianSolver/left_arm: timeout detected on part left_arm
77.355944   WARNING         cartesianSolver/left_arm: timeout detected on part left_arm
77.376089   WARNING         cartesianSolver/left_arm: timeout detected on part left_arm
77.678015   INFO            cartesianSolver/left_arm suspende

At the same time, yarprobotinterface starts publishing these messages/errors:

75.495072   DEBUG           eth::EthMonitorPresence: BOARD 10.0.1.3 (right_arm-eb3-j0_3) has been silent for 0.02492 sec (its timeout is 0.02 sec
75.495103   DEBUG           eth::EthMonitorPresence: BOARD 10.0.1.5 (torso-eb5-j0_2) has been silent for 0.020252 sec (its timeout is 0.02 sec
75.495111   DEBUG           eth::EthMonitorPresence: BOARD 10.0.1.6 (left_leg-eb6-j0_3) has been silent for 0.0249259 sec (its timeout is 0.02 sec
75.495118   DEBUG           eth::EthMonitorPresence: BOARD 10.0.1.7 (left_leg-eb7-j4_5) has been silent for 0.0201583 sec (its timeout is 0.02 sec
75.495124   DEBUG           eth::EthMonitorPresence: BOARD 10.0.1.8 (right_leg-eb8-j0_3) has been silent for 0.0202615 sec (its timeout is 0.02 sec
75.495133   DEBUG           eth::EthMonitorPresence: BOARD 10.0.1.9 (right_leg-eb9-j4_5) has been silent for 0.0202682 sec (its timeout is 0.02 sec
75.495140   DEBUG           eth::EthMonitorPresence: BOARD 10.0.1.10 (left_leg-eb10-skin) has been silent for 0.0203071 sec (its timeout is 0.02 sec
75.495146   DEBUG           eth::EthMonitorPresence: BOARD 10.0.1.26 (left_arm-eb26-j12_15) has been silent for 0.0249617 sec (its timeout is 0.02 sec
75.495152   DEBUG           eth::EthMonitorPresence: BOARD 10.0.1.27 (right_arm-eb27-j4_7) has been silent for 0.0201998 sec (its timeout is 0.02 sec
75.495158   DEBUG           eth::EthMonitorPresence: BOARD 10.0.1.28 (right_arm-eb28-j8_11) has been silent for 0.0203078 sec (its timeout is 0.02 sec
75.495164   DEBUG           eth::EthMonitorPresence: BOARD 10.0.1.29 (right_arm-eb29-j12_15) has been silent for 0.0203502 sec (its timeout is 0.02 sec
75.515187   ERROR           from BOARD 10.0.1.25 (left_arm-eb25-j8_11) time=5742s 404m 921u :  ETH monitor: link goes down  in port ETH output (P3/P12/J5). Application state is unknown
75.525241   DEBUG           from BOARD 10.0.1.1 (left_arm-eb1-j0_3) time=5742s 177m 937u :  src CAN1, adr 1,(code 0x04000002, par16 0x8119 par64 0x0000000000000000) -> DEBUG: tag02  
75.545349   DEBUG           from BOARD 10.0.1.1 (left_arm-eb1-j0_3) time=5742s 194m 967u :  src CAN1, adr 4,(code 0x04000002, par16 0x8149 par64 0x0000000000000000) -> DEBUG: tag02  
75.796771   DEBUG           from BOARD 10.0.1.1 (left_arm-eb1-j0_3) time=5742s 445m 954u :  src CAN1, adr 2,(code 0x04000002, par16 0x8129 par64 0x0000000000000000) -> DEBUG: tag02  
76.088532   DEBUG           from BOARD 10.0.1.1 (left_arm-eb1-j0_3) time=5742s 738m 955u :  src CAN1, adr 3,(code 0x04000002, par16 0x8139 par64 0x0000000000000000) -> DEBUG: tag02  
76.128797   DEBUG           from BOARD 10.0.1.1 (left_arm-eb1-j0_3) time=5742s 777m 943u :  src CAN1, adr 1,(code 0x04000002, par16 0x8119 par64 0x0000000000000000) -> DEBUG: tag02  
76.440660   DEBUG           from BOARD 10.0.1.1 (left_arm-eb1-j0_3) time=5743s 94m 971u :  src CAN1, adr 4,(code 0x04000002, par16 0x8149 par64 0x0000000000000000) -> DEBUG: tag02  
76.470832   ERROR           Encoder Timestamps are not consistent! Data will not be published
76.480885   ERROR           Encoder Timestamps are not consistent! Data will not be published
76.490939   ERROR           Encoder Timestamps are not consistent! Data will not be published
76.500991   DEBUG           from BOARD 10.0.1.1 (left_arm-eb1-j0_3) time=5743s 145m 949u :  src CAN1, adr 2,(code 0x04000002, par16 0x8129 par64 0x0000000000000000) -> DEBUG: tag02  
76.501007   ERROR           Encoder Timestamps are not consistent! Data will not be published
76.511046   ERROR           Encoder Timestamps are not consistent! Data will not be published
76.521100   ERROR           Encoder Timestamps are not consistent! Data will not be published
76.531133   ERROR           Encoder Timestamps are not consistent! Data will not be published
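
For anyone triaging a similar log, the excerpt above can be summarised with a short script. This is only a sketch: the regular expressions are written against the lines quoted in this issue, the function name `summarize` is made up for illustration, and a full yarprunlog file may prefix or format its lines differently.

```python
import re

# Patterns follow the log lines quoted in this issue; other versions of
# yarprobotinterface may format these messages differently (assumption).
SILENT_RE = re.compile(
    r"^(?P<t>\d+\.\d+)\s+DEBUG\s+eth::EthMonitorPresence: BOARD (?P<ip>\S+) "
    r"\((?P<name>[^)]+)\) has been silent for (?P<silent>[\d.]+) sec"
)
LINK_DOWN_RE = re.compile(
    r"^(?P<t>\d+\.\d+)\s+ERROR\s+from BOARD (?P<ip>\S+) "
    r"\((?P<name>[^)]+)\).*link goes down"
)


def summarize(lines):
    """Return (silent, link_down) extracted from an iterable of log lines."""
    silent = {}     # board name -> longest reported silence, in seconds
    link_down = []  # (log time, board name), in order of appearance
    for line in lines:
        m = SILENT_RE.match(line)
        if m:
            name = m.group("name")
            silent[name] = max(silent.get(name, 0.0), float(m.group("silent")))
            continue
        m = LINK_DOWN_RE.match(line)
        if m:
            link_down.append((float(m.group("t")), m.group("name")))
    return silent, link_down


# Three lines copied from the excerpt above.
sample = [
    "75.495072   DEBUG           eth::EthMonitorPresence: BOARD 10.0.1.3 (right_arm-eb3-j0_3) has been silent for 0.02492 sec (its timeout is 0.02 sec",
    "75.495146   DEBUG           eth::EthMonitorPresence: BOARD 10.0.1.26 (left_arm-eb26-j12_15) has been silent for 0.0249617 sec (its timeout is 0.02 sec",
    "75.515187   ERROR           from BOARD 10.0.1.25 (left_arm-eb25-j8_11) time=5742s 404m 921u :  ETH monitor: link goes down  in port ETH output (P3/P12/J5). Application state is unknown",
]
silent, link_down = summarize(sample)
print(link_down[0])  # (75.515187, 'left_arm-eb25-j8_11')
```

The first entry of `link_down` names the first board that reported a dropped ETH link, which is a reasonable place to start looking.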

When I try to stop the modules in the startup application, the iKinCartesianSolver of right_arm and yarprobotinterface both fail to close properly, and I have to kill them.

Thanks for all the assistance.

Detailed context

Here is a new, complete log file of the startup application:

yarprunlog_29_02_2024_16_15_42.log

I ran everything on the icub-head computer to check whether the problem lies in the communication between the local host and icub-head. The problem was still there. The issue only happens in the left arm. Here is also what the system status looks like. There was a sudden drop in network usage right when the issue happened:

58AA58A2-0554-4C5B-9F8F-D4873DBEF88A_1_201_a

Additional context

No response

How does it affect you?

No response

paliasgh commented 3 months ago

Hello @AntonioConsilvio @sgiraz @pattacini , this is a gentle reminder about this issue. It is now happening almost every time we work with the robot, unfortunately. Thank you very much in advance.

paliasgh commented 2 months ago

Hello again @AntonioConsilvio @sgiraz @pattacini, I just wanted to again kindly remind you about our issue. Thanks a lot for your help.

AntonioConsilvio commented 2 months ago

Hi @paliasgh, during this period we had to focus on completing tasks with tighter deadlines. However, we have not forgotten about this issue, and we intend to schedule support for it in the coming period!

AntonioConsilvio commented 2 months ago

Hi @paliasgh, sorry for the late reply!

Unfortunately, from the information available I cannot yet provide a definite solution, so I need some more details. In the meantime, here are some tips that might be helpful:

Tips:

Looking at the nature of the problem, I can think of two probable causes:

1) The problem could be the Ethernet connection. Since the first board to complain about this problem seems to be the MC4-PLUS eb25 (10.0.1.25), I would check the ETH connection between the EB24 board and the EB25 board (cable name E24).

2) The problem could be the power supply of the EB25 board (10.0.1.25). If the board has a damaged power supply cable, this may cause the board to suddenly switch off and thus cause the ETH line to drop out.

To check the status of the eb25 board, you should remove the upperarm cover (if you don't know how, I can provide instructions) and find 3 MC4-PLUS boards (specifically eb24, eb25 and eb26). The ETH cable (E24) will be a very short wire connecting the EB24 (the one closest to the mechanics) and EB25 (the middle one):

mc4plussss

mc4plus

To check the power supply instead, inspect the yellow and black cable connected to the EB25 board:

power

I advise you to check the status of these two cables, and to check whether the eb25 board reboots or remains switched on when the problem occurs.
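
Besides watching the board itself, the log may give a hint about a reboot. The sketch below assumes that the `time=` field in the `from BOARD ...` lines is a board-local clock that restarts when the board reboots. That is an assumption, not something confirmed in this thread. The function name `find_clock_resets` and the `min_drop_s` threshold are made up for illustration.

```python
import re

# Matches the "from BOARD <ip> (<name>) time=<s>s <ms>m <us>u" part of the
# yarprobotinterface lines quoted in this issue.
BOARD_TIME_RE = re.compile(
    r"from BOARD (?P<ip>\S+) \((?P<name>[^)]+)\) "
    r"time=(?P<s>\d+)s (?P<ms>\d+)m (?P<us>\d+)u"
)


def find_clock_resets(lines, min_drop_s=1.0):
    """Report boards whose reported time jumps backwards by more than min_drop_s.

    ASSUMPTION: the time= field is a board-local clock that restarts on reboot.
    The threshold ignores small reorderings between consecutive messages.
    """
    last = {}    # board name -> last reported board time, in seconds
    resets = []  # (board name, time before, time after)
    for line in lines:
        m = BOARD_TIME_RE.search(line)
        if not m:
            continue
        t = int(m.group("s")) + int(m.group("ms")) / 1e3 + int(m.group("us")) / 1e6
        name = m.group("name")
        if name in last and last[name] - t > min_drop_s:
            resets.append((name, last[name], t))
        last[name] = t
    return resets


# Lines copied from the log excerpt in this issue.
sample = [
    "75.515187   ERROR           from BOARD 10.0.1.25 (left_arm-eb25-j8_11) time=5742s 404m 921u :  ETH monitor: link goes down  in port ETH output (P3/P12/J5). Application state is unknown",
    "75.525241   DEBUG           from BOARD 10.0.1.1 (left_arm-eb1-j0_3) time=5742s 177m 937u :  src CAN1, adr 1,(code 0x04000002, par16 0x8119 par64 0x0000000000000000) -> DEBUG: tag02",
    "76.128797   DEBUG           from BOARD 10.0.1.1 (left_arm-eb1-j0_3) time=5742s 777m 943u :  src CAN1, adr 1,(code 0x04000002, par16 0x8119 par64 0x0000000000000000) -> DEBUG: tag02",
    "76.440660   DEBUG           from BOARD 10.0.1.1 (left_arm-eb1-j0_3) time=5743s 94m 971u :  src CAN1, adr 4,(code 0x04000002, par16 0x8149 par64 0x0000000000000000) -> DEBUG: tag02",
]
print(find_clock_resets(sample))  # []
```

An empty result on a longer log would suggest the boards stayed powered, which would point away from the power cable and towards the ETH cable. Treat it as a hint only, since it rests on the assumption stated above.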

Useful information (in case the tips didn't work):

Please, send us feedback and if you have any questions feel free to ask!

pattacini commented 2 months ago

Hi @AntonioConsilvio

Let me add some insights regarding the following questions:

- I am not clear what physically happens to the robot when the error appears. Does the robot remain frozen and can no longer move from yarpmotorgui? And if so, does the whole robot remain stuck (including the torso and legs)?

- Does this problem only occur when the Cartesian is active?

The long comm timeout occurring on the YARP network of the right arm, during which we don't have fresh updates of the encoders, causes the Cartesian solver to pause. This is a safety countermeasure implemented by the SW as the system is clearly unstable. At that point, the solver can receive a message to restart its operations but the timeout will occur again.

Other pieces of SW (e.g., the yarpmotorgui) may not rely on the same safety protection and can stay operational.
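
To illustrate the behaviour described above, here is a minimal sketch of a staleness watchdog. It is not the actual Cartesian solver code: the class name `EncoderWatchdog`, its methods, and the timeout value are all hypothetical and only mirror the logic of "no fresh encoder data within a timeout, so suspend".

```python
class EncoderWatchdog:
    """Suspend a consumer when encoder feedback goes stale (illustrative only)."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s  # hypothetical value, not the solver's real one
        self.last_update = None     # time of the most recent encoder sample
        self.suspended = False

    def on_encoder_update(self, now):
        """Record that fresh encoder data arrived at time `now` (seconds)."""
        self.last_update = now

    def check(self, now):
        """Return True if the consumer may keep running at time `now`."""
        if self.last_update is not None and now - self.last_update > self.timeout_s:
            self.suspended = True
        return not self.suspended

    def resume(self, now):
        """Handle a restart request; it only sticks if the data is fresh again."""
        self.suspended = False
        return self.check(now)


wd = EncoderWatchdog(timeout_s=0.1)
wd.on_encoder_update(now=0.0)
print(wd.check(now=0.05))   # True: data is fresh
print(wd.check(now=0.5))    # False: data is stale, consumer suspends
print(wd.resume(now=0.6))   # False: restart requested, but data is still stale
wd.on_encoder_update(now=0.7)
print(wd.resume(now=0.75))  # True: fresh data again, restart succeeds
```

The third call shows why restarting the solver alone does not help: as long as the underlying communication problem persists, the timeout simply triggers again.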

paliasgh commented 2 months ago

Hello @AntonioConsilvio and @pattacini, thank you for your help!

I opened the left upperarm cover to see the boards. I am currently running the robot to see what happens when the Cartesian stops. However, it seems that it does not fail at all when the cover is off! The robot has been running for about an hour without any problem. In that case, is the most likely cause a problem with the cables when they are compressed? If so, I need to check them more carefully.

-- UPDATE: I tried pressing and moving those cables with a narrow plastic stick to see which one causes the problem. It seems the problem is cable E25, the one that connects eb25 to eb26. As it is not a power cable, the boards remain on when the failure happens and the errors appear in the log (no LED or anything else changes on the boards). Looking at E25, especially its connectors, things look acceptable but not great. Is there any other way to verify that the issue is E25 and not E24 before going ahead with replacing the cable? Or is replacing both E24 and E25 the best option? Any advice? Thanks!

Image (2) Image (1)

AntonioConsilvio commented 2 months ago

Thanks for the insight @pattacini!

Hi @paliasgh! I considered damage to the E24 cable more probable; however, I would not exclude the possibility that the E25 could also cause the problem.

There is no problem with double replacement (both E24 and E25).

However, if you want to find out which of the two cables is causing problems, there are some tests that you can do:

1) Check the ETH LED on the board (when the error appears). The MC4-PLUS has an LED behind the ETH connectors which, when switched on, confirms correct ETH communication:

![IMG_20240422_141511](https://github.com/robotology/icub-tech-support/assets/114915464/696c5c66-4ce4-4e6e-9c32-92da1e307417)

2) Test the continuity of the cable with the multimeter.

3) You can try testing the crimping of the cable with tweezers, as in this video:

https://github.com/robotology/icub-tech-support/assets/114915464/801adf61-1312-4a0f-99fa-3175fac4bf6a

You can try pulling the cable with tweezers, without tearing it, just to see if the crimp still holds.

Please, send us feedback and if you have any questions feel free to ask!

paliasgh commented 1 month ago

Hello @AntonioConsilvio @pattacini, thanks a lot for all your support!

Last week, we replaced both E24 and E25 with pre-assembled cables (we had to reverse the order of the wires in one head). Since then, the robot seems to be working without any major issue, with all the arm covers attached! The Cartesian interface failed only 3 times in more than 50 hours of work. That may be related to other things; I could not check the logs since it happened so rarely.

Checking the original cables, I found out that the issue was actually with the red wire of the E25 cable (see the video below):

https://github.com/robotology/icub-tech-support/assets/59978262/ddd0c937-dad6-48d9-aee5-7491c1a9d419

IMG_6335

Again, many thanks for your help!