Analysis of FW of mc4plus to solve board's disappearance from the ETH network

marcoaccame commented 3 years ago

Following issue in here, we have identified a critical behaviour that should be investigated and solved: some mc4plus boards disappears from the ETH network when running yarprobotinterface.

What happens is that an mc4plus board which offers the motion control service may stop transmitting and receiving UDP frames. The board maintains PWM actuation and seems to be stuck: it does not respond to ping and is not seen by FirmwareUpdater.

We have observed the board 10.0.1.24 to have such a misbehavior but also the board 10.0.1.20.

A careful analysis is required using multiple investigation paths. Amongst them:

analysis of code differences in the motion control between mc4plus and ems in an attempt to spot bugs in the differences;
attempt to recreate the misbehavior on a simplified setup, with the debugger connected to the mc4plus board so that we can trace exact behaviour;
investigate if the misbehavior is possible also on a similar setup with the ems board;
investigate if the misbehavior is possible also w/ other services such as F/T streaming etc;
validate the behaviour / solutions found / tested on the simplified setup also on the robot.

cc: @S-Dafarra @DanielePucci @maggia80

marcoaccame commented 3 years ago

An idea for the simplified setup is:

a linux pc which communicates via ETH to an mc4plus / ems board by sending UDP frames which simulates the repeated activation / deactivation of the motion control service (and also others) in the same way as yarprobotinterface does.
an mc4plus / ems board with attached a ULINK Pro debugger.
presence of sensors and actuators possibly simulated.

marcoaccame commented 3 years ago

Here are some news: I can now systematically reproduce the bug on a dedicated setup.

I have prepared the simplified setup which is able to run 4 possible tests: either the FT service or the MC service on either the ems or the mc4plus.

For the FT service I have used a true FT sensor on CAN1 (the strain2), whereas for the MC service I have just emulated sensors and did not use any actuator. I emulated the AEA sensors, I did not put any 2foc boards on CAN (but I added the strain2 to avoid the overflow of the FIFO of CAN frames) nor I attached any DC motor.

The Linux PC sends a series of commands via UDP which every 5 seconds trigger the cycle of {eomn_serv_operation_verifyactivate, eomn_serv_operation_start, eomn_serv_operation_stop, eomn_serv_operation_deactivate}.

This is the standard sequence of commands which yarprobotinferface send when it is started and when it is closed. I have omitted the configuration commands for the given service (PIDs, control mode, FT streaming rate etc.) just to simplify as most as possible. But not too much.

Here are the results of the first run of tests.

board	service	notes	result	comment
`ems`	`FT`	used the true FT sensor on CAN	OK	it replies to ping even after 100 cycles
`mc4plus`	`FT`	used the true FT sensor on CAN	OK	it replies to ping even after 25 cycles
`ems`	`MC`	used emulated AEA and no 2foc	KO	it does not reply to ping after a few cycles.
`mc4plus`	`MC`	used emulated AEA and no DC motors	KO	it does not reply to ping after a few cycles.

Question: could the lack of sensors and actuator cause the problem? Answer: I don't think so, but ... I will carefully check.

Now I can start a deeper analysis of what happens. For this I can use the debugger which I have fitted to each board.

marcoaccame commented 3 years ago

Yesterday, I did some further tests w/ different optimization levels.

With a reduced optimization level I spotted a diagnostic message over the trace port which tells about a failure of memory allocation. I will now look at possible causes.

marcoaccame commented 3 years ago

I think that I found the problem. There is surely memory erosion due to reallocation of RAM at each start of the iteration (and hence at every relaunch of yarprobotinterface). Details will follow in a separate comment.

I am now extensively testing a solution which has already proven to successfully run 100+ cycles of MC service on the mc4plus board of my setup.

My goal is to issue a mc4plus.releasecandidate.hex on Monday / Tuesday, so that it can be tested also on the robot.

S-Dafarra commented 3 years ago

I think that I found the problem. There is surely memory erosion due to reallocation of RAM at each start of the iteration (and hence at every relaunch of yarprobotinterface). Details will follow in a separate comment.

I am now extensively testing a solution which has already proven to successfully run 100+ cycles of MC service on the mc4plus board of my setup.

My goal is to issue a mc4plus.releasecandidate.hex on Monday / Tuesday, so that it can be tested also on the robot.

Is it something that may happen also without restarting the yarprobotinterface? I think that usually we run the yarprobotinterface only 3 or 4 times without rebooting also the motors.

That's great news anyways :muscle:

marcoaccame commented 3 years ago

Is it something that may happen also without restarting the yarprobotinterface? I think that usually we run the yarprobotinterface only 3 or 4 times without rebooting also the motors.

it may. In https://github.com/robotology/icub-firmware/issues/174#issuecomment-789731108 I wrote: it does not reply to ping after a few cycles. I don't exactly recall now how many cycles they were, but 3 / 4 seems possible to me.

marcoaccame commented 3 years ago

think that usually we run the yarprobotinterface only 3 or 4 times without rebooting also the motors.

It would be interesting to see what happens if you manually restart yarprobotinterface several times w/out shutting motors' power off in between. Let's say: 10 times in a row. Does the problem happens? Can anybody test that next week?

S-Dafarra commented 3 years ago

think that usually we run the yarprobotinterface only 3 or 4 times without rebooting also the motors.

It would be interesting to see what happens if you manually restart yarprobotinterface several times w/out shutting motors' power off in between. Let's say: 10 times in a row. Does the problem happens? Can anybody test that next week?

I can give it a try

S-Dafarra commented 3 years ago

I just tried running the yarprobotinterface many times in a row, and the limit I reached was 5. At the 5th reboot the robot did not start with the error

theNVmanager::Impl::wait() had a timeout for BOARD torso-eb5-j0_2 IP 10.0.1.5 and nv ID32 = 0x00030001 -> IND = 0, TAG = eoprot_tag_mn_service_status_commandresult

I don't know if it is related. I tried to reboot the motors, run it for 4 times, and the 5th again the same error. Here a screenshot of the FirmwareUpdater

Here the log: log_icub-head_yarprobotinterface_5224_5thtime.txt

Here the application file I used: https://github.com/dic-iit/robots-configuration/blob/1849cb4b3fe4d433344242e48c3675729f92dc99/iCubGenova04/icub_test.xml

marcoaccame commented 3 years ago

I just tried running the yarprobotinterface many times in a row, and the limit I reached was 5. At the 5th reboot the robot did not start with the error

Very good. It is as I expected. From FirmwareUpdater I can see many boards missing, not only the 10.0.1.5, which is an ems.

marcoaccame commented 3 years ago

Hi @S-Dafarra,

I have produced two binaries for the ems and the mc4plus which I would like to test on iCubGenova04:

ems-test-v103.36.zip mc4plus-test-v103.29.zip

They are compatible w/ the latest devel / master release.

I have tested the solutions added to these binaries w/ 100 cycles of mc service and the boards can still be pinged.

S-Dafarra commented 3 years ago

Hi @S-Dafarra,

I have produced two binaries for the ems and the mc4plus which I would like to test on iCubGenova04:

ems-test-v103.36.zip mc4plus-test-v103.29.zip

They are compatible w/ the latest devel / master release.

I have tested the solutions added to these binaries w/ 100 cycles of mc service and the boards can still be pinged.

Hi @marcoaccame! That's great! Unfortunately I am not in the lab at the moment, but the robot should be free if you want to test it. Otherwise, I will fash it in the following days!

marcoaccame commented 3 years ago

Hi @marcoaccame! That's great! Unfortunately I am not in the lab at the moment, but the robot should be free if you want to test it. Otherwise, I will fash it in the following days!

@S-Dafarra, I will be in the lab next week. In the meantime I will document the changes.

S-Dafarra commented 3 years ago

Hi @marcoaccame! That's great! Unfortunately I am not in the lab at the moment, but the robot should be free if you want to test it. Otherwise, I will fash it in the following days!

@S-Dafarra, I will be in the lab next week. In the meantime I will document the changes.

Sounds good. Indeed, since the problem on the wrists might be difficult to check, I think it would be definitely worthwhile to proceed anyway. Then, with time we can check if the wrist problem appears again.

marcoaccame commented 3 years ago

Sounds good. Indeed, since the problem on the wrists might be difficult to check, I think it would be definitely worthwhile to proceed anyway. Then, with time we can check if the wrist problem appears again

I am sure that the changes I have done solve some problems (the ones verified on my setup) and we can surely add into robotology. However, before approving the PR it is our habit to test it on the robot. I will have a go at it next week. Maybe we can do it together so that we test in the same way the robot is effectively used.

marcoaccame commented 3 years ago

Hi, here is the code used to compile the two test binaries of https://github.com/robotology/icub-firmware/issues/174#issuecomment-795451305

I will produce a PR into robotology that will be merged after the tests on the robot.

marcoaccame commented 3 years ago

What happened?

multiple starts and stops of yarprobotinterface asking the MC service to the boards caused a fatal error in the mc4plus and ems boards for multiple reasons:
- on the mc4plus because: function hal_quadencoder_init_indexes_flags() was called without checking if the associated IRQ Handler was inactive (as it must be) before configuration of HW peripherals which the IRQ handler uses.
- on the ems because: the pair of functions hal_spiencoder_init() and hal_spiencoder_deinit() did not manage RAM correctly and the hal_spiencoder_init() consumed memory at each start of yarpbotinterface. So, after a few start / stop cycles the board detected end of available RAM.

I have also spotted a memory leak in the EOCurrentsWatchdog module used by the mc4plus board, which is however crashing first due to hal_quadencoder_init_indexes_flags().

marcoaccame commented 3 years ago

Results of the tests of iCubGenova04: OK

Did extensive tests on iCubGenova04 during the the whole day of 19 mar 2021. Used binaries for ems and mc4plus complied w/ the option SPIENC_DEINIT_DEALLOCATE_HEAP enabled.

Launched yarprobotinterface --config icub_wbd.xml and then stopped it for > 20 times. Left the robot running at least a couple of times for > 30 minutes.

The boards did not disappear.

The current release of master instead caused the board to disappear after 2 or 3 srat / stop cycles.

The used binaries work fine.

marcoaccame commented 3 years ago

What now: rebased the fork in https://github.com/marcoaccame/icub-firmware/tree/fix/multiplestartofmc vs latest release of https://github.com/robotology/icub-firmware/tree/devel which has evolved so far and updated the application versions of ems, mc4plus, mc2plus for the PR.

It may be worth a new quick test on the robot.

pattacini commented 3 years ago

Nice @marcoaccame 👍🏻 Consider that we're quickly approaching a deadline for the new distro 2021.02.feat-01.

cc @Nicogene

marcoaccame commented 3 years ago

It may be worth a new quick test on the robot.

Tested once more on 24 mar 2021 after the rebase of icub-firmware, rebuilt the binaries to be sure the binaries in here were correct

The tests were OK, hence I merged the two PRs: https://github.com/robotology/icub-firmware-build/pull/25 and https://github.com/robotology/icub-firmware/pull/176

cc: @pattacini @maggia80 @S-Dafarra

pattacini commented 3 years ago

Thanks @marcoaccame !

Closing, feel free to open it up again if needed.

marcoaccame commented 3 years ago

reopened following https://github.com/robotology/icub-tech-support/issues/673#issuecomment-811814586

marcoaccame commented 3 years ago

Tested a mechanism which does the following:

whenever the mc4plus enters in the handler of a fatal error, it writes some info in a log area placed in permanent RAM and then forces a restart
if the mc4plus at startup finds any info in a log area of its permanent RAM then it sends a diagnostics message over ETH.

The log area cannnot hold more than 15 byte, so I will log:

time since bootstratp in milliseconds
id of the handler function (s_osal_cfg_on_fatal_error(), s_ipal_cfg_on_fatal_error(), s_hal_core_cfg_on_fatalerror(), various system HW handlers such as HardFault_Handler() etc, ...)
error code inside each handler (such as osal_error_stackoverflow, osal_error_missingmemory, etc.)
if the caller is an IRQ Handler or a thread
ID of the last scheduled thread

I did a basic test for this mechanism, now I need to fill all the fatal error handlers and test some cases.

Then, I will release a new version for the mc4plus application which can hopefully tell us more about what causes the problem.

marcoaccame commented 3 years ago

tested w/ code such as this:

typedef struct
{
    uint32_t millisecondsfromstart;
    uint8_t handlertype;
    uint8_t handlererrorcode;
    uint8_t calledbyanirqhandler;
    uint8_t idofthelastscheduledthread;
    uint8_t forfutureuse0;
    uint8_t forfutureuse1;
    uint16_t signature;
} fatal_error_message_t;    EO_VERIFYsizeof(fatal_error_message_t, 12)

typedef struct
{
    uint64_t                par64;
    uint16_t                par16;
} fatal_error_params_t;     EO_VERIFYsizeof(fatal_error_params_t, 16)

typedef union 
{
    fatal_error_message_t   message;
    fatal_error_params_t    params;    
} fatal_error_t; EO_VERIFYsizeof(fatal_error_t, 16)

Code Listing: Data structures exchanged across a restart of the mc4plus board.


    #include "eEsharedServices.h"

    static volatile fatal_error_t detectedfatalerror = {0};
    static uint8_t detectedsize = 0;
    // read ipc memory
    if(ee_res_OK == ee_sharserv_ipc_userdefdata_get((uint8_t*)&detectedfatalerror, &detectedsize, sizeof(fatal_error_t)))
    {
        // there is something. 
        // i clear ipc ram
        ee_sharserv_ipc_userdefdata_clr();        
        if((detectedsize <= sharserv_base_ipc_userdefdata_maxsize) && (0x1234 == detectedfatalerror.message.signature))
        {
            // it is of the correct size. i use it to send a diagnostic message
            uint16_t par16 = detectedfatalerror.params.par16;
            uint64_t par64 = detectedfatalerror.params.par64; 

            eOerrmanDescriptor_t errdes = {0};

            errdes.code             = eoerror_code_get(eoerror_category_Debug, eoerror_value_DEB_tag06);
            errdes.sourcedevice     = eo_errman_sourcedevice_localboard;
            errdes.sourceaddress    = 0;
            errdes.par16            = par16;
            errdes.par64            = par64;
            eo_errman_Error(eo_errman_GetHandle(), eo_errortype_error, "RESTARTED after FATAL error", NULL, &errdes);
        }                        
    }

Code Listing: Code executed by the mc4plus board at startup.. If a message in IPC memory is found then a diagnostic message is sent to yarprobotinferface.

    // in here there is just a test of emission of ipc data w/ a restart

    static fatal_error_t tobewritten = {0}; 

    tobewritten.message.millisecondsfromstart = osal_system_ticks_abstime_get() / 1000; // better using another way so that we avoid calling an svc
    tobewritten.message.handlertype = 1;
    tobewritten.message.handlererrorcode = 3;
    tobewritten.message.calledbyanirqhandler = 0;
    tobewritten.message.idofthelastscheduledthread = 9;
    tobewritten.message.forfutureuse0 = 1;
    tobewritten.message.forfutureuse1 = 2;
    tobewritten.message.signature = 0x1234;

    // write in ipc memory and ... restart
    ee_sharserv_ipc_userdefdata_set((uint8_t*)&tobewritten, sizeof(tobewritten));
    ee_sharserv_sys_restart();

Code Listing: Code to be included in the fatal error handlers. It write some info in the IPC memory and then forces a restart.

This code was tested on an mc4plus board. The effect is that the mc4plus sends a message such as this:

[ERROR] (EO? tsk2 @S5:m632:u723)-> {0x4000006 p16 0x0201, p64 0x0900030100001600, dev 0, adr 0}: 
DEBUG: tag06. INFO = RESTARTED after FATAL error

Sadly, yarprobotinterface will not print (yet) in human readable format, but it should be easy to decode p64 which contains in its lower 4 bytes the millisecondsfromstart (0x00001600 = 5sec 632milli), then the handlertype (0x01), the handlererrorcode (0x03) etc.

The IPC messaging was designed years ago to carry up to 15 bytes. So far we use only 8 bytes, so we can maybe add some more without changing the IPC implementation. To extend to more bytes is surely feasible but it needs to reflash eLoader, eUpdater, eApplication.

marcoaccame commented 3 years ago

We have just successfully tested a new FW for the mc4plus board which is able to send diganostics infor to yarprobotinetrfce in case of fatal errors.

In such a case, it forces a restart and prints info such as in the following (where we artificially forced a stack overflow from thread runDO after 20 sec of start of teh board)

[**INFO**]  from BOARD 10.0.1.1 (l-hv3-hand), src LOCAL, adr 0, time 1s 953m 401u: 
(code 0x0000003b, par16 0x0000 par64 0x0000000000000000) -> SYS: the board is bootstrapping + . 

[**ERROR**]  from BOARD 10.0.1.1 (l-hv3-hand), src LOCAL, adr 0, time 1s 955m 7u: 
(code 0x04000000, par16 0x0000 par64 0x0b0be50300004e20) -> DEBUG: tag00 + RESTARTED after FATAL error 

[**ERROR**]  from BOARD 10.0.1.1 (l-hv3-hand), src LOCAL, adr 0, time 1s 955m 117u: 
(code 0x04000000, par16 0x0000 par64 0x0b0be50300004e20) -> DEBUG: tag00 + @ 20000 ms 

[**ERROR**]  from BOARD 10.0.1.1 (l-hv3-hand), src LOCAL, adr 0, time 1s 955m 235u: 
(code 0x04000000, par16 0x0000 par64 0x0b0be50300004e20) -> DEBUG: tag00 + handler OSAL, code 0xe5 

[**ERROR**]  from BOARD 10.0.1.1 (l-hv3-hand), src LOCAL, adr 0, time 1s 955m 348u: 
(code 0x04000000, par16 0x0000 par64 0x0b0be50300004e20) -> DEBUG: tag00 + type osal_stackovf 

[**ERROR**]  from BOARD 10.0.1.1 (l-hv3-hand), src LOCAL, adr 0, time 1s 955m 467u: 
(code 0x04000000, par16 0x0000 par64 0x0b0be50300004e20) -> DEBUG: tag00 + IRQHan SVCall Thread runDO 

[**ERROR**]  from BOARD 10.0.1.1 (l-hv3-hand), src LOCAL, adr 0, time 1s 955m 581u: 
(code 0x04000000, par16 0x0000 par64 0x0b0be50300004e20) -> DEBUG: tag00 + ipsr 11, tid 11

List. Board 10.0.1.1 has detected a fatal error (first message of type DEBUG: tag00 w/ string RESTARTED after FATAL error) at its execution time 20 sec (second message with string 20000 ms). The third and fourth message tell that the error was caused by the OSAL handler and is due to stack overflow (see string ype osal_stackovf). The handler was called by the IRQHandler SVCall which is the one which does thread switching and the error was caused by last scheduled thread called runDO (it is the one which ticks all teh services at 1 kHz.

In case of no error, teh FW works just fine as the latest devel version. Test were don today on iCubGenova09.

marcoaccame commented 3 years ago

I shall soon produce a PR.

GiulioRomualdi commented 3 years ago

Hi @marcoaccame these are two logs containing the failures you're investigating in this issue:

this happens in the left pronosup
while this for the pitch and roll joints of the head

pattacini commented 3 years ago

Cool, thanks @GiulioRomualdi for reporting on this. @marcoaccame is off till the last week of August. We will then get back to this those days.

DanielePucci commented 3 years ago

See https://github.com/robotology/icub-tech-support/issues/673#issuecomment-900634451 for the looming outdoor demos that would strongly benefit from a definite solution to this issue

marcoaccame commented 3 years ago

Hi @marcoaccame these are two logs containing the failures you're investigating in this issue:

this happens in the left pronosup

while this for the pitch and roll joints of the head

Related issue: robotology/icub-tech-support#673

Hi @GiulioRomualdi, the logs are obtained w/ a FW version of the mc4plus which is older than the one described in here, so the logs don't contain any useful (new) information.

so, now we need to:

ensure that the mc4plus fw is the latest version of devel (https://github.com/robotology/icub-firmware-build/pull/32 or newer) which contains improved diagnostics as described by this PR.
wait for another hard fault to happen

traversaro commented 3 years ago

* ensure that the mc4plus fw is the latest version of devel

Just for general alignment, that version of the firmware was release as part of v2021.08.0 disto, so just using icub-firmware-build v1.21.0 is probably enough.

robotology / icub-firmware

Analysis of FW of mc4plus to solve board's disappearance from the ETH network #174