Open marcoaccame opened 3 years ago
An idea for the simplified setup is:
yarprobotinterface
does.Here are some news: I can now systematically reproduce the bug on a dedicated setup.
I have prepared the simplified setup which is able to run 4 possible tests: either the FT service
or the MC service
on either the ems
or the mc4plus
.
For the FT service
I have used a true FT sensor on CAN1 (the strain2
), whereas for the MC service
I have just emulated sensors and did not use any actuator. I emulated the AEA sensors, I did not put any 2foc boards on CAN (but I added the strain2
to avoid the overflow of the FIFO of CAN frames) nor I attached any DC motor.
The Linux PC sends a series of commands via UDP which every 5 seconds trigger the cycle of {eomn_serv_operation_verifyactivate
, eomn_serv_operation_start
, eomn_serv_operation_stop
, eomn_serv_operation_deactivate
}.
This is the standard sequence of commands which yarprobotinferface
send when it is started and when it is closed. I have omitted the configuration commands for the given service (PIDs, control mode, FT streaming rate etc.) just to simplify as most as possible. But not too much.
Here are the results of the first run of tests.
board | service | notes | result | comment |
---|---|---|---|---|
ems |
FT |
used the true FT sensor on CAN | OK | it replies to ping even after 100 cycles |
mc4plus |
FT |
used the true FT sensor on CAN | OK | it replies to ping even after 25 cycles |
ems |
MC |
used emulated AEA and no 2foc | KO | it does not reply to ping after a few cycles. |
mc4plus |
MC |
used emulated AEA and no DC motors | KO | it does not reply to ping after a few cycles. |
Question: could the lack of sensors and actuator cause the problem? Answer: I don't think so, but ... I will carefully check.
Now I can start a deeper analysis of what happens. For this I can use the debugger which I have fitted to each board.
Yesterday, I did some further tests w/ different optimization levels.
With a reduced optimization level I spotted a diagnostic message over the trace port which tells about a failure of memory allocation. I will now look at possible causes.
I think that I found the problem. There is surely memory erosion due to reallocation of RAM at each start of the iteration (and hence at every relaunch of yarprobotinterface
). Details will follow in a separate comment.
I am now extensively testing a solution which has already proven to successfully run 100+ cycles of MC service
on the mc4plus
board of my setup.
My goal is to issue a mc4plus.releasecandidate.hex
on Monday / Tuesday, so that it can be tested also on the robot.
I think that I found the problem. There is surely memory erosion due to reallocation of RAM at each start of the iteration (and hence at every relaunch of
yarprobotinterface
). Details will follow in a separate comment.I am now extensively testing a solution which has already proven to successfully run 100+ cycles of
MC service
on themc4plus
board of my setup.My goal is to issue a
mc4plus.releasecandidate.hex
on Monday / Tuesday, so that it can be tested also on the robot.
Is it something that may happen also without restarting the yarprobotinterface
? I think that usually we run the yarprobotinterface
only 3 or 4 times without rebooting also the motors.
That's great news anyways :muscle:
Is it something that may happen also without restarting the yarprobotinterface? I think that usually we run the yarprobotinterface only 3 or 4 times without rebooting also the motors.
it may. In https://github.com/robotology/icub-firmware/issues/174#issuecomment-789731108 I wrote: it does not reply to ping after a few cycles
. I don't exactly recall now how many cycles they were, but 3 / 4 seems possible to me.
think that usually we run the yarprobotinterface only 3 or 4 times without rebooting also the motors.
It would be interesting to see what happens if you manually restart yarprobotinterface
several times w/out shutting motors' power off in between. Let's say: 10 times in a row. Does the problem happens? Can anybody test that next week?
think that usually we run the yarprobotinterface only 3 or 4 times without rebooting also the motors.
It would be interesting to see what happens if you manually restart
yarprobotinterface
several times w/out shutting motors' power off in between. Let's say: 10 times in a row. Does the problem happens? Can anybody test that next week?
I can give it a try
I just tried running the yarprobotinterface
many times in a row, and the limit I reached was 5. At the 5th reboot the robot did not start with the error
theNVmanager::Impl::wait() had a timeout for BOARD torso-eb5-j0_2 IP 10.0.1.5 and nv ID32 = 0x00030001 -> IND = 0, TAG = eoprot_tag_mn_service_status_commandresult
I don't know if it is related. I tried to reboot the motors, run it for 4 times, and the 5th again the same error.
Here a screenshot of the FirmwareUpdater
Here the log: log_icub-head_yarprobotinterface_5224_5thtime.txt
Here the application file I used: https://github.com/dic-iit/robots-configuration/blob/1849cb4b3fe4d433344242e48c3675729f92dc99/iCubGenova04/icub_test.xml
I just tried running the yarprobotinterface many times in a row, and the limit I reached was 5. At the 5th reboot the robot did not start with the error
Very good. It is as I expected. From FirmwareUpdater
I can see many boards missing, not only the 10.0.1.5, which is an ems
.
Hi @S-Dafarra,
I have produced two binaries for the ems
and the mc4plus
which I would like to test on iCubGenova04:
ems-test-v103.36.zip mc4plus-test-v103.29.zip
They are compatible w/ the latest devel / master release.
I have tested the solutions added to these binaries w/ 100 cycles of mc service and the boards can still be pinged.
Hi @S-Dafarra,
I have produced two binaries for the
ems
and themc4plus
which I would like to test on iCubGenova04:ems-test-v103.36.zip mc4plus-test-v103.29.zip
They are compatible w/ the latest devel / master release.
I have tested the solutions added to these binaries w/ 100 cycles of mc service and the boards can still be pinged.
Hi @marcoaccame! That's great! Unfortunately I am not in the lab at the moment, but the robot should be free if you want to test it. Otherwise, I will fash it in the following days!
Hi @marcoaccame! That's great! Unfortunately I am not in the lab at the moment, but the robot should be free if you want to test it. Otherwise, I will fash it in the following days!
@S-Dafarra, I will be in the lab next week. In the meantime I will document the changes.
Hi @marcoaccame! That's great! Unfortunately I am not in the lab at the moment, but the robot should be free if you want to test it. Otherwise, I will fash it in the following days!
@S-Dafarra, I will be in the lab next week. In the meantime I will document the changes.
Sounds good. Indeed, since the problem on the wrists might be difficult to check, I think it would be definitely worthwhile to proceed anyway. Then, with time we can check if the wrist problem appears again.
Sounds good. Indeed, since the problem on the wrists might be difficult to check, I think it would be definitely worthwhile to proceed anyway. Then, with time we can check if the wrist problem appears again
I am sure that the changes I have done solve some problems (the ones verified on my setup) and we can surely add into robotology. However, before approving the PR it is our habit to test it on the robot. I will have a go at it next week. Maybe we can do it together so that we test in the same way the robot is effectively used.
Hi, here is the code used to compile the two test binaries of https://github.com/robotology/icub-firmware/issues/174#issuecomment-795451305
I will produce a PR into robotology that will be merged after the tests on the robot.
What happened?
yarprobotinterface
asking the MC service to the boards caused a fatal error in the mc4plus
and ems
boards for multiple reasons:
mc4plus
because: function hal_quadencoder_init_indexes_flags()
was called without checking if the associated IRQ Handler was inactive (as it must be) before configuration of HW peripherals which the IRQ handler uses.ems
because: the pair of functions hal_spiencoder_init()
and hal_spiencoder_deinit()
did not manage RAM correctly and the hal_spiencoder_init()
consumed memory at each start of yarpbotinterface
. So, after a few start / stop cycles the board detected end of available RAM.I have also spotted a memory leak in the EOCurrentsWatchdog
module used by the mc4plus
board, which is however crashing first due to hal_quadencoder_init_indexes_flags()
.
Results of the tests of iCubGenova04: OK
Did extensive tests on iCubGenova04 during the the whole day of 19 mar 2021. Used binaries for ems
and mc4plus
complied w/ the option SPIENC_DEINIT_DEALLOCATE_HEAP
enabled.
Launched yarprobotinterface --config icub_wbd.xml
and then stopped it for > 20 times. Left the robot running at least a couple of times for > 30 minutes.
The boards did not disappear.
The current release of master instead caused the board to disappear after 2 or 3 srat / stop cycles.
The used binaries work fine.
What now:
rebased the fork in https://github.com/marcoaccame/icub-firmware/tree/fix/multiplestartofmc vs latest release of https://github.com/robotology/icub-firmware/tree/devel which has evolved so far and updated the application versions of ems
, mc4plus
, mc2plus
for the PR.
It may be worth a new quick test on the robot.
Nice @marcoaccame 👍🏻
Consider that we're quickly approaching a deadline for the new distro 2021.02.feat-01
.
cc @Nicogene
It may be worth a new quick test on the robot.
Tested once more on 24 mar 2021 after the rebase of icub-firmware, rebuilt the binaries to be sure the binaries in here were correct
The tests were OK, hence I merged the two PRs: https://github.com/robotology/icub-firmware-build/pull/25 and https://github.com/robotology/icub-firmware/pull/176
cc: @pattacini @maggia80 @S-Dafarra
Thanks @marcoaccame !
Closing, feel free to open it up again if needed.
Tested a mechanism which does the following:
mc4plus
enters in the handler of a fatal error, it writes some info in a log area placed in permanent RAM and then forces a restartThe log area cannnot hold more than 15 byte, so I will log:
s_osal_cfg_on_fatal_error()
, s_ipal_cfg_on_fatal_error()
, s_hal_core_cfg_on_fatalerror()
, various system HW handlers such as HardFault_Handler()
etc, ...)osal_error_stackoverflow
, osal_error_missingmemory
, etc.)I did a basic test for this mechanism, now I need to fill all the fatal error handlers and test some cases.
Then, I will release a new version for the mc4plus
application which can hopefully tell us more about what causes the problem.
tested w/ code such as this:
typedef struct
{
uint32_t millisecondsfromstart;
uint8_t handlertype;
uint8_t handlererrorcode;
uint8_t calledbyanirqhandler;
uint8_t idofthelastscheduledthread;
uint8_t forfutureuse0;
uint8_t forfutureuse1;
uint16_t signature;
} fatal_error_message_t; EO_VERIFYsizeof(fatal_error_message_t, 12)
typedef struct
{
uint64_t par64;
uint16_t par16;
} fatal_error_params_t; EO_VERIFYsizeof(fatal_error_params_t, 16)
typedef union
{
fatal_error_message_t message;
fatal_error_params_t params;
} fatal_error_t; EO_VERIFYsizeof(fatal_error_t, 16)
Code Listing: Data structures exchanged across a restart of the mc4plus
board.
#include "eEsharedServices.h"
static volatile fatal_error_t detectedfatalerror = {0};
static uint8_t detectedsize = 0;
// read ipc memory
if(ee_res_OK == ee_sharserv_ipc_userdefdata_get((uint8_t*)&detectedfatalerror, &detectedsize, sizeof(fatal_error_t)))
{
// there is something.
// i clear ipc ram
ee_sharserv_ipc_userdefdata_clr();
if((detectedsize <= sharserv_base_ipc_userdefdata_maxsize) && (0x1234 == detectedfatalerror.message.signature))
{
// it is of the correct size. i use it to send a diagnostic message
uint16_t par16 = detectedfatalerror.params.par16;
uint64_t par64 = detectedfatalerror.params.par64;
eOerrmanDescriptor_t errdes = {0};
errdes.code = eoerror_code_get(eoerror_category_Debug, eoerror_value_DEB_tag06);
errdes.sourcedevice = eo_errman_sourcedevice_localboard;
errdes.sourceaddress = 0;
errdes.par16 = par16;
errdes.par64 = par64;
eo_errman_Error(eo_errman_GetHandle(), eo_errortype_error, "RESTARTED after FATAL error", NULL, &errdes);
}
}
Code Listing: Code executed by the mc4plus
board at startup.. If a message in IPC memory is found then a diagnostic message is sent to yarprobotinferface.
// in here there is just a test of emission of ipc data w/ a restart
static fatal_error_t tobewritten = {0};
tobewritten.message.millisecondsfromstart = osal_system_ticks_abstime_get() / 1000; // better using another way so that we avoid calling an svc
tobewritten.message.handlertype = 1;
tobewritten.message.handlererrorcode = 3;
tobewritten.message.calledbyanirqhandler = 0;
tobewritten.message.idofthelastscheduledthread = 9;
tobewritten.message.forfutureuse0 = 1;
tobewritten.message.forfutureuse1 = 2;
tobewritten.message.signature = 0x1234;
// write in ipc memory and ... restart
ee_sharserv_ipc_userdefdata_set((uint8_t*)&tobewritten, sizeof(tobewritten));
ee_sharserv_sys_restart();
Code Listing: Code to be included in the fatal error handlers. It write some info in the IPC memory and then forces a restart.
This code was tested on an mc4plus
board. The effect is that the mc4plus sends a message such as this:
[ERROR] (EO? tsk2 @S5:m632:u723)-> {0x4000006 p16 0x0201, p64 0x0900030100001600, dev 0, adr 0}:
DEBUG: tag06. INFO = RESTARTED after FATAL error
Sadly, yarprobotinterface
will not print (yet) in human readable format, but it should be easy to decode p64 which contains in its lower 4 bytes the millisecondsfromstart
(0x00001600 = 5sec 632milli), then the handlertype
(0x01), the handlererrorcode
(0x03) etc.
The IPC messaging was designed years ago to carry up to 15 bytes. So far we use only 8 bytes, so we can maybe add some more without changing the IPC implementation. To extend to more bytes is surely feasible but it needs to reflash eLoader, eUpdater, eApplication.
We have just successfully tested a new FW for the mc4plus board which is able to send diganostics infor to yarprobotinetrfce in case of fatal errors.
In such a case, it forces a restart and prints info such as in the following (where we artificially forced a stack overflow from thread runDO after 20 sec of start of teh board)
[**INFO**] from BOARD 10.0.1.1 (l-hv3-hand), src LOCAL, adr 0, time 1s 953m 401u:
(code 0x0000003b, par16 0x0000 par64 0x0000000000000000) -> SYS: the board is bootstrapping + .
[**ERROR**] from BOARD 10.0.1.1 (l-hv3-hand), src LOCAL, adr 0, time 1s 955m 7u:
(code 0x04000000, par16 0x0000 par64 0x0b0be50300004e20) -> DEBUG: tag00 + RESTARTED after FATAL error
[**ERROR**] from BOARD 10.0.1.1 (l-hv3-hand), src LOCAL, adr 0, time 1s 955m 117u:
(code 0x04000000, par16 0x0000 par64 0x0b0be50300004e20) -> DEBUG: tag00 + @ 20000 ms
[**ERROR**] from BOARD 10.0.1.1 (l-hv3-hand), src LOCAL, adr 0, time 1s 955m 235u:
(code 0x04000000, par16 0x0000 par64 0x0b0be50300004e20) -> DEBUG: tag00 + handler OSAL, code 0xe5
[**ERROR**] from BOARD 10.0.1.1 (l-hv3-hand), src LOCAL, adr 0, time 1s 955m 348u:
(code 0x04000000, par16 0x0000 par64 0x0b0be50300004e20) -> DEBUG: tag00 + type osal_stackovf
[**ERROR**] from BOARD 10.0.1.1 (l-hv3-hand), src LOCAL, adr 0, time 1s 955m 467u:
(code 0x04000000, par16 0x0000 par64 0x0b0be50300004e20) -> DEBUG: tag00 + IRQHan SVCall Thread runDO
[**ERROR**] from BOARD 10.0.1.1 (l-hv3-hand), src LOCAL, adr 0, time 1s 955m 581u:
(code 0x04000000, par16 0x0000 par64 0x0b0be50300004e20) -> DEBUG: tag00 + ipsr 11, tid 11
List. Board 10.0.1.1 has detected a fatal error (first message of type DEBUG: tag00
w/ string RESTARTED after FATAL error
) at its execution time 20 sec (second message with string 20000 ms
). The third and fourth message tell that the error was caused by the OSAL handler and is due to stack overflow (see string ype osal_stackovf
). The handler was called by the IRQHandler SVCall which is the one which does thread switching and the error was caused by last scheduled thread called runDO (it is the one which ticks all teh services at 1 kHz.
In case of no error, teh FW works just fine as the latest devel version. Test were don today on iCubGenova09.
I shall soon produce a PR.
Hi @marcoaccame these are two logs containing the failures you're investigating in this issue:
Related issue: https://github.com/robotology/icub-tech-support/issues/673
Cool, thanks @GiulioRomualdi for reporting on this. @marcoaccame is off till the last week of August. We will then get back to this those days.
See https://github.com/robotology/icub-tech-support/issues/673#issuecomment-900634451 for the looming outdoor demos that would strongly benefit from a definite solution to this issue
Hi @marcoaccame these are two logs containing the failures you're investigating in this issue:
- this happens in the left pronosup
- while this for the pitch and roll joints of the head
Related issue: robotology/icub-tech-support#673
Hi @GiulioRomualdi, the logs are obtained w/ a FW version of the mc4plus which is older than the one described in here, so the logs don't contain any useful (new) information.
so, now we need to:
* ensure that the mc4plus fw is the latest version of devel
Just for general alignment, that version of the firmware was release as part of v2021.08.0 disto, so just using icub-firmware-build v1.21.0 is probably enough.
Following issue in here, we have identified a critical behaviour that should be investigated and solved: some
mc4plus
boards disappears from the ETH network when runningyarprobotinterface
.What happens is that an
mc4plus
board which offers the motion control service may stop transmitting and receiving UDP frames. The board maintains PWM actuation and seems to be stuck: it does not respond toping
and is not seen byFirmwareUpdater
.We have observed the board
10.0.1.24
to have such a misbehavior but also the board10.0.1.20
.A careful analysis is required using multiple investigation paths. Amongst them:
mc4plus
andems
in an attempt to spot bugs in the differences;mc4plus
board so that we can trace exact behaviour;ems
board;cc: @S-Dafarra @DanielePucci @maggia80