openwsn-berkeley / mercator

Dense Wireless Connectivity Datasets for the IoT.
https://openwsn-berkeley.github.io/mercator/

Some nodes fail on long-term experiments #7

Closed keomabrun closed 7 years ago

keomabrun commented 8 years ago

After one hour on a large experiment (>350 nodes), the following nodes are not responding:

-----------timeout--------------m3-220
-----------timeout--------------m3-1
-----------timeout--------------m3-30
-----------timeout--------------m3-189
-----------timeout--------------m3-286
-----------timeout--------------m3-285
-----------timeout--------------m3-283
-----------timeout--------------m3-282
-----------timeout--------------m3-281
-----------timeout--------------m3-280
-----------timeout--------------m3-289
-----------timeout--------------m3-288
-----------timeout--------------m3-306
-----------timeout--------------m3-307
-----------timeout--------------m3-304
-----------timeout--------------m3-305
-----------timeout--------------m3-302
-----------timeout--------------m3-303
-----------timeout--------------m3-301
-----------timeout--------------m3-308
ddujovne commented 8 years ago

That is the same behavior we observed on large-scale experiments using wsn430 nodes. We thought this had to do with network overload issues or specific hardware instabilities. Can you test whether this happens with fewer nodes, or whether the same node IDs repeat between different experiments?

keomabrun commented 8 years ago

Alright, I will try that tonight. Thank you for the comment.

keomabrun commented 8 years ago

I have been running an experiment with 85 nodes for 10 minutes and one node is already unreachable. I will re-run the experiment later to see whether the same nodes produce the errors.

keomabrun commented 8 years ago

Some nodes tend to fail more often than others, but I don't see any exact pattern.

ddujovne commented 8 years ago

We also thought that there may be a problem when flashing the firmware onto the microcontrollers. My initial bootstrap process identified the failing nodes and reflashed them individually in two passes, then discarded the ones which still failed on the second try. This reduced the number of failed nodes, but did not reduce it to zero. I wonder if there is a way to add an MD5 checksum within the firmware to test the integrity of the flash image. This hypothesis is based on the idea that there may be a problem either during firmware distribution to the node supervisors (which, I suppose, flash the firmware onto each node) or in the supervisor's flash routine, which may not check for integrity after flashing. A third hypothesis is a communication error between the supervisor and the node, or between the supervisor and the central controller, which may arise on any node due to a bug.
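A minimal sketch of that integrity-check idea, using a CRC32 rather than MD5 to keep it short; the image base address, image length and expected-checksum location are placeholders, not the actual IoT-lab M3 memory layout:

```c
/*
 * Sketch only: compute a checksum over the programmed image at boot and
 * compare it against a value the build/flash tooling appends right after
 * the image. CRC32 is used instead of MD5 to keep the code small; the
 * image length and checksum location below are assumptions.
 */
#include <stdint.h>

#define APP_FLASH_BASE   ((const uint8_t *)0x08000000u)  /* STM32 flash base            */
#define APP_IMAGE_LENGTH (60u * 1024u)                   /* placeholder image size      */
#define APP_EXPECTED_CRC (*(const uint32_t *)(APP_FLASH_BASE + APP_IMAGE_LENGTH))

/* Bitwise CRC-32 (IEEE 802.3 polynomial, reflected). */
static uint32_t crc32(const uint8_t *data, uint32_t length) {
    uint32_t crc = 0xFFFFFFFFu;
    for (uint32_t i = 0; i < length; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 1u) crc = (crc >> 1) ^ 0xEDB88320u;
            else          crc >>= 1;
        }
    }
    return ~crc;
}

/* Returns 1 if the image in flash matches the checksum appended after it. */
int flash_image_is_intact(void) {
    return crc32(APP_FLASH_BASE, APP_IMAGE_LENGTH) == APP_EXPECTED_CRC;
}
```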

twatteyne commented 8 years ago

@keomabrun, @ddujovne,

I believe Mercator stresses the IoT-lab a bit because it involves so much serial activity. While it seems unbelievable to me that this wouldn't work, it's possible that we are the first users to use the serial port this much, and that insufficient testing was done.

I think the problem comes from the IoT-lab infrastructure itself. For example (if I had to guess), something goes wrong in the serial port forwarding and it stops actually forwarding, making the mote "disappear" from a serial port point of view after some bug. I trust that the code on the mote stays on the mote, but I have very little trust in the million things around it.

If the mote is still alive, it will still be sending packets over the air; you just won't be able to communicate with it over its serial port. One experiment you could do is have the motes send small packets with their address every now and then, listen the rest of the time, and write over their serial port the addresses of the other motes they heard. If you stop hearing from a mote over its serial port, but that mote's address still appears in the serial port notifications of at least one other mote, the problem is the serial.
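A rough sketch of that heartbeat experiment; the radio and serial functions below are hypothetical platform hooks (not the actual OpenWSN BSP API), only the logic matters:

```c
/* Sketch: broadcast own address periodically, listen otherwise, and report
 * every heard address over the serial port. All extern functions are
 * assumed platform hooks. */
#include <stdint.h>

extern void     radio_send(const uint8_t *buf, uint8_t len);   /* assumed hook */
extern uint8_t  radio_receive(uint8_t *buf, uint8_t maxlen);   /* assumed hook, returns 0 if nothing heard */
extern void     uart_write_line(const char *line);             /* assumed hook */
extern uint32_t clock_ms(void);                                /* assumed hook */

#define HEARTBEAT_PERIOD_MS 5000u
#define MY_ADDRESS          0x042Au   /* placeholder 16-bit address */

void heartbeat_loop(void) {
    uint32_t last_tx = 0;
    uint8_t  rxbuf[2];

    for (;;) {
        /* Periodically broadcast our own 16-bit address. */
        if (clock_ms() - last_tx >= HEARTBEAT_PERIOD_MS) {
            uint8_t txbuf[2] = { MY_ADDRESS >> 8, MY_ADDRESS & 0xFF };
            radio_send(txbuf, sizeof(txbuf));
            last_tx = clock_ms();
        }
        /* Listen the rest of the time; report anything heard over serial. */
        if (radio_receive(rxbuf, sizeof(rxbuf)) == sizeof(rxbuf)) {
            static const char hex[] = "0123456789ABCDEF";
            char line[16] = "heard 0x0000";
            line[8]  = hex[rxbuf[0] >> 4];
            line[9]  = hex[rxbuf[0] & 0x0F];
            line[10] = hex[rxbuf[1] >> 4];
            line[11] = hex[rxbuf[1] & 0x0F];
            uart_write_line(line);
        }
    }
}
```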

@keomabrun, can I ask you to coordinate closely with the IoT-lab engineers ASAP? They will understand the problem; we can only guess and waste time. Send unicast e-mails to Frederic Saint-Marcel and Gaetan Harter, maybe call them (numbers in the Inria address book), and explain what you see.

Thomas

keomabrun commented 8 years ago

@twatteyne,

If the mote is still alive, it will still be sending packets over the air,

If the mote is still alive, it might also be in idle or receiving state and not sending anything.

One experiment you could do is have the motes send small packets with their address every now and then

Will this experiment stress the serial enough?

@keomabrun, can I ask you to coordinate closely with the IoT-lab engineers ASAP

Will do today.

Thank you,

keomabrun commented 8 years ago

When I reduce the number of messages sent by not requesting the mote status after each state change (Rx to Idle, Idle to Tx, ...), and set the maximum response timeout to 10 s, no motes are lost. This is validated on a 92-mote experiment.

ddujovne commented 8 years ago

So a tunable inter-packet delay should work?

keomabrun commented 8 years ago

I did not change the inter-packet delay. The issue seems to come from a serial limitation, not a wireless or firmware one.

keomabrun commented 8 years ago

Sorry, I understand your point now. I will try that. Thanks.

ddujovne commented 8 years ago

You reduced the number of packets arriving at the nodes by not requesting confirmation of the change from one state to the next. What I propose is to test whether, by generating fewer packets per second (i.e., slowing down the experiment), it would run smoothly.

keomabrun commented 8 years ago

A 10 ms inter-transaction delay does not work with 88 nodes. 100 ms does not work either with 88 nodes (it just takes longer to fail). A 500 ms inter-transaction delay + 500 ms inter-packet delay does not work either (nodes crash).

One question: do we need to check the state after each state modification?

There are actually two things occurring:

temporary timeout: the serial is slow

persistent timeout: the mote crashed

And we don't know if the two are linked.

I also wonder whether the total serial capacity is shared with the other experiments running in the lab. I am waiting for the IoT-Lab admins' answer.

twatteyne commented 8 years ago

temporary timeout: the serial is slow

@keomabrun, I guess by now you know what I'll say :-) If the serial doesn't work well on IoT-lab, we shouldn't modify our firmware to work around the problem, but rather tell the engineers in charge of the platform to make something that works. So thanks for coordinating with the guys!

persistent timeout: the mote crashed. And we don't know if the two are linked.

Hmm, that we should certainly fix! One very first thing is to make sure that, when a mote "crashes", it resets. When resetting, the mote should write a "reset notification" on its serial port. This will allow us to monitor that/when motes are resetting, and to verify that a future actual fix works.
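A minimal sketch of such a reset notification, assuming the STM32F10x reset-cause flags in RCC_CSR and a hypothetical uart_write_line() serial hook:

```c
/*
 * Sketch: at the very start of main(), read the reset-cause flags and write
 * one line over the serial port so the host log shows that (and roughly why)
 * a mote rebooted. Register addresses are for the STM32F10x; the serial
 * hook is an assumption, not the actual OpenWSN BSP call.
 */
#include <stdint.h>

#define RCC_CSR      (*(volatile uint32_t *)0x40021024u)  /* RCC control/status register */
#define CSR_RMVF     (1u << 24)   /* write 1 to clear the reset flags */
#define CSR_PINRSTF  (1u << 26)
#define CSR_PORRSTF  (1u << 27)
#define CSR_SFTRSTF  (1u << 28)
#define CSR_IWDGRSTF (1u << 29)

extern void uart_write_line(const char *line);   /* assumed serial hook */

void report_reset_cause(void) {
    uint32_t csr = RCC_CSR;

    if      (csr & CSR_IWDGRSTF) uart_write_line("RESET: watchdog");
    else if (csr & CSR_SFTRSTF)  uart_write_line("RESET: software");
    else if (csr & CSR_PORRSTF)  uart_write_line("RESET: power-on");
    else if (csr & CSR_PINRSTF)  uart_write_line("RESET: reset pin");
    else                         uart_write_line("RESET: unknown");

    RCC_CSR = csr | CSR_RMVF;   /* clear the flags for the next boot */
}
```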

ddujovne commented 8 years ago

In order to reset the node when a crash happens, wouldn't it be good to use the watchdog timer? There should be a register where the reset cause is kept; one (or a group) of its bits should be reserved to flag the watchdog timer action.
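A minimal sketch of that watchdog approach, assuming the STM32F10x independent watchdog (IWDG) running from its nominal ~40 kHz LSI clock; a watchdog-triggered reset then shows up as the IWDGRSTF flag in RCC_CSR (see the reset-notification sketch above):

```c
/* Sketch: start the IWDG once at boot, kick it from the main loop; if the
 * firmware hangs, the IWDG expires and resets the chip. */
#include <stdint.h>

#define IWDG_KR  (*(volatile uint32_t *)0x40003000u)  /* key register */
#define IWDG_PR  (*(volatile uint32_t *)0x40003004u)  /* prescaler    */
#define IWDG_RLR (*(volatile uint32_t *)0x40003008u)  /* reload value */

void watchdog_start(void) {
    IWDG_KR  = 0x5555u;   /* unlock PR/RLR for writing                    */
    IWDG_PR  = 6u;        /* LSI/256 -> ~6.4 ms per tick                  */
    IWDG_RLR = 0xFFFu;    /* 4095 ticks -> ~26 s timeout                  */
    IWDG_KR  = 0xCCCCu;   /* start the watchdog (cannot be stopped later) */
}

void watchdog_kick(void) {
    IWDG_KR = 0xAAAAu;    /* reload the counter; call this from the main loop */
}
```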

twatteyne commented 8 years ago

Not really. What it means is simply to have the default interrupt handler reset the board.

A crash happens most likely because the code is (wrongly) trying to access some memory location which doesn't exist (usually 0x00000000, because the code tries to de-reference an uninitialized pointer). This causes a non-maskable interrupt (NMI). Typically, low-level drivers which you download from the Internet use the same "default handler" for all interrupt sources, which your code then overrides with more specific handlers. If that default handler contains a while(1) statement, the chip hangs.

I'm suggesting either (1) not to use a default handler and leave the interrupt vector set to all 0's, or (2) to use the default handler but have it reset the board.
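A minimal sketch of option (2), requesting the reset from a shared default handler through the Cortex-M3 AIRCR register (architectural, not board-specific):

```c
/* Sketch: one default handler for all unused interrupt vectors that reboots
 * the mote instead of spinning in while(1). */
#include <stdint.h>

#define SCB_AIRCR         (*(volatile uint32_t *)0xE000ED0Cu)
#define AIRCR_VECTKEY     (0x05FAu << 16)   /* required write key     */
#define AIRCR_SYSRESETREQ (1u << 2)         /* request a system reset */

void Default_Handler(void) {
    /* Instead of hanging, reboot so the mote comes back by itself. */
    SCB_AIRCR = AIRCR_VECTKEY | AIRCR_SYSRESETREQ;
    for (;;) { /* wait for the reset to take effect */ }
}
```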

keomabrun commented 8 years ago

The interrupt handlers are set here: https://github.com/openwsn-berkeley/openwsn-fw/blob/develop/bsp/boards/iot-lab_M3/configure/stm32f10x_it.c
According to @changtengfei, HardFault_Handler should override MemManage_Handler, so the mote should reset.

However, in this file: https://github.com/openwsn-berkeley/openwsn-fw/blob/develop/bsp/boards/iot-lab_M3/board.c
in the board_enableHardFaultExceptionHandler function (at the bottom), bit 3, which corresponds to "trapping unaligned memory access", is not set to 1 (contrary to what the comment indicates).

The M3 specification: http://www.st.com/content/ccc/resource/technical/document/programming_manual/5b/ca/8d/83/56/7f/40/08/CD00228163.pdf/files/CD00228163.pdf/jcr:content/translations/en.CD00228163.pdf (page 138, UNALIGN_TRP)

I will try to change that now.
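For reference, a minimal sketch of that change: setting UNALIGN_TRP (bit 3) in the Cortex-M3 Configuration and Control Register at 0xE000ED14; DIV_0_TRP (bit 4) is shown as well, as one of the other trap bits that could be enabled:

```c
/* Sketch: make unaligned accesses (and divides by zero) trap instead of
 * being silently performed/ignored, so the fault handler catches them. */
#include <stdint.h>

#define SCB_CCR          (*(volatile uint32_t *)0xE000ED14u)
#define CCR_UNALIGN_TRP  (1u << 3)   /* trap unaligned halfword/word accesses */
#define CCR_DIV_0_TRP    (1u << 4)   /* trap integer divide-by-zero           */

void enable_usage_fault_traps(void) {
    SCB_CCR |= CCR_UNALIGN_TRP | CCR_DIV_0_TRP;
}
```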

keomabrun commented 8 years ago

Even after changing the UNALIGN_TRP bit to 1, there are still motes that crash. I will try enabling the other bits.