rbaron / b-parasite

🌱💧 An open source DIY soil moisture sensor
1.85k stars 143 forks source link

[zigbee sample] Fix rare infinite loop bug & introduce watchdog #150

Closed rbaron closed 11 months ago

rbaron commented 11 months ago

I think I finally managed to track down and fix the actual bug from https://github.com/rbaron/b-parasite/issues/113#issuecomment-1484903062, #126, #130, #145, #146.

Based on this SHTC3 lib, we had the following code in shtc3.c:

  while (i2c_read_dt(&shtc3, buff, 6) != 0) {
    k_msleep(10);
  }

In the rare occasion that there's an issue with the i2c signals from the sensor, this code can hang forever. If it does, the sensor reading callback get stuck, and it's not rescheduled. I managed to reproduce the issue in the zb-bug-repro branch.

This PR also introduces a watchdog, based on examples from the SDK:

I will let it roast for a couple of weeks.

Fixes #145.

cc @oleo65, @MJDSys, @AlbanDurrheimer

oleo65 commented 11 months ago

Awesome, one more step to more robustness. 😃

I will give it a try next time I have to change a battery somewhere and than will reflash the device.

AlbanDurrheimer commented 11 months ago

Good news, thank you. I'd also like to test it, but it's not practical to have to reconnect the flasher to my card (I don't have a pogo pin yet). I'll test it in a few weeks.

AlbanDurrheimer commented 11 months ago

Hello @rbaron, I wanted to test the modifications on my module, which is causing the problem.
When I flash this version (or the latest one of the main branch), my module no longer boots (the LED never lights up).
I've tried flashing earlier versions, and it's the modifications made by pull request #146 that cause my module to stop booting.
Do you have any idea what in #146 would prevent my module from starting (and therefore from lighting up the LED)?
Thank you

rbaron commented 10 months ago

@AlbanDurrheimer I would, in order, check:

  1. Logs. This will most likely tell you what's wrong
  2. Does it happen with a single board or different ones?
  3. Was the correct board version & nRF chip selected at build time?
  4. Did the flashing succeed until the end?
  5. [unlikely] Try fully erasing when uploading a new firmware

With the info you sent, my guess is that you have a faulty sensor, most likely the SHTC3. In #146, we introduced this line, which quickly reads/checks all sensors at boot. If you have a faulty sensor, this line will cause a reboot. This lets faulty boards fail earlier rather than later, which I think is a Good Thing.

AlbanDurrheimer commented 10 months ago

@rbaron Thank you for your reply.
I'll answer each point:

  1. Yes, I wanted to look at the logs, so I installed nRF Connect in VS Code (until now I was building with Docker and flashing with J-Flash Lite), but I must have missed something in my installation, because it doesn't flash or run the debug. I'll try again.
  2. I've updated my other boards with no problem.
  3. Yes, I've got the right version (I've been wrong before, so now it's the first thing I check).
  4. Yes, I managed to flash it and even several times to the end.
  5. I've also tried this solution before.

I've also tried flashing the blinky sample successfully.

I had understood that you had added a sensor check, which is why I wanted to ask. I wonder if it wouldn't also be interesting, when a sensor is detected as faulty, to use the LED to report a fault to the user before the reboot without having to connect it to a computer (example: LED lights up 3 times for a long time = fault no. 3).

EDIT: After editing the code to add logging, the program effectively crashes when reading SHTC3. I re-sodded the compossant and checked the continuity but it didn't change anything.

AlbanDurrheimer commented 10 months ago

I think I've managed to log something.

00> [00:00:00.000,457] ieee802154_nrf5: nRF5 802154 radio initialized 00> Booting Zephyr OS build fcaa60a99fa9 00> [00:00:00.003,417] littlefs: LittleFS version 2.5, disk version 2.0 00> [00:00:00.003,631] littlefs: FS at flash-controller@4001e000:0xfa000 is 6 0x1000-byte blocks with 512 cycle 00> [00:00:00.003,662] littlefs: sizes: rd 16 ; pr 16 ; ca 64 ; la 32 00> [00:00:00.005,432] littlefs: /lfs mounted 00> [00:00:00.546,478] i2c_nrfx_twi: Error on I2C line occurred for message 0 00> [00:00:00.546,539] shtc3: Error -5: in /code/prstlib/src/shtc3.c:19 00> [00:00:00.546,569] shtc3: Error -5: in /code/prstlib/src/shtc3.c:27 00> [00:00:00.546,569] sensors: Error -5: in /code/prstlib/src/sensors.c:15 00> ASSERTION FAIL [!prst_sensors_read_all(&sensors)] @ CMAKE_SOURCE_DIR/src/main.c:304 00> [00:00:00.546,630] os: r0/a1: 0x00000004 r1/a2: 0x00000130 r2/a3: 0x00000005 00> [00:00:00.546,661] os: r3/a4: 0x200012a0 r12/ip: 0x00000014 r14/lr: 0x000079f5 00> [00:00:00.546,661] os: xpsr: 0x41000000 00> [00:00:00.546,691] os: s[ 0]: 0x00000002 s[ 1]: 0x00000000 s[ 2]: 0x00000000 s[ 3]: 0x00000000 00> [00:00:00.546,722] os: s[ 4]: 0x00000000 s[ 5]: 0x00000000 s[ 6]: 0x00000000 s[ 7]: 0x00000000 00> [00:00:00.546,722] os: s[ 8]: 0x00000000 s[ 9]: 0x00000000 s[10]: 0x00000001 s[11]: 0x000003e8 00> [00:00:00.546,752] os: s[12]: 0x000001f0 s[13]: 0x00000000 s[14]: 0x0000ffff s[15]: 0x00000000 00> [00:00:00.546,783] os: fpscr: 0x20000010 00> [00:00:00.546,813] os: Faulting instruction address (r15/pc): 0x00044eb2 00> [00:00:00.546,844] os: >>> ZEPHYR FATAL ERROR 4: Kernel panic on CPU 0 00> [00:00:00.546,875] os: Current thread: 0x20002860 (unknown) 00> [00:00:01.057,128] fatal_error: Resetting system

AlbanDurrheimer commented 10 months ago

I've also just assembled a new board, and it has the same problem.
IMG_20230810_145126

AlbanDurrheimer commented 10 months ago

Sorry for spamming :sweat_smile:.
I think I've found the reason. Obviously, on my modules, the SHTC3 sensor, but always more than one loop before succeeding in being readable (d'according to my logs, it's the line RET_IF_ERR(write_cmd(PRST_SHTC3_CMD_WAKEUP)); that creates the error):

Before #146:

[00:00:00.003,448] littlefs: LittleFS version 2.5, disk version 2.0 [00:00:00.003,662] littlefs: FS at flash-controller@4001e000:0xfa000 is 6 0x1000-byte blocks with 512 cycle [00:00:00.003,692] littlefs: sizes: rd 16 ; pr 16 ; ca 64 ; la 32 [00:00:00.003,814] littlefs: WEST_TOPDIR/modules/fs/littlefs/lfs.c:1234: Corrupted dir pair at {0x0, 0x1} [00:00:00.003,814] littlefs: can't mount (LFS -84); formatting [00:00:00.176,269] littlefs: /lfs mounted [00:00:00.265,045] debug_counters: fs_read returned 0, expected 4, assuming first access [00:00:00.265,838] app: Dumping debug counters: [00:00:00.267,395] app: - boot: 1 [00:00:00.267,913] app: Updating sensors [00:00:00.808,380] i2c_nrfx_twi: Error on I2C line occurred for message 0 [00:00:00.808,441] shtc3: Error -5: in /code/prstlib/src/shtc3.c:19 [00:00:00.808,441] shtc3: Error -5: in /code/prstlib/src/shtc3.c:27 [00:00:00.808,471] sensors: Error -5: in /code/prstlib/src/sensors.c:15 [00:00:00.810,302] debug_counters: fs_read returned 0, expected 4, assuming first access [00:00:00.811,096] app: Unable to read sensors [00:00:01.711,669] zigbee_app_utils: Enabled sleepy end device behavior. [00:00:02.050,506] debug_counters: fs_read returned 0, expected 4, assuming first access [00:00:02.393,981] zigbee_app_utils: Production configuration is not present or invalid (status: -1) [00:00:02.394,226] zigbee_app_utils: Zigbee stack initialized [00:00:02.397,552] zigbee_app_utils: Device started for the first time [00:00:02.397,583] zigbee_app_utils: Start network steering [00:00:02.397,613] zigbee_app_utils: Started network rejoin procedure. [00:00:43.930,572] debug_counters: fs_read returned 0, expected 4, assuming first access [00:00:46.732,391] zigbee_app_utils: Network steering was not successful (status: -1) [00:01:00.864,044] app: Updating sensors [00:01:00.922,363] sensors: Batt: 2475 mV (7.40%) [00:01:00.922,393] sensors: Soil: 5 % (0 mV) [00:01:00.922,393] sensors: Photo: 100 lx (0 mV) [00:01:00.922,424] sensors: Temp: 26.566391 oC [00:01:00.922,424] sensors: Humi: 43 % [00:01:00.922,454] sensors: -------------------------------------------------- [00:01:32.080,383] zigbee_app_utils: Network steering was not successful (status: -1)

After (main branch) with initialize sensors no critical (for with critical error return to other comment): image

[00:00:00.003,753] littlefs: LittleFS version 2.5, disk version 2.0 [00:00:00.003,997] littlefs: FS at flash-controller@4001e000:0xfa000 is 6 0x1000-byte blocks with 512 cycle [00:00:00.003,997] littlefs: sizes: rd 16 ; pr 16 ; ca 64 ; la 32 [00:00:00.004,119] littlefs: WEST_TOPDIR/modules/fs/littlefs/lfs.c:1234: Corrupted dir pair at {0x0, 0x1} [00:00:00.004,150] littlefs: can't mount (LFS -84); formatting [00:00:00.176,849] littlefs: /lfs mounted [00:00:00.804,779] i2c_nrfx_twi: Error on I2C line occurred for message 0 [00:00:00.804,840] shtc3: Error -5: in /code/prstlib/src/shtc3.c:19 [00:00:00.804,840] shtc3: Error -5: in /code/prstlib/src/shtc3.c:29 [00:00:00.804,870] sensors: Error -5: in /code/prstlib/src/sensors.c:15 [00:00:00.806,701] debug_counters: fs_read returned 0, expected 4, assuming first access [00:00:00.807,556] app: Unable to read sensors [00:00:00.809,234] debug_counters: fs_read returned 0, expected 4, assuming first access [00:00:00.812,133] debug_counters: fs_read returned 0, expected 4, assuming first access [00:00:00.812,988] app: Dumping debug counters: [00:00:00.815,338] app: - boot: 1 [00:00:00.816,558] app: - reset_cause_software: 1 [00:00:00.818,054] app: - sensors_read_error: 1 [00:00:01.719,116] zigbee_app_utils: Enabled sleepy end device behavior. [00:00:02.058,868] debug_counters: fs_read returned 0, expected 4, assuming first access [00:00:02.402,282] zigbee_app_utils: Production configuration is not present or invalid (status: -1) [00:00:02.402,679] zigbee_app_utils: Zigbee stack initialized [00:00:02.407,958] zigbee_app_utils: Device started for the first time [00:00:02.407,958] zigbee_app_utils: Start network steering [00:00:02.408,020] zigbee_app_utils: Started network rejoin procedure. [00:00:03.409,027] debug_counters: fs_read returned 0, expected 4, assuming first access [00:00:03.473,419] sensors: Batt: 2692 mV (16.08%) [00:00:03.473,419] sensors: Soil: 5 % (0 mV) [00:00:03.473,449] sensors: Photo: 82 lx (0 mV) [00:00:03.473,480] sensors: Temp: 25.848083 oC [00:00:03.473,480] sensors: Humi: 43 % [00:00:03.473,510] sensors: -------------------------------------------------- [00:00:03.475,952] debug_counters: fs_read returned 0, expected 4, assuming first access [00:00:43.990,142] zigbee_app_utils: Network steering was not successful (status: -1)

I agree that faulty boards should be detected before they fail sooner rather than later, but in the case of my two boards, for some reason the SHTC3 component takes a loop before it works.

rbaron commented 10 months ago

@AlbanDurrheimer, that's interesting. I have two things on my mind:

  1. Possible intermittent hardware issue, since you have several boards and only two exhibit this behavior, and the fact that I cannot reproduce it. I can see from the photo that you have hand soldered the components (did you use a hot plate? or soldering iron?) -- the SHTC3 package is super hard to solder manually. If the SHT was factory assembled, I'd triple check the solder joints that connect it to the nRF module, and scrub off the leftover flux. It would make more sense if the issue was "works sometimes" and not "works on second call". I would run the following test: repeatedly reboot the board and see if it works at least once. Let's say 10-15 times. Do it by shorting the RST pin to GND, not removing power completely.

  2. Some timing issue with powering up the SHT sensor. How are you running the tests? If you are removing the power completely and reconnecting it, I was wondering if there's a power-up timing in the SHT. From the datasheet it seems like the power up time is super fast, measured in uS, so I find it unlikely:

image

Either way, a quick test for this is:

Let me know how it goes.

AlbanDurrheimer commented 10 months ago

@rbaron

I soldered everything by hand. For the SHTC3 I used, I soldered with a hot air flux. And as I said, on one of the two I've already desoldered and re-soldered the component to check the soldering.
So far, I've been removing and replacing the battery to restart the board. If I use the program identical to the main branch, the program already restarts in a loop. Even if I wait several minutes, the program can't read the SHTC3 (and therefore restarts). I've tried restarting the card several times with RST, with no difference.
I added a 1s delay (k_msleep(1000)) before the prst_sensors_read_all, but it didn't make any difference.
Given that the program returns an error on this line RET_IF_ERR(write_cmd(PRST_SHTC3_CMD_WAKEUP)); (and therefore on RET_IF_ERR(i2c_write_dt(&shtc3, cmd, sizeof(cmd)))), ~I wonder if the I2C connection is properly initialized when we try to use it (which would explain why, despite looped program reboot, it still doesn't work)~(tested with i2c_is_ready_dt and it's not that).

rbaron commented 10 months ago

@AlbanDurrheimer unfortunately I can't reproduce the issue, and given it only happens in two of your boards, the only explanation I can come up with is hardware related. But it does sound odd 😢 We pretty much use the same code from Nordic's tutorial for i2c IO, so I think it's unlikely it's something there. Please let us know if you manage to identify the problem.

AlbanDurrheimer commented 10 months ago

Hi @rbaron, Well, I haven't been able to determine the real cause... But since it's SHTC3's wake-up that returns the error when the sensors are read for the first time after a reset, and I haven't been able to solve it with a delay or I2C checks. I added a simple retry logic and so far, after several days and resets, it's working.

Beforce:

  // Wake the sensor up.
  RET_IF_ERR(write_cmd(PRST_SHTC3_CMD_WAKEUP));
  k_msleep(1);

After:

  // Wake the sensor up.
  if (write_cmd(PRST_SHTC3_CMD_WAKEUP))
  {
    LOG_INF("SHTC3 wake-up fails, retry...");
    k_msleep(1);
    RET_IF_ERR(write_cmd(PRST_SHTC3_CMD_WAKEUP));
  }
  k_msleep(1);