meshtastic / firmware

Meshtastic device firmware
https://meshtastic.org
GNU General Public License v3.0
3.63k stars 909 forks source link

[Bug]: config file corruption due to filesystem size #4184

Closed geeksville closed 3 months ago

geeksville commented 5 months ago

Category

Other

Hardware

Other

Firmware Version

2.3.14.c67a9dfe

Description

I bet it is not wm1110 specific. Occurred while doing hundreds of power cycles.

I bet the best way to find/fix it is to turn off our "if config read fails or is corrupted completely factory reset and try again". Instead we should spin inside the ICE debugger.

Notes from chat:

@thebentern re: the mystery "lora." config got toasted thing that happened to mw wm1110 board happened again. just fyi, I'll keep an eye on it and add more instrumentation while I'm doing my other stuff but possibly there is some badness somewhere. Possibly also I'm just inadvertently stress testing because I'm cycling this board through >100 power cycles in different configs (but none of that should have led us to corrupt our flash fs). I only noticed because my power.powermon_enables field also got toasted. thebentern — Today at 6:26 PM I covet your experienced eyes on that issue, because so far it's been elusive and seemingly random geeksville — Today at 6:37 PM hmm - rather than mystery corruption I wonder if there is a bug in the adafruit nrf52 fake filesystem stuff. after I finish power crap (about another week?) i'll try to make a robust stress test and leave it running while ICEd.

Relevant log output

No response

geeksville commented 5 months ago

I might (hopefully will) look into this in a week or two.

geeksville commented 5 months ago

This same corruption probably exposed #4167 a couple of days ago.

geeksville commented 5 months ago

hmm this is less that perfect (though quite possibly unrelated to the problem):

        // brief window of risk here ;-)
        if (FSCom.exists(filename) && !FSCom.remove(filename)) {
            LOG_WARN("Can't remove old pref file\n");
        }
        if (!renameFile(filenameTmp.c_str(), filename)) {
            LOG_ERROR("Error: can't rename new pref file\n");
        }

We could eliminate this window of risk by renaming the file.new to be file.good, then remove file, then rename file.good to be filename (a 3 stage commit). Then at load time if we ever see a file.good existing, we know that we lost power during that window and file.good should be used instead of file (and file should be deleted at that point.

But this might not actually be the bug, so I'll wait until I look into this and somehow make a reproducable failure.

ImprobableStudios commented 4 months ago

I believe I am seeing the same issue on a RAK19007/RAK4631 (2 of them actually). I believe the issue is specific to nRF builds. I have roughly 10 devices (all running the same version - 2.3.13) and only the two nRF based modules have run into this.

This board is for use in a solar node (bought the second to troubleshoot and eliminate hardware), when connected via USB-C to a host (capturing serial log), the issue does not occur, but once I close it up and place it outside, the issue occurs after some seemingly random amount of time. I don't believe it is the battery going low (causing a reset), when I reattach the device, the battery is general around 90%, but appears to have experienced a factory-reset. I suspected corruption, but have not been able to capture a log of it occurring.

Open to suggestions - considered connecting an ESP32 (on a separate power source) to the serial lines and using it as a uart->wifi bridge, I suspect this is the only way I'll capture it as I have had it occur 6 times as described above, and never when attached via USB-C.

geeksville commented 4 months ago

@ImprobableStudios thank you for your report. I'm eager to help look into this but alas I'm doing powermon/structured logging/power measurements for another weekish. @thebentern and I have been talking about ideas related to this but I don't think anyone has really dove into this yet.

geeksville commented 4 months ago

and some copypasta from our chat so I don't forget it:

a slight reformatting of the list of options I could think of (ya'll can think of others possibly!):

it seems to me if there is a LFS corruption bug there are four robust possible fixes:

  1. find and fix the bug (possibly hard but probably best fix)
  2. Keep two separate LFS filesystems (one of which is tiny and used only for the critical config object). If the main FS shows corruption, wipe it and copy from our backup config object filesystem. Only write to the backup FS rarely (when we save a modified config object) possibly formatting that FS for every write? Or checking readback of that file after the write?
  3. Keep one LFS FS (like current) but: after writing the config object (due to normal changes), unmount the FS, mount it and read the file back to confirm it is still good. At that point (not during bootup - when we've already lost access to our critical config state) wipe the FS and rewrite the config object to it (so at least we haven't thrown away a lot of critical settings). Bonus: this solution would allow us to keep stats on num of times we've encountered this problem in the field.
  4. Use a different FS implementation that doesn't have the bug

I think options 1, 3 or 4 is what I'd probably lean towards (though I haven't spent any time getting a reproducable case or looking at the implementation).

I'm leaning towards quickly implementing 3 and then constructing a test case so that 1 could be acheived.

ImprobableStudios commented 4 months ago

Heh, funny... Before reading your post, I was also thinking something 2-like would be a more hardened solution, especially if you unmounted it when not in use. My thinking was store a backup of the core-config files in that partition when the config is updated leaving the NodeDb and other collected data to be wiped in the case of a failure. This, in theory, would be pretty solid as long as the issue isn't something overwriting the partition boundaries or something nasty like that.

After seeing the issue I did check and saw that nRF builds use a different underlying file system than ESP, which would help to give credence to this being an issue in the FS itself.... while I agree option 1 would be the absolute best - I also think it will be the most difficult...

Something else this makes me think about that would be nice to have is a system-telemetry set for events in the system. I design high-end sensors (think $3000 IMUs) and recently built something similar into the next generation. Here it would require PB work, etc., but keep a small telemetry partition - does not need a file system just a set of 32-bit (?) values for different events such as watchdog reset, radio errors, file-system errors, overall "uptime" since last factory-reset, etc. Would allow for establishing how well a node is performing before sticking it someplace inaccessible... Flash wear is the problem with it and why I think it is something that should only be on for test and node/fw-build eval...

I might have to put some thought into that.

thebentern commented 4 months ago

One potential caveat to option 2 is that we'll have to explore is where the BLE bonds are stored by BlueFruit, as restoring the core configuration is only half of the experience. A failed bluetooth auth flow requiring the user to forget and re-pair the device to their phone will still be an issue.

geeksville commented 4 months ago

Hmm - the adafruit fork of littlefs is quite old (last commit to their lfs.c was five years ago and the main littlefs project has many fixes which are not in the adafruit nrf52 tree... hmm...

(just adding this link for later reference: https://github.com/littlefs-project/littlefs )

I still think one of the first steps is to make a stress test that shows the problem, so that we can confirm we fixed it (after fixing whatever it is ;-)).

Also it might be that we need to be very careful to partition our BLE operations (in particular the FromNum notify call) from other LFS operations? Also see these interesting comment re shared directory structures: https://github.com/adafruit/Adafruit_nRF52_Arduino/pull/372#issuecomment-565705303

garthvh commented 4 months ago

Connecting to the public MQTT can cause the issue as well.

geeksville commented 4 months ago

Some other misc comments for when someone gets to looking at this:

geeksville commented 3 months ago

btw: the 'good' news is that I can pretty easily reproduce this bug by power cycling a rak4630 (noticed while doing my power testing). I'm still planning on working on this issue once power stuff is done.

geeksville commented 3 months ago

oh: interesting: I can cause this corruption by performing a filesystem write 100% of the time on rak4630 by doing a write while the battery voltage is down at 3.2V.

geeksville commented 3 months ago

imo this is definitely it. over the next couple of days I'll add some instrumentation (both with the oscope and in sw to print vcc voltage via a ADC). Then I'll figure out why our brown-out detector didn't kill the processor before we ever reached this flash write. Possibly the voltage is barely above the (current) brown-out threshold and the higher current draw of doing the flash writes is enough to drop V below it. Thus almost guaranteeing flash corruption.

@garth in my case I was working on getting system-off state working and in that test on the way to deep sleep I'm doing the FS writes (essentially the same code as the normal reboot case). But I'm doing it while hooked to a programmable bench supply that was repeating the test at a series of voltages. Once at 3.3Vish things start to go bad (because the LDO on the rak board requires a min 3.3V input to meet spec - I bet it's output voltage to the processor in that case begins to get real-crummy real-fast). I'll know better once I can read that voltage in sw and with a blue wire.

geeksville commented 3 months ago

Important Note: BatmansWang on discord reports having corruption without low batteries so there might be something else going on (possibly not related to voltage, possibly still related - tbd)

geeksville commented 3 months ago

Loss of config due to filesystem size limits

(Any comments/advice on the notes below are apprecated!) (This is distinct from the 'loss of config due to low-power filesystem corruption' problem discussed elsewhere)

I think I found another (far more common in the field probably) cause of filesystem corruption (and caused this in my test config). If you are in an area that can see a lot of nodes it is possible for your local node DB to reach the maximum size of 100 nodes. This means that the final size of db.proto on the disk would be 16600 bytes + a few kb for the other stuff in deviceState. Which is a bummer because on the nrf52 our filesystem size is only 28672.

We always write pref files in an atomic fashion (first writing the new file then deleting the old version). Since 2x16600 it guarantees in that case that the write will fail (because bigger than the filesystem).

This causes two failures - one less serious and the other which can cause loss of all device state (even at less than 100 nodes).

case 1: The less serious case (but still probably catastrophic)

case 2: The more serious case:

For reference here's some current file sizes as my node was slowly learning the mesh (in a busy area that can certainly see 100 nodes)

DEBUG | 15:15:51 2687 [Power] File db.proto (size 7000)
DEBUG | 15:15:51 2687 [Power] File config.proto (size 99)
DEBUG | 15:15:51 2687 [Power] File module.proto (size 101)
DEBUG | 15:15:51 2687 [Power] File channels.proto (size 57)
DEBUG | 15:15:51 2687 [Power] 5 files (total size 7833)

Fix description

I've partially completed the fixes for this 'non low power related' fault. Summary of the fixes I'm making:

docs for SafeFile

/**
 * This class provides 'safe'/paranoid file writing.
 *
 * Some of our filesystems (in particular the nrf52) may have bugs beneath our layer.  Therefore we want to
 * be very careful about how we write files.  This class provides a restricted (Stream only) writing API for writing to files.
 *
 * Notably:
 * - we keep a simple xor hash of all characters that were written.
 * - We do not allow seeking (because we want to maintain our hash)
 * - we provide an close() method which is similar to close but returns false if we were unable to successfully write the
 * file.  Also this method
 * - atomically replaces any old version of the file on the disk with our new file (after first rereading the file from the disk
 * to confirm the hash matches)
 * - Some files are super huge so we can't do the full atomic rename/copy (because of filesystem size limits).  If !fullAtomic then
 * we still do the readback to verify file is valid so higher level code can handle failures.
 */
class SafeFile : public Print
...
geeksville commented 3 months ago

ok all the stuff needed for nrf52 is done. Tomorrow I'll test on esp32 (and check esp32 filesystem size) then send in the PR.

geeksville commented 3 months ago

Test results on ESP32

ESP32 very unlikely to exhibit this bug, because the filesystem there is huge and empty.

INFO  | 16:54:13 64 [Power] File /prefs/channels.proto (size 57)
INFO  | 16:54:13 64 [Power] File /prefs/config.proto (size 115)
INFO  | 16:54:13 64 [Power] File /prefs/db.proto (size 8442)
INFO  | 16:54:13 64 [Power] File /prefs/module.proto (size 94)
INFO  | 16:54:13 64 [Power] 4 files (total size 8708)
DEBUG | 16:54:13 64 [Power] Filesystem files (28672/1048576 Bytes):