geeksville commented 5 months ago

Hardware

Other

Firmware Version

2.3.14.c67a9dfe

Description

I bet it is not wm1110 specific. Occurred while doing hundreds of power cycles.

I bet the best way to find/fix it is to turn off our "if config read fails or is corrupted completely factory reset and try again". Instead we should spin inside the ICE debugger.

Notes from chat:

@thebentern re: the mystery "lora." config got toasted thing that happened to mw wm1110 board happened again. just fyi, I'll keep an eye on it and add more instrumentation while I'm doing my other stuff but possibly there is some badness somewhere. Possibly also I'm just inadvertently stress testing because I'm cycling this board through >100 power cycles in different configs (but none of that should have led us to corrupt our flash fs). I only noticed because my power.powermon_enables field also got toasted. thebentern — Today at 6:26 PM I covet your experienced eyes on that issue, because so far it's been elusive and seemingly random geeksville — Today at 6:37 PM hmm - rather than mystery corruption I wonder if there is a bug in the adafruit nrf52 fake filesystem stuff. after I finish power crap (about another week?) i'll try to make a robust stress test and leave it running while ICEd.

Relevant log output

No response

geeksville commented 5 months ago

I might (hopefully will) look into this in a week or two.

geeksville commented 5 months ago

This same corruption probably exposed #4167 a couple of days ago.

geeksville commented 5 months ago

hmm this is less that perfect (though quite possibly unrelated to the problem):

        // brief window of risk here ;-)
        if (FSCom.exists(filename) && !FSCom.remove(filename)) {
            LOG_WARN("Can't remove old pref file\n");
        }
        if (!renameFile(filenameTmp.c_str(), filename)) {
            LOG_ERROR("Error: can't rename new pref file\n");
        }

We could eliminate this window of risk by renaming the file.new to be file.good, then remove file, then rename file.good to be filename (a 3 stage commit). Then at load time if we ever see a file.good existing, we know that we lost power during that window and file.good should be used instead of file (and file should be deleted at that point.

But this might not actually be the bug, so I'll wait until I look into this and somehow make a reproducable failure.

ImprobableStudios commented 4 months ago

I believe I am seeing the same issue on a RAK19007/RAK4631 (2 of them actually). I believe the issue is specific to nRF builds. I have roughly 10 devices (all running the same version - 2.3.13) and only the two nRF based modules have run into this.

This board is for use in a solar node (bought the second to troubleshoot and eliminate hardware), when connected via USB-C to a host (capturing serial log), the issue does not occur, but once I close it up and place it outside, the issue occurs after some seemingly random amount of time. I don't believe it is the battery going low (causing a reset), when I reattach the device, the battery is general around 90%, but appears to have experienced a factory-reset. I suspected corruption, but have not been able to capture a log of it occurring.

Open to suggestions - considered connecting an ESP32 (on a separate power source) to the serial lines and using it as a uart->wifi bridge, I suspect this is the only way I'll capture it as I have had it occur 6 times as described above, and never when attached via USB-C.

geeksville commented 4 months ago

@ImprobableStudios thank you for your report. I'm eager to help look into this but alas I'm doing powermon/structured logging/power measurements for another weekish. @thebentern and I have been talking about ideas related to this but I don't think anyone has really dove into this yet.

geeksville commented 4 months ago

and some copypasta from our chat so I don't forget it:

a slight reformatting of the list of options I could think of (ya'll can think of others possibly!):

it seems to me if there is a LFS corruption bug there are four robust possible fixes:

find and fix the bug (possibly hard but probably best fix)
Keep two separate LFS filesystems (one of which is tiny and used only for the critical config object). If the main FS shows corruption, wipe it and copy from our backup config object filesystem. Only write to the backup FS rarely (when we save a modified config object) possibly formatting that FS for every write? Or checking readback of that file after the write?
Keep one LFS FS (like current) but: after writing the config object (due to normal changes), unmount the FS, mount it and read the file back to confirm it is still good. At that point (not during bootup - when we've already lost access to our critical config state) wipe the FS and rewrite the config object to it (so at least we haven't thrown away a lot of critical settings). Bonus: this solution would allow us to keep stats on num of times we've encountered this problem in the field.
Use a different FS implementation that doesn't have the bug

I think options 1, 3 or 4 is what I'd probably lean towards (though I haven't spent any time getting a reproducable case or looking at the implementation).

I'm leaning towards quickly implementing 3 and then constructing a test case so that 1 could be acheived.

ImprobableStudios commented 4 months ago

Heh, funny... Before reading your post, I was also thinking something 2-like would be a more hardened solution, especially if you unmounted it when not in use. My thinking was store a backup of the core-config files in that partition when the config is updated leaving the NodeDb and other collected data to be wiped in the case of a failure. This, in theory, would be pretty solid as long as the issue isn't something overwriting the partition boundaries or something nasty like that.

After seeing the issue I did check and saw that nRF builds use a different underlying file system than ESP, which would help to give credence to this being an issue in the FS itself.... while I agree option 1 would be the absolute best - I also think it will be the most difficult...

Something else this makes me think about that would be nice to have is a system-telemetry set for events in the system. I design high-end sensors (think $3000 IMUs) and recently built something similar into the next generation. Here it would require PB work, etc., but keep a small telemetry partition - does not need a file system just a set of 32-bit (?) values for different events such as watchdog reset, radio errors, file-system errors, overall "uptime" since last factory-reset, etc. Would allow for establishing how well a node is performing before sticking it someplace inaccessible... Flash wear is the problem with it and why I think it is something that should only be on for test and node/fw-build eval...

I might have to put some thought into that.

thebentern commented 4 months ago

One potential caveat to option 2 is that we'll have to explore is where the BLE bonds are stored by BlueFruit, as restoring the core configuration is only half of the experience. A failed bluetooth auth flow requiring the user to forget and re-pair the device to their phone will still be an issue.

geeksville commented 4 months ago

Hmm - the adafruit fork of littlefs is quite old (last commit to their lfs.c was five years ago and the main littlefs project has many fixes which are not in the adafruit nrf52 tree... hmm...

(just adding this link for later reference: https://github.com/littlefs-project/littlefs )

I still think one of the first steps is to make a stress test that shows the problem, so that we can confirm we fixed it (after fixing whatever it is ;-)).

Also it might be that we need to be very careful to partition our BLE operations (in particular the FromNum notify call) from other LFS operations? Also see these interesting comment re shared directory structures: https://github.com/adafruit/Adafruit_nRF52_Arduino/pull/372#issuecomment-565705303

garthvh commented 4 months ago

Connecting to the public MQTT can cause the issue as well.

geeksville commented 4 months ago

Some other misc comments for when someone gets to looking at this:

Check the adafruit threads above wrt the relationship between Bluefruit (and SoftDevice?) LFS usage and the adafruit usage. Do they share the same FS or different? Are the comments about remaining mutex problems between those two usages still relevant?
Construct a stress test that repeatedly does similar FS operations to what we do. See if it can fail. Then add random power cycling to the mix and see if we can create a failure. If still no failure add BLE operations from a python stress-test client (in particular DCCC notify API calls if those old adafruit comments are still relevant). I bet some combination of the three of these we can create a stress test to show the failure.
Based on which operations can cause the failure either add more careful mutual exclusion or test a newer LFS. Then retest with stress-test to confirm fixed.

geeksville commented 3 months ago

btw: the 'good' news is that I can pretty easily reproduce this bug by power cycling a rak4630 (noticed while doing my power testing). I'm still planning on working on this issue once power stuff is done.

geeksville commented 3 months ago

oh: interesting: I can cause this corruption by performing a filesystem write 100% of the time on rak4630 by doing a write while the battery voltage is down at 3.2V.

geeksville commented 3 months ago

imo this is definitely it. over the next couple of days I'll add some instrumentation (both with the oscope and in sw to print vcc voltage via a ADC). Then I'll figure out why our brown-out detector didn't kill the processor before we ever reached this flash write. Possibly the voltage is barely above the (current) brown-out threshold and the higher current draw of doing the flash writes is enough to drop V below it. Thus almost guaranteeing flash corruption.

@garth in my case I was working on getting system-off state working and in that test on the way to deep sleep I'm doing the FS writes (essentially the same code as the normal reboot case). But I'm doing it while hooked to a programmable bench supply that was repeating the test at a series of voltages. Once at 3.3Vish things start to go bad (because the LDO on the rak board requires a min 3.3V input to meet spec - I bet it's output voltage to the processor in that case begins to get real-crummy real-fast). I'll know better once I can read that voltage in sw and with a blue wire.

geeksville commented 3 months ago

Important Note: BatmansWang on discord reports having corruption without low batteries so there might be something else going on (possibly not related to voltage, possibly still related - tbd)

geeksville commented 3 months ago

Loss of config due to filesystem size limits

(Any comments/advice on the notes below are apprecated!) (This is distinct from the 'loss of config due to low-power filesystem corruption' problem discussed elsewhere)

I think I found another (far more common in the field probably) cause of filesystem corruption (and caused this in my test config). If you are in an area that can see a lot of nodes it is possible for your local node DB to reach the maximum size of 100 nodes. This means that the final size of db.proto on the disk would be 16600 bytes + a few kb for the other stuff in deviceState. Which is a bummer because on the nrf52 our filesystem size is only 28672.

We always write pref files in an atomic fashion (first writing the new file then deleting the old version). Since 2x16600 it guarantees in that case that the write will fail (because bigger than the filesystem).

This causes two failures - one less serious and the other which can cause loss of all device state (even at less than 100 nodes).

case 1: The less serious case (but still probably catastrophic)

If this db.proto write fails because of not checking okay in https://github.com/meshtastic/firmware/blob/1a1d545c38bfb3daa298d4aee03ec97f51d1b3ee/src/mesh/NodeDB.cpp#L701-L717 we will delete the old (valid) db and rename the not-fully-written&invalid (but max size for what the FS could allow) as db.proto. The remaining pref writes will probably succeed because we just freed up >10KB on the FS via that delete. So the later config.proto (super critical) write will go fine.
At the next boot we will read this (guaranteed partially invalid) db.proto file. Fortunately the 'big' portion in the protobuf (the array of nodedblite) will be last. But we are still out of luck because the protobuf read may return an error code: because odds are super high that the last written nodeinfo is only partial and the will fail protobuf decode.
When that happens the current code wipes everything. https://github.com/meshtastic/firmware/blob/4b4c1669a964c7e3f543952a2633db09afbbf2a1/src/mesh/NodeDB.cpp#L611-L616

case 2: The more serious case:

Our order of writes is db.proto (not critical) then config.proto (this one is super critical) then module.proto (slightly critical) and channels.proto (medium critical)
Consider the case where we are adding new nodes as we see them in the mesh. Each new node needs 166 bytes on the disk for storage.
(It is worth noting that the node record is approximately the size the 128 littlefs blocks, so as we add nodes we 'nicely' can have high likelyhood of starving out later files)
Writes of the nodedb to the FS normally occur at shutdown (and periodically once per day? I forget and haven't checked yet). Possibly something in the interaction with the android app does something as well that triggers a nodedb write (I haven't checked yet)
If you are unlucky, the db will be written while the db.proto is still small enough to successfully write, but the later files (in particular config.proto) will find the filesystem now full.
When this happens the bug related to okay handling in the lines linked above will guarantee that the old (valid) config file is deleted and the new partially written (guaranteed invalid) file will be kept as successful.
This set of write failures will keep happening (and be undetected) until either the node is rebooted for some reason or case 1 begins to occur. If case 1 happens there will again be enough space for the critical config.proto file and things will be fixed (though the nodedb will be empty next reboot)

For reference here's some current file sizes as my node was slowly learning the mesh (in a busy area that can certainly see 100 nodes)

DEBUG | 15:15:51 2687 [Power] File db.proto (size 7000)
DEBUG | 15:15:51 2687 [Power] File config.proto (size 99)
DEBUG | 15:15:51 2687 [Power] File module.proto (size 101)
DEBUG | 15:15:51 2687 [Power] File channels.proto (size 57)
DEBUG | 15:15:51 2687 [Power] 5 files (total size 7833)

Fix description

I've partially completed the fixes for this 'non low power related' fault. Summary of the fixes I'm making:

move the atomic file stuff/file checking stuff into a new class/file SafeFile.h/cpp. (this has some good paranoia about file checking before delete/rename see description below)
cleanup NodeDB.cpp to use SafeFile
At load time consider inability to read db.proto non-fatal
For the db.proto write, set fullAtomic to false - which should guarantee that this bug can not occur anymore (even if we were super unlucky and lost power while writing db.proto, the loss of that one file is non fatal)
Move db.proto writing to the end of of the list of files to write (so that those other much more critical files have first shot at filesystem capacity).
change all the various save methods in nodeDB to properly preserve and pass up the boolean that indicates successful writes.
If there really is a filesystem corruption problem at the nrf52 layer (see below for discussion of that possible issue/non-issue) because the fault indication is now passed all the way up to NodeDB::saveToDisk() at that level if any of the above writes fail: reformat the FS and write all files from scratch (I've verified that only 'our' files are on this filesystem). It is important to do this at this point, while we still have valid config at RAM.
Change the load code to not wipe config if db reads fail (still wipe for the other failures)
If we ever detect file corruption generate a CriticalError. The node will keep working fine but a message will appear on the screen (and be passed to the client apps) so that hopefully we'll receive reports from the field that there is a remaining filesystem problem still existing. (Though notably: with these fixes even if there is yet another problem out there the odds we lose node config is now far lower)

docs for SafeFile

/**
 * This class provides 'safe'/paranoid file writing.
 *
 * Some of our filesystems (in particular the nrf52) may have bugs beneath our layer.  Therefore we want to
 * be very careful about how we write files.  This class provides a restricted (Stream only) writing API for writing to files.
 *
 * Notably:
 * - we keep a simple xor hash of all characters that were written.
 * - We do not allow seeking (because we want to maintain our hash)
 * - we provide an close() method which is similar to close but returns false if we were unable to successfully write the
 * file.  Also this method
 * - atomically replaces any old version of the file on the disk with our new file (after first rereading the file from the disk
 * to confirm the hash matches)
 * - Some files are super huge so we can't do the full atomic rename/copy (because of filesystem size limits).  If !fullAtomic then
 * we still do the readback to verify file is valid so higher level code can handle failures.
 */
class SafeFile : public Print
...

geeksville commented 3 months ago

ok all the stuff needed for nrf52 is done. Tomorrow I'll test on esp32 (and check esp32 filesystem size) then send in the PR.

geeksville commented 3 months ago

Test results on ESP32

ESP32 very unlikely to exhibit this bug, because the filesystem there is huge and empty.

INFO  | 16:54:13 64 [Power] File /prefs/channels.proto (size 57)
INFO  | 16:54:13 64 [Power] File /prefs/config.proto (size 115)
INFO  | 16:54:13 64 [Power] File /prefs/db.proto (size 8442)
INFO  | 16:54:13 64 [Power] File /prefs/module.proto (size 94)
INFO  | 16:54:13 64 [Power] 4 files (total size 8708)
DEBUG | 16:54:13 64 [Power] Filesystem files (28672/1048576 Bytes):

meshtastic / firmware

[Bug]: config file corruption due to filesystem size #4184

Category

Hardware

Firmware Version

Description

Relevant log output

Loss of config due to filesystem size limits

case 1: The less serious case (but still probably catastrophic)

case 2: The more serious case:

Fix description

docs for SafeFile

Test results on ESP32