Closed geeksville closed 3 months ago
I might (hopefully will) look into this in a week or two.
This same corruption probably exposed #4167 a couple of days ago.
hmm this is less that perfect (though quite possibly unrelated to the problem):
// brief window of risk here ;-)
if (FSCom.exists(filename) && !FSCom.remove(filename)) {
LOG_WARN("Can't remove old pref file\n");
}
if (!renameFile(filenameTmp.c_str(), filename)) {
LOG_ERROR("Error: can't rename new pref file\n");
}
We could eliminate this window of risk by renaming the file.new to be file.good, then remove file, then rename file.good to be filename (a 3 stage commit). Then at load time if we ever see a file.good existing, we know that we lost power during that window and file.good should be used instead of file (and file should be deleted at that point.
But this might not actually be the bug, so I'll wait until I look into this and somehow make a reproducable failure.
I believe I am seeing the same issue on a RAK19007/RAK4631 (2 of them actually). I believe the issue is specific to nRF builds. I have roughly 10 devices (all running the same version - 2.3.13) and only the two nRF based modules have run into this.
This board is for use in a solar node (bought the second to troubleshoot and eliminate hardware), when connected via USB-C to a host (capturing serial log), the issue does not occur, but once I close it up and place it outside, the issue occurs after some seemingly random amount of time. I don't believe it is the battery going low (causing a reset), when I reattach the device, the battery is general around 90%, but appears to have experienced a factory-reset. I suspected corruption, but have not been able to capture a log of it occurring.
Open to suggestions - considered connecting an ESP32 (on a separate power source) to the serial lines and using it as a uart->wifi bridge, I suspect this is the only way I'll capture it as I have had it occur 6 times as described above, and never when attached via USB-C.
@ImprobableStudios thank you for your report. I'm eager to help look into this but alas I'm doing powermon/structured logging/power measurements for another weekish. @thebentern and I have been talking about ideas related to this but I don't think anyone has really dove into this yet.
and some copypasta from our chat so I don't forget it:
a slight reformatting of the list of options I could think of (ya'll can think of others possibly!):
it seems to me if there is a LFS corruption bug there are four robust possible fixes:
I think options 1, 3 or 4 is what I'd probably lean towards (though I haven't spent any time getting a reproducable case or looking at the implementation).
I'm leaning towards quickly implementing 3 and then constructing a test case so that 1 could be acheived.
Heh, funny... Before reading your post, I was also thinking something 2-like would be a more hardened solution, especially if you unmounted it when not in use. My thinking was store a backup of the core-config files in that partition when the config is updated leaving the NodeDb and other collected data to be wiped in the case of a failure. This, in theory, would be pretty solid as long as the issue isn't something overwriting the partition boundaries or something nasty like that.
After seeing the issue I did check and saw that nRF builds use a different underlying file system than ESP, which would help to give credence to this being an issue in the FS itself.... while I agree option 1 would be the absolute best - I also think it will be the most difficult...
Something else this makes me think about that would be nice to have is a system-telemetry set for events in the system. I design high-end sensors (think $3000 IMUs) and recently built something similar into the next generation. Here it would require PB work, etc., but keep a small telemetry partition - does not need a file system just a set of 32-bit (?) values for different events such as watchdog reset, radio errors, file-system errors, overall "uptime" since last factory-reset, etc. Would allow for establishing how well a node is performing before sticking it someplace inaccessible... Flash wear is the problem with it and why I think it is something that should only be on for test and node/fw-build eval...
I might have to put some thought into that.
One potential caveat to option 2 is that we'll have to explore is where the BLE bonds are stored by BlueFruit, as restoring the core configuration is only half of the experience. A failed bluetooth auth flow requiring the user to forget and re-pair the device to their phone will still be an issue.
Hmm - the adafruit fork of littlefs is quite old (last commit to their lfs.c was five years ago and the main littlefs project has many fixes which are not in the adafruit nrf52 tree... hmm...
(just adding this link for later reference: https://github.com/littlefs-project/littlefs )
I still think one of the first steps is to make a stress test that shows the problem, so that we can confirm we fixed it (after fixing whatever it is ;-)).
Also it might be that we need to be very careful to partition our BLE operations (in particular the FromNum notify call) from other LFS operations? Also see these interesting comment re shared directory structures: https://github.com/adafruit/Adafruit_nRF52_Arduino/pull/372#issuecomment-565705303
Connecting to the public MQTT can cause the issue as well.
Some other misc comments for when someone gets to looking at this:
btw: the 'good' news is that I can pretty easily reproduce this bug by power cycling a rak4630 (noticed while doing my power testing). I'm still planning on working on this issue once power stuff is done.
oh: interesting: I can cause this corruption by performing a filesystem write 100% of the time on rak4630 by doing a write while the battery voltage is down at 3.2V.
imo this is definitely it. over the next couple of days I'll add some instrumentation (both with the oscope and in sw to print vcc voltage via a ADC). Then I'll figure out why our brown-out detector didn't kill the processor before we ever reached this flash write. Possibly the voltage is barely above the (current) brown-out threshold and the higher current draw of doing the flash writes is enough to drop V below it. Thus almost guaranteeing flash corruption.
@garth in my case I was working on getting system-off state working and in that test on the way to deep sleep I'm doing the FS writes (essentially the same code as the normal reboot case). But I'm doing it while hooked to a programmable bench supply that was repeating the test at a series of voltages. Once at 3.3Vish things start to go bad (because the LDO on the rak board requires a min 3.3V input to meet spec - I bet it's output voltage to the processor in that case begins to get real-crummy real-fast). I'll know better once I can read that voltage in sw and with a blue wire.
Important Note: BatmansWang on discord reports having corruption without low batteries so there might be something else going on (possibly not related to voltage, possibly still related - tbd)
(Any comments/advice on the notes below are apprecated!) (This is distinct from the 'loss of config due to low-power filesystem corruption' problem discussed elsewhere)
I think I found another (far more common in the field probably) cause of filesystem corruption (and caused this in my test config). If you are in an area that can see a lot of nodes it is possible for your local node DB to reach the maximum size of 100 nodes. This means that the final size of db.proto on the disk would be 16600 bytes + a few kb for the other stuff in deviceState. Which is a bummer because on the nrf52 our filesystem size is only 28672.
We always write pref files in an atomic fashion (first writing the new file then deleting the old version). Since 2x16600 it guarantees in that case that the write will fail (because bigger than the filesystem).
This causes two failures - one less serious and the other which can cause loss of all device state (even at less than 100 nodes).
For reference here's some current file sizes as my node was slowly learning the mesh (in a busy area that can certainly see 100 nodes)
DEBUG | 15:15:51 2687 [Power] File db.proto (size 7000)
DEBUG | 15:15:51 2687 [Power] File config.proto (size 99)
DEBUG | 15:15:51 2687 [Power] File module.proto (size 101)
DEBUG | 15:15:51 2687 [Power] File channels.proto (size 57)
DEBUG | 15:15:51 2687 [Power] 5 files (total size 7833)
I've partially completed the fixes for this 'non low power related' fault. Summary of the fixes I'm making:
/**
* This class provides 'safe'/paranoid file writing.
*
* Some of our filesystems (in particular the nrf52) may have bugs beneath our layer. Therefore we want to
* be very careful about how we write files. This class provides a restricted (Stream only) writing API for writing to files.
*
* Notably:
* - we keep a simple xor hash of all characters that were written.
* - We do not allow seeking (because we want to maintain our hash)
* - we provide an close() method which is similar to close but returns false if we were unable to successfully write the
* file. Also this method
* - atomically replaces any old version of the file on the disk with our new file (after first rereading the file from the disk
* to confirm the hash matches)
* - Some files are super huge so we can't do the full atomic rename/copy (because of filesystem size limits). If !fullAtomic then
* we still do the readback to verify file is valid so higher level code can handle failures.
*/
class SafeFile : public Print
...
ok all the stuff needed for nrf52 is done. Tomorrow I'll test on esp32 (and check esp32 filesystem size) then send in the PR.
ESP32 very unlikely to exhibit this bug, because the filesystem there is huge and empty.
INFO | 16:54:13 64 [Power] File /prefs/channels.proto (size 57)
INFO | 16:54:13 64 [Power] File /prefs/config.proto (size 115)
INFO | 16:54:13 64 [Power] File /prefs/db.proto (size 8442)
INFO | 16:54:13 64 [Power] File /prefs/module.proto (size 94)
INFO | 16:54:13 64 [Power] 4 files (total size 8708)
DEBUG | 16:54:13 64 [Power] Filesystem files (28672/1048576 Bytes):
Category
Other
Hardware
Other
Firmware Version
2.3.14.c67a9dfe
Description
I bet it is not wm1110 specific. Occurred while doing hundreds of power cycles.
I bet the best way to find/fix it is to turn off our "if config read fails or is corrupted completely factory reset and try again". Instead we should spin inside the ICE debugger.
Notes from chat:
Relevant log output
No response