Closed geeksville closed 2 months ago
btw: see excellent log and comment from @lewisxhe in https://github.com/meshtastic/firmware/pull/3818#issue-2282319551.
I think we might need to actually define LFS_ASSERT to set a flag indicating that our current file operation had a low level failure and check for that flag in SafeFile (because the lfs routines are not returning a failure code in this case).
In particular this code which is called for both file reading and writing:
static int lfs_ctz_find(lfs_t *lfs,
lfs_cache_t *rcache, const lfs_cache_t *pcache,
lfs_block_t head, lfs_size_t size,
lfs_size_t pos, lfs_block_t *block, lfs_off_t *off) {
if (size == 0) {
*block = 0xffffffff;
*off = 0;
return 0;
}
lfs_off_t current = lfs_ctz_index(lfs, &(lfs_off_t){size-1});
lfs_off_t target = lfs_ctz_index(lfs, &pos);
while (current > target) {
lfs_size_t skip = lfs_min(
lfs_npw2(current-target+1) - 1,
lfs_ctz(current));
int err = lfs_cache_read(lfs, rcache, pcache, head, 4*skip, &head, 4);
head = lfs_fromle32(head);
if (err) {
return err;
}
LFS_ASSERT(head >= 2 && head <= lfs->cfg->block_count);
current -= 1 << skip;
}
*block = head;
*off = pos;
return 0;
}
Okay - after a fair amount of stepping through lfs.c, I've definitely found the cause of this bug and I have some fixes to help with it. Alas, I need to clean it up before checkin. It is a little tricky because want to avoid having to fork the adafruit project (again) - so I'll need to put it in our code instead. Unfortunately we have some visitors until Sunday night so I might not be sending this one in until Monday.
The fixes attached to this issue fix 'most' (but not all - see #4447 for findings/risk estimate/recommentation for 3.0) remaining problems for nrf52 filesystem reliability.
With the improved LFS debug output we can see that if we rebooted while a file was open for writing the next LFS write will likely fail (in lfs_deorphan).
INFO | ??:??:?? 5 Saving /prefs/config.proto
ERROR | ??:??:?? 5 LFS assert: pcache->block == 0xffffffffLFS assert: entry.d.type == LFS_TYPE_REGCan't open tmp file for readback
ERROR | ??:??:?? 5 Can't write prefs!
INFO | ??:??:?? 5 Saving /prefs/module.proto
ERROR | ??:??:?? 5 LFS assert: entry.d.type == LFS_TYPE_REGCan't open tmp file for readback
ERROR | ??:??:?? 5 Can't write prefs!
INFO | ??:??:?? 6 Saving /prefs/channels.proto
INFO | ??:??:?? 6 Saving /prefs/db.proto
INFO | ??:??:?? 7 Clearing bluetooth bonds!
Issue #3818 improved the LFS_ASSERT behavior by making it not reboot the board, unfortunately it didn't pass up an error code we could detect at file write time so we can handle this failure cleanly.
The existing lfs.c code assumed that LFS_ASSERT would NEVER return (if failing), so it has no logic to return a failure code in that case. This set of changes includes a new lfs_assert_failed global that the new lfs_assert() function calls. In SafeFile.cpp we now check for this flag and consider the write/read to be failed regardless of what the File methods say.
The good news is that the filesystem READS would still succeed.
The bad news is that the (well intended) test write of meshtastic.txt was guaranteed to encounter this failure.
So IMO we should not be writing mesthastic.txt as a test. Instead we should attempt to read out config files (reads are not likely to fail in this case), then attempt to rewrite our config files. If the write of our config fails we reformat the FS (was was done when the 'meshtastic.txt' test fails) and write our (valid) config files.
The old behaviour would see the failure on writing mesthastic.txt (before we read our config files) and at that point all hope was lost: because it reformatted - losing the config forever.
We have to add the following nast check of file.name to listDir().
while (file && file.name()[0]) { // This file.name() check is a workaround for a bug in the Adafruit LittleFS nrf52 glue
Because the adafruit iteration code doesn't properly clear _file/_dir in the case of errors returned by little-fs. This makes the check of f.isOpened() falsely return true when really the file does not exist (because we failed while iterating).
if (!FSCom.exists(filename)) {
LOG_INFO("File %s not found\n", filename);
return LoadFileResult::NOT_FOUND;
}
The exists method has the same bug as problem 2, if we fail while iterating through any file (even if we would have succeeded in reading from filename), exists returns false. So do not use exists() (so we don't need to fork the adafruit lib to fix it there). Instead just try to open the file and only fail if that operation can not succeed.
While stress testing (rarely - a couple of times a day on a node repeatedly writing/renaming/rebooting/reading files) filesystem corruption was seen that was not caused due to a reboot while writing. After some debugging and checking github history, it seems that someone else had already found and fixed this bug: https://github.com/littlefs-project/littlefs/commit/97d8d5e96a7781596708664f18f2ea6c3a179330
As poor luck would have it they fixed it a few days after adafruit forked from the master LFS tree. The version in adafruit was forked on oct 15,2018 https://github.com/adafruit/Adafruit_nRF52_Arduino/commit/f32868509b8075366428305c860edf576499b6f1. The version in framework-arduinoadafruitnrf52 is: commit cb62bf2188854c5b7c44383571ebb19a414e6137 (tag: v1.6.1)
Since it seemed I needed to update LFS, I went ahead and updated it to the 'final' LFS 1.x release (v1.7.2) so that hopefully this update will be more useful to other adafruit users. I'll be sending them a PR but for the time being I'm changing meshtastic to pull from my fresh fork of their project (see attached commits).
With this fix the 'rename related' corruption no longer occurs.
I updated lfs.c, lfs.h and lfs_util.h by first generating a patch file between the old adafruit tree and v1.6.1 of LFS, then copied in the v1.7.2 version of those three files, then reapplied the patch file so that the (small number of) existing adafruit local changes were preserved.
adafruit specific changes to LFS are capured now in this patch file in ~/tmp
diff -crB ~/development/meshtastic/littlefs littlefs > dfile.patch
kevinh@kdesktop:~/.platformio/packages/framework-arduinoadafruitnrf52/libraries/Adafruit_LittleFS/src$ code dfile.patch
kevinh@kdesktop:~/.platformio/packages/framework-arduinoadafruitnrf52/libraries/Adafruit_LittleFS/src$ mv dfile.patch ~/tmp/lfs-adafruit.patch
apply patch with "patch -p0 --dry-run < dfile.patch"
(These are probably not super interesting, but they were in my notes so I'm keeping them for future searchers/me)
Theory: perhaps the problem is that the flash page is still in writing mode upon reboot?!!! Result: No - flash is not in write mode after reboot
Theory: is it possible other flash writes are in the queue so the count of our semaphore gets fucked up?
Result: NO - because happens even if soft device is disabled with useSoftDevice=false!
Thoery: does flash_cache.c have a bug Result: NO (I think) due to fairly careful inspection (and temporarily disabling caching entirely)
Theory: should we throttle nodeDB filesystem writes Result: NO. Currently we only write when about to reboot/shutdown or 1-minute after any DB changes. So in the normal case we'll be writing very rarely anyways so the remaining risk is lowish.
Theory: icache is disabled if you use debugger to reset CPU while programming? Result: Probably minor and possibly related to me being under the debugger. https://infocenter.nordicsemi.com/topic/ps_nrf52840/nvmc.html?cp=5_0_0_3_2#register.ICACHECNF
Category
Hardware Compatibility
Hardware
Rak4631
Firmware Version
master (private build)
Description
Moving power-supply related flash corruption out of #4184 and into this bug.
Relevant comments copied:
btw: the 'good' news is that I can pretty easily reproduce this bug by power cycling a rak4630 (noticed while doing my power testing). I'm still planning on working on this issue once power stuff is done.
oh: interesting: I can cause this corruption by performing a filesystem write 100% of the time on rak4630 by doing a write while the battery voltage is down at 3.2V.
Relevant log output
No response