Closed · etiennedm closed 5 months ago
Hello,
Of course it wasn't an issue with littlefs. I still don't fully understand the issue, but lowering the speed of the SPI bus used to talk to the flash made it go away.
For anyone struggling with a similar issue: I was using a bus speed of 32MHz (the maximum speed specified for the MX25R6435F is supposed to be 33MHz in ultra low power mode / 1x IO). Switching to 8MHz solved the issue; 16MHz works as well.
Thanks for all the work on littlefs and BR, Etienne
Ah, that makes sense. Glad you figured it out.
The maximum speed specified probably assumes a somewhat perfect bus. If the wires/traces connecting the chip are too long/exposed, extra capacitance or noise can cause the signal to not transition in time and corrupt things.
For this reason most higher-speed protocols (SD, USB, PCIe, etc.) include a checksum to protect the transaction itself. But not SPI; SPI is a simple protocol.
littlefs also responds to this situation pretty poorly. When it sees a bad checksum, it thinks a power-loss occurred and tries to roll back to a previously good state. But this just makes the filesystem inconsistent.
This is a known issue and there is work planned to add an additional global-checksum so that non-power-loss checksum failures are correctly reported as a corrupted filesystem.
Well, what threw me off is that I'm reusing a prototype board from work that we run with the SPI bus speed set at 32MHz, although not with a filesystem but with a much more basic TLV-based circular buffer for data storage (and no checksum, so who knows what is actually happening).
Again, for anyone stumbling on this issue down the road: I was able to reproduce the flash issues by running erase-write-read cycles on flash pages. With Zephyr this is as easy as enabling the flash test shell commands with CONFIG_FLASH_SHELL_TEST_COMMANDS=y.
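If the shell isn't available, the same erase-write-read cycle can be driven directly from the Zephyr flash driver API. A rough sketch (the mx25r64 devicetree node label and the 4 KiB sector size are assumptions for a typical nRF52840 + MX25R6435F setup, not necessarily my exact configuration):

```c
/* Minimal erase-write-read-verify loop over a few flash sectors.
 * Sketch only: the mx25r64 node label and the 4 KiB sector size
 * are assumptions; adapt to your board/flash part. */
#include <string.h>
#include <zephyr/kernel.h>
#include <zephyr/device.h>
#include <zephyr/drivers/flash.h>

#define TEST_OFFSET   0x000000      /* start of the region to exercise */
#define SECTOR_SIZE   4096          /* assumed erase sector size */
#define SECTOR_COUNT  16            /* how many sectors to test */

int main(void)
{
    const struct device *flash = DEVICE_DT_GET(DT_NODELABEL(mx25r64));
    static uint8_t wbuf[SECTOR_SIZE], rbuf[SECTOR_SIZE];

    if (!device_is_ready(flash)) {
        printk("flash device not ready\n");
        return -1;
    }

    for (int i = 0; i < SECTOR_COUNT; i++) {
        off_t off = TEST_OFFSET + (off_t)i * SECTOR_SIZE;

        /* fill the write buffer with a sector-dependent pattern */
        for (size_t j = 0; j < SECTOR_SIZE; j++) {
            wbuf[j] = (uint8_t)(i + j);
        }

        if (flash_erase(flash, off, SECTOR_SIZE) ||
            flash_write(flash, off, wbuf, SECTOR_SIZE) ||
            flash_read(flash, off, rbuf, SECTOR_SIZE)) {
            printk("flash op failed at 0x%lx\n", (unsigned long)off);
            return -1;
        }

        if (memcmp(wbuf, rbuf, SECTOR_SIZE) != 0) {
            printk("readback mismatch at 0x%lx\n", (unsigned long)off);
            return -1;
        }
    }

    printk("all %d sectors verified OK\n", SECTOR_COUNT);
    return 0;
}
```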
Regarding the planned improvement: actually rolling back to a previously known good state might be acceptable. Or at least preferable to losing all the data stored in the filesystem. Would there be an option to attempt a recovery in that case?
Hmm, I suppose this could also happen due to environment noise. You don't happen to live next to an airport do you?
Or if your format is simple enough, corruption could be occurring but going unnoticed.
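For example, even a small per-record CRC would turn silent corruption into a detectable read error. A rough sketch with a made-up record layout (not tied to your actual format):

```c
/* Hypothetical TLV record with a trailing CRC so corrupted records can be
 * detected on read. The layout and CRC choice are illustrative only. */
#include <stdint.h>
#include <stddef.h>
#include <string.h>

struct tlv_hdr {
    uint8_t  type;
    uint8_t  reserved;
    uint16_t len;       /* length of the value that follows */
};

/* Bitwise CRC-32 (IEEE 802.3 polynomial, reflected). Slow but dependency-free. */
static uint32_t crc32(const uint8_t *p, size_t n)
{
    uint32_t crc = 0xffffffffu;
    while (n--) {
        crc ^= *p++;
        for (int i = 0; i < 8; i++) {
            crc = (crc >> 1) ^ (0xedb88320u & -(crc & 1u));
        }
    }
    return ~crc;
}

/* Serialize header + value + CRC into buf; returns total bytes written. */
size_t tlv_write(uint8_t *buf, uint8_t type, const void *val, uint16_t len)
{
    struct tlv_hdr hdr = { .type = type, .len = len };
    memcpy(buf, &hdr, sizeof(hdr));
    memcpy(buf + sizeof(hdr), val, len);
    uint32_t crc = crc32(buf, sizeof(hdr) + len);
    memcpy(buf + sizeof(hdr) + len, &crc, sizeof(crc));
    return sizeof(hdr) + len + sizeof(crc);
}

/* Returns 0 if the record's CRC matches, -1 if it is corrupted. */
int tlv_check(const uint8_t *buf)
{
    struct tlv_hdr hdr;
    memcpy(&hdr, buf, sizeof(hdr));
    uint32_t stored;
    memcpy(&stored, buf + sizeof(hdr) + hdr.len, sizeof(stored));
    return crc32(buf, sizeof(hdr) + hdr.len) == stored ? 0 : -1;
}
```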
> Regarding the planned improvement: actually rolling back to a previously known good state might be acceptable. Or at least preferable to losing all the data stored in the filesystem. Would there be an option to attempt a recovery in that case?
Ah, so the actual problem is a bit more complicated.
There is a window where littlefs can and will rollback to a previous valid state (this is true for any power-resilient filesystem, at least without storing a checksum externally, or destructively aggressive erasing). But since littlefs contains many logs, rolling back one log can reveal state that never existed at the same time as state in the other logs. Attempting to write in this state can result in really breaking things.
Detecting this state and halting is a bit better in that it potentially allows for external debugging / data extraction.
This is the result of littlefs being quite a bit more complicated than a circular buffer, which can be both a good and bad thing.
The API isn't finalized, but there will most likely be a way to bypass the global-checksum if you really want to.
--
The global-checksum stuff is all about detecting corruption. Recovering from corruption is another can of worms.
There are some plans coalescing around this, but they are a bit further out: metadata redundancy, data redundancy, and an optional Reed-Solomon block-device.
I've been struggling with a strange behavior for the past few weeks. I'm using a SPI NOR flash (MX25R6435F, 64Mb capacity), controlling it from an nRF52840 microcontroller, with firmware built on Zephyr. I'm using the Zephyr filesystem subsystem, but as far as I can tell the strange behavior originates in littlefs. I'm using version 2.8.1 (from this PR) of littlefs, but I had the same issue on version 2.5.0 (the one currently shipped with Zephyr).
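For reference, the volume is mounted through the Zephyr FS layer roughly as in the Zephyr littlefs sample (the storage_partition label below is the sample's default, not necessarily my exact board configuration):

```c
/* Roughly how the littlefs volume is mounted via Zephyr's FS subsystem,
 * following the Zephyr littlefs sample. The storage_partition label is an
 * assumption and may differ on other boards. */
#include <zephyr/kernel.h>
#include <zephyr/fs/fs.h>
#include <zephyr/fs/littlefs.h>
#include <zephyr/storage/flash_map.h>

FS_LITTLEFS_DECLARE_DEFAULT_CONFIG(lfs_data);

static struct fs_mount_t lfs_mount = {
    .type = FS_LITTLEFS,
    .fs_data = &lfs_data,
    .storage_dev = (void *)FIXED_PARTITION_ID(storage_partition),
    .mnt_point = "/lfs",
};

int mount_metrics_fs(void)
{
    int err = fs_mount(&lfs_mount);
    if (err < 0) {
        printk("mount failed: %d\n", err);
    }
    return err;
}
```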
Here is what I'm trying to do:
1. Collect measurements for a metric, here with temperature as the metric name.
2. Every hour:
   i. Build the target file name for the current hour as YYYYMMDD_HH00.pb ('pb' for the serialized struct with protobuf), for instance 20231201_2000.pb.
   ii. lfs_dir_read from /lfs/metrics/live/temperature/ to check for the existence of files with a different name. For each file not matching the current target file name, typically the one from the previous hour, e.g. 20231201_1900.pb, move it to /lfs/metrics/pending/temperature/ to be later read back or uploaded (see the sketch after this list).
   iii. Write the serialized data to the target file in /lfs/metrics/live/temperature/.
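To make step 2.ii concrete, here is a rough sketch against the raw littlefs API (the real code goes through the Zephyr filesystem wrappers; the helper name, buffer sizes, and hard-coded paths are purely illustrative, and it only moves the first stale file for brevity):

```c
/* Rough sketch of the hourly rotation step using the raw littlefs API.
 * Names and paths are illustrative only. */
#include <stdio.h>
#include <string.h>
#include "lfs.h"

#define LIVE_DIR    "/lfs/metrics/live/temperature"
#define PENDING_DIR "/lfs/metrics/pending/temperature"

/* Move the first regular file that is not `target` from LIVE_DIR to PENDING_DIR. */
static int rotate_live_dir(lfs_t *lfs, const char *target)
{
    lfs_dir_t dir;
    struct lfs_info info;
    char stale[LFS_NAME_MAX + 1] = "";

    int err = lfs_dir_open(lfs, &dir, LIVE_DIR);
    if (err < 0) {
        return err;
    }

    /* lfs_dir_read returns >0 while entries remain, 0 at end, <0 on error */
    while ((err = lfs_dir_read(lfs, &dir, &info)) > 0) {
        if (info.type == LFS_TYPE_REG && strcmp(info.name, target) != 0) {
            /* typically the file from the previous hour */
            strcpy(stale, info.name);
            break;
        }
    }
    lfs_dir_close(lfs, &dir);
    if (err < 0) {
        return err;
    }

    if (stale[0] != '\0') {
        char oldpath[128], newpath[128];
        snprintf(oldpath, sizeof(oldpath), "%s/%s", LIVE_DIR, stale);
        snprintf(newpath, sizeof(newpath), "%s/%s", PENDING_DIR, stale);
        /* fails with LFS_ERR_NOENT if the file was already moved */
        return lfs_rename(lfs, oldpath, newpath);
    }
    return 0;
}
```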
Most of the time this works perfectly well; however, after a few hours I run into an issue where the lfs_dir_read of step 2.ii reports the existence of an old file, e.g. 20231202_1900.pb, in the /lfs/metrics/live/temperature directory even though it was moved only minutes earlier when the hour changed to 8:00PM (20:00 in 24-hour format). When this happens, the logic described above attempts to move the file, which fails as expected since it was already moved previously.
Here are some logs showing this behavior: littlefs_issue_logs.txt.
1. lfs_read_dir directory /lfs/metrics/live/temperature/ and find 20231201_1900.pb.
2. lfs_read_dir directory /lfs/metrics/pending/temperature/ as a sanity check.
3. Move the file to /lfs/metrics/pending/temperature/ since it doesn't match the target file name 20231201_2000.pb.
4. lfs_read_dir directory /lfs/metrics/pending/temperature/ and can see that the file was moved as expected.
5. lfs_read_dir directory /lfs/metrics/live/temperature/ and can see it is empty now.
6. Write to /lfs/metrics/live/temperature/20231201_2000.pb.
7. lfs_read_dir directory /lfs/metrics/live/temperature/ and find 20231201_1900.pb again, even though we confirmed previously that this directory only contained /lfs/metrics/live/temperature/20231201_2000.pb. You can also notice that the size 3312 does not match the size 3600 from when we moved it earlier (line 4 and line 37).
8. lfs_read_dir directory /lfs/metrics/pending/temperature/, and can see that 20231201_1900.pb with size 3600 is still there where we moved it.
9. Attempt to move /lfs/metrics/live/temperature/20231201_1900.pb but it fails, as it should since it has been moved earlier.
10. lfs_read_dir directory /lfs/metrics/pending/temperature/ and can see that nothing changed, as expected.
11. lfs_read_dir directory /lfs/metrics/live/temperature/ and can see it is now reporting 20231201_2000.pb as expected.

Right now the code is a little hard to follow, but I could try to boil it down to a smaller reproducible example if necessary. Do you have any idea about what is going on here? Any suggestions as to what you'd like me to log to help troubleshoot this issue?
Thank you and best regards, Etienne