Does littlefs handle hardware metastability on interrupted write/erase?

fgrieu commented 2 years ago

With Flash and EEPROM memory, it's possible that a physically interrupted write or erase leaves a hardware memory cell in a metastable state, that is, a state in which a read returns 0 under some conditions (e.g. cold, or now) and 1 under others (e.g. hot, or some days later). If you ask hard enough, manufacturers of serial Flash will end up acknowledging that, and the reliability assurances they give assume that erase and write cycles have not been interrupted by power loss.

The consequence for a file system is that on reset, it's not enough that some data reads fine to conclude it will read fine next time.

I know at least three ways to handle the issue:

1) Ignore it, reasoning that it's rare, to the point that it's non-trivial to prove experimentally that it can reach the application level.
2) Use a small area of memory without this metastability issue (e.g. battery-backed RAM) to handle recovery on power loss.
3) Handle it with Flash only. There are ways to exploit the fact that Flash physically allows an overwrite (as long as it's with the original data) to at least handle interrupted writes of critical flags, and to build on top of that to handle interrupted erase.

I wonder what littlefs currently does w.r.t. that issue.

slorquet commented 2 years ago

Hi,

I am working with fgrieu on this topic.

We can add the following details: littlefs already has protection to detect interrupted writes by using a CRC but that is not enough.

The reason is that a flash write can be read back correctly once, and can then go bad later! In particular, the CRC itself can be partially written and this can go undetected.

It does not happen in the same way on all flash circuits, because the actual bit-per-bit write behaviour is usually totally undocumented. Even so, the bit currently being written (or more than one) can be in an intermediate and undetermined state that is neither zero nor one, but will oscillate according to environmental conditions.

Solution (2) mentioned above is the easiest to implement: Use some NVRAM as a guard.

With this, the flash layout of littlefs does not need to be changed, but there is a dependency on some non-volatile RAM mechanism of the host. This can be abstracted with a code callback and disabled by default. This can be considered an optional feature; not all systems would want this.

Details on solution (2) follow.

The NVRAM storage required is 32 bits for the CRC and storage for a physical recovery flash address.

The NVRAM must be actual battery-backed SRAM.

At system boot the NVRAM is checked. If the recovery address is set to some valid value, this means that the write has been previously interrupted. In that case, the flash write of the CRC value is restarted, then the recovery address is set to "none". This operation confirms the write by making sure that all bits of the CRC are in a stable state. This is possible because all serial NOR flashes allow an overwrite of the same data without a prior erase.

Before writing any metadata CRC, this CRC has to be saved in NVRAM, along with the address where it shall be written. Then the actual CRC flash write is done. Finally, the address in NVRAM is set to "none".

In any nominal situation this is transparent, but in case of write interruption on the metadata CRC, the write can be restarted at next boot, which stabilizes potentially unstable bits to their intended states.
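The guard sequence described above can be sketched in C as follows. This is only an illustration under stated assumptions: the flash is simulated by a RAM array whose programming can only clear bits, the NVRAM is a plain struct, and every name (`guarded_crc_write`, `boot_recover`, `struct nvram_guard`, `GUARD_NONE`) is hypothetical and not part of littlefs. The simulation models a partially programmed CRC, not the analog metastability itself.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define FLASH_SIZE 64u
#define GUARD_NONE UINT32_MAX /* "none" marker for the recovery address */

/* Simulated NOR flash: erased cells read 0xFF, programming only clears bits. */
static uint8_t flash[FLASH_SIZE];

/* Simulated battery-backed SRAM record (hypothetical layout). */
struct nvram_guard {
    uint32_t crc;  /* CRC value intended for flash */
    uint32_t addr; /* flash address of that CRC, or GUARD_NONE */
};
static struct nvram_guard nvram = { 0, GUARD_NONE };

static void flash_erase_all(void) { memset(flash, 0xFF, sizeof flash); }

/* Programming ANDs into the cells, so re-programming the same data is harmless. */
static void flash_prog(uint32_t addr, const uint8_t *buf, uint32_t len) {
    for (uint32_t i = 0; i < len && addr + i < FLASH_SIZE; i++) {
        flash[addr + i] &= buf[i];
    }
}

static uint32_t flash_read_u32(uint32_t addr) {
    return (uint32_t)flash[addr] | (uint32_t)flash[addr + 1] << 8 |
           (uint32_t)flash[addr + 2] << 16 | (uint32_t)flash[addr + 3] << 24;
}

static void crc_bytes(uint32_t crc, uint8_t out[4]) {
    for (int i = 0; i < 4; i++) out[i] = (uint8_t)(crc >> (8 * i));
}

/* Write a CRC under NVRAM guard. bytes_done < 4 simulates a power loss
 * after that many bytes reached the flash. */
static void guarded_crc_write(uint32_t addr, uint32_t crc, uint32_t bytes_done) {
    uint8_t buf[4];
    crc_bytes(crc, buf);
    nvram.crc = crc;   /* save the value first ... */
    nvram.addr = addr; /* ... then arm the guard with the address */
    flash_prog(addr, buf, bytes_done < 4 ? bytes_done : 4);
    if (bytes_done < 4) return; /* simulated power loss: guard stays armed */
    nvram.addr = GUARD_NONE;    /* write confirmed */
}

/* At boot: if the guard is armed, redo the CRC write, then disarm it. */
static bool boot_recover(void) {
    if (nvram.addr == GUARD_NONE) return false;
    uint8_t buf[4];
    crc_bytes(nvram.crc, buf);
    flash_prog(nvram.addr, buf, 4); /* overwrite with the same data, no erase */
    nvram.addr = GUARD_NONE;
    return true;
}
```

Storing the CRC value before the address is a choice made for this sketch, so that an armed address always refers to a complete CRC value in NVRAM.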

Again it is important to note that reading a good CRC from flash is not sufficient, because it might be just good for a short while if the flash bits are unstable.

--

Solution (3) (use flash only) is the most robust and does not have any host dependency, but it requires changing the littlefs flash layout. A "confirmation flag" can be written sequentially after the CRC write. This flag, when written (even partially) confirms that the previous CRC writing has completed. Similar recovery at boot can be applied by reading both the CRC and the flag.
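A minimal C sketch of the confirmation-flag idea, under assumptions that are not in the proposal itself: the trailer layout (4 CRC bytes followed by 1 flag byte), the flag values, and all names (`struct cf_trailer`, `cf_boot_check`, ...) are hypothetical. What the filesystem does with an unconfirmed commit is left out; the sketch only shows that any cleared bit in the flag counts as confirmation and that the flag is then re-programmed to stabilize it.

```c
#include <stdint.h>
#include <string.h>

#define CF_ERASED 0xFFu /* flag byte as left by an erase */
#define CF_SET    0x00u /* flag byte once fully programmed */

enum cf_state { CF_UNCONFIRMED, CF_CONFIRMED };

/* Hypothetical commit trailer: the CRC, then the confirmation flag. */
struct cf_trailer {
    uint8_t crc[4];
    uint8_t flag;
};

static void cf_erase(struct cf_trailer *t) { memset(t, 0xFF, sizeof *t); }

/* NOR-style programming: bits can only go from 1 to 0. */
static void cf_prog(uint8_t *cell, uint8_t value) { *cell &= value; }

static void cf_write_crc(struct cf_trailer *t, uint32_t crc) {
    for (int i = 0; i < 4; i++) cf_prog(&t->crc[i], (uint8_t)(crc >> (8 * i)));
}

/* Issued only after the CRC write has completed. */
static void cf_write_flag(struct cf_trailer *t) { cf_prog(&t->flag, CF_SET); }

/* Boot-time check: a flag with any cleared bit proves the CRC write had
 * finished, so the flag is programmed again to bring all its bits to a
 * stable state. A fully erased flag means the CRC cannot be trusted, even
 * if it currently reads as valid. */
static enum cf_state cf_boot_check(struct cf_trailer *t) {
    if (t->flag == CF_ERASED) return CF_UNCONFIRMED;
    cf_prog(&t->flag, CF_SET);
    return CF_CONFIRMED;
}
```

For example, a trailer whose flag reads 0xF7 after a power loss is reported as confirmed and the flag is completed to 0x00, while a trailer whose flag still reads 0xFF is reported as unconfirmed.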

I hope this issue is of some interest for your team. littlefs already does a great deal for embedded flash reliability, and we hope we can make it even better.

We have started this conversation to get your ideas about this question.

geky commented 2 years ago

Hi @fgrieu, @slorquet, this is quite an interesting problem, thanks for creating an issue about it.

It's worth noting there's a few other ideas being explored around how to improve littlefs's resilience, most notably the addition of a "global CRC" that gives you an additional checksum of the entire filesystem. I don't know if this would help with this issue though, since it would only be a hard error if the checksum mismatched and not provide recovery. I can go into more detail on how this would work if there's interest.

Some initial thoughts:

Solution (2) isn't preferred because of the extra hardware requirement. Requiring NVRAM would be an unfortunate outcome.

That being said there's no reason littlefs can't provide the hooks necessary to make this work on platforms where NVRAM is available.

For solution (3), littlefs has generally avoided a second "confirmation" write since this doesn't work/is expensive on storage with large program sizes (this could be other storage such as SD/eMMC, or could be due to block-device layers adding ECC or encryption).

I wonder if this could be handled by another unrelated mechanism I've been looking at.

littlefs stores metadata in a number of small, 2-block logs. The reason for 2 blocks is so that one block can be erased while the other retains the original data. But when a block isn't being erased, the second block mostly sits there unused. I've been considering instead keeping a redundant copy of the metadata in the second block, such that any metadata update needs to write to both blocks in lock-step. This would increase the cost of metadata writes by 2x, but provides a form of error-correction in case one of the metadata blocks is corrupted. It also has a few other small benefits, such as making it easier to detect unknown littlefs images, with block 0 always containing a superblock outside of power-loss.

In the case of a metastability-event:

One metadata-block is metastable, one is incomplete - littlefs will see they don't match and rewrite the second metadata block to correct this.
One metadata-block is stable, one is metastable - littlefs will not detect this, but if the metastable block fails its CRC later littlefs will be able to correct it with the data on the first metadata-block.

Thoughts?

It does mean we would be intentionally leaving potentially metastable data on disk, but there would always be at least one copy that is stable.
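The two cases above can be sketched as a reconciliation step over a mirrored pair. This is an illustration, not littlefs code: the block layout, the generic CRC-32, and all names (`struct md_block`, `md_reconcile`, ...) are hypothetical, and the sketch assumes block A is always programmed before block B, so a valid A is never older than B. Repairs are shown as a plain copy; on real flash they would go through an erase+program cycle.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MD_SIZE 16u

/* Hypothetical metadata block: payload followed by its checksum. */
struct md_block {
    uint8_t data[MD_SIZE];
    uint32_t crc;
};

/* Generic bitwise CRC-32 (reflected polynomial 0xEDB88320). */
static uint32_t md_crc32(const uint8_t *p, size_t n) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < n; i++) {
        crc ^= p[i];
        for (int k = 0; k < 8; k++) {
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

static bool md_valid(const struct md_block *b) {
    return md_crc32(b->data, MD_SIZE) == b->crc;
}

static void md_commit(struct md_block *b, const uint8_t *data) {
    memcpy(b->data, data, MD_SIZE);
    b->crc = md_crc32(b->data, MD_SIZE);
}

enum md_result { MD_OK, MD_REPAIRED_A, MD_REPAIRED_B, MD_LOST };

/* Compare the two copies and repair the pair from whichever copy is valid. */
static enum md_result md_reconcile(struct md_block *a, struct md_block *b) {
    bool va = md_valid(a), vb = md_valid(b);
    if (va && vb && memcmp(a->data, b->data, MD_SIZE) == 0) return MD_OK;
    if (va) { *b = *a; return MD_REPAIRED_B; } /* B incomplete or failed later */
    if (vb) { *a = *b; return MD_REPAIRED_A; } /* A failed its CRC later */
    return MD_LOST; /* both copies bad: nothing to recover from */
}
```

Running `md_reconcile` at mount covers the "one incomplete" case, and running it again whenever a CRC failure is seen covers a metastable copy that goes bad later.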

> This operation confirms the write by making sure that all bits of the CRC are in a stable state. This is possible because all serial NOR flashes allow an overwrite of the same data without a prior erase.

This isn't true for all flash though, NAND flash specifically does not allow or has limited overwrite since it can perturb neighboring pages. I've also seen overwriting prohibited for device's internal flash, though I never learned why.

Fortunately this can easily be handled by just moving the data to a new block with a full erase+program cycle. This is already done if a bad-block is detected.

slorquet commented 2 years ago

Thank you for this reply and your interest in the subject.

We are reading your response very carefully and will get back to you soon with elaborated comments.

It is very clear that this level of integrity assurance would be an optional feature one way or another, because flash is so complex and so diverse.

slorquet commented 2 years ago

You are correct that confirmation writes are very wasteful on flashes that have ECC, encryption and/or large program size, this is an issue we have encountered on some specific proprietary platforms in the past.

But just as it's possible to have optional NVRAM hooks for platforms where NVRAM is available, it would be possible to have optional confirmation writes for platforms where the operation is possible without awful compromises.

tim-nordell-nimbelink commented 2 years ago

@geky -

> This isn't true for all flash though, NAND flash specifically does not allow or has limited overwrite since it can perturb neighboring pages. I've also seen overwriting prohibited for device's internal flash, though I never learned why.

For some of the STM32 microcontrollers (like the STM32L4xx series), for every 64-bits written they write out 8 ECC bits as well supporting 1 bit error detection and correction, with 2 bit error detection. Since the ECC bits are automatically computed for each 64-bits, they don't let you set individual bits to 0 after the fact since that'd break the ECC computation. (They do support setting all to 0 after the fact, however, as a special case; I suspect they engineered the 8-bits of ECC to be all 0 with the double word set to all 0s.)

ZanoZ commented 1 year ago

What an interesting topic. I think it is better not to rely on hardware functions (provided by the chip), because that leads to limitations. I suggest using two files (one new, one backup, updated alternately, each protected with ECC/CRC/CRC32, etc.) to record the data. When reading, check both of them, and get the correct result by discarding the wrong one.

geky commented 1 year ago

Sorry about the late response, I haven't been able to make it back to this issue yet though I realize it's an important issue.

> For some of the STM32 microcontrollers (like the STM32L4xx series), for every 64-bits written they write out 8 ECC bits as well supporting 1 bit error detection and correction, with 2 bit error detection. Since the ECC bits are automatically computed for each 64-bits, they don't let you set individual bits to 0 after the fact since that'd break the ECC computation. (They do support setting all to 0 after the fact, however, as a special case; I suspect they engineered the 8-bits of ECC to be all 0 with the double word set to all 0s.)

I just wanted to say thanks, @tim-nordell-nimbelink, for sharing this; this was exactly the info I was looking for. It makes a lot of sense that complications from ECC bits would interfere with masking writes.

geky commented 1 year ago

I just wanted to write up what my current thoughts are on this, and what I think the next steps will be, in case it is helpful to anyone. The problem of metastability and error-correction in general has been in the back of my mind, but balancing time with other issues I haven't been able to implement anything yet.

These are just in the idea/design phase. I've been focusing on getting a design correct in the long-term, which I realize may be unhelpful in the short-term.

Current plans:

1) Add a filesystem-level checksum (global-crc, gcrc) that can be easily read through a public API after mount and any write operation.

   Alone this is not sufficient for the above NVRAM proposal. The checksum alone is not enough information for littlefs to know how to recover, but it could be stored in NVRAM to at least detect errors littlefs can't recover from/warn users/log/etc.

2) Add the above metadata redundancy scheme, where both metadata blocks contain a copy in normal conditions.

   Most likely optional, but defaulting to enabled.

3) Extend lfs_fs_mkconsistent to let you force littlefs to replace any bad metadata blocks on demand.

With the filesystem-level checksum and a colluding block-device, I think it would be possible to implement the above NVRAM recovery scheme:

At the beginning of a write, the block-device writes the address and current checksum to NVRAM.
After mount, the system compares the checksum against what is stored in NVRAM. If they don't match, the system should call lfs_fs_mkconsistent while the block-device reports the address saved in NVRAM as a bad-block. This will make littlefs replace the failed write with a new metadata block.
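A rough C simulation of the block-device side of these two steps. Everything here is hypothetical: the names (`nvg_write_begin`, `nvg_boot_check`, ...), the error code, and the record layout are invented for the sketch, and the filesystem-level checksum is just a number passed in, since no such API exists yet. One detail is an added assumption, not part of the description above: the record is disarmed once a write completes, so only an interrupted write leaves an address behind.

```c
#include <stdbool.h>
#include <stdint.h>

#define NVG_NONE        UINT32_MAX /* "no write in flight" / "no bad block" */
#define NVG_ERR_CORRUPT (-1)       /* hypothetical block-device error code */

/* Simulated NVRAM record kept by the block device. */
struct nvg_record {
    uint32_t block; /* block being written, or NVG_NONE */
    uint32_t gcrc;  /* filesystem checksum when the write began */
};
static struct nvg_record nvg = { NVG_NONE, 0 };
static uint32_t nvg_bad_block = NVG_NONE;

/* Called by the block device at the beginning of a write. */
static void nvg_write_begin(uint32_t block, uint32_t gcrc_now) {
    nvg.gcrc = gcrc_now;
    nvg.block = block;
}

/* Assumption of this sketch: disarm the record when the write completes. */
static void nvg_write_end(void) { nvg.block = NVG_NONE; }

/* Block-device read hook: the suspect block is reported as corrupt. */
static int nvg_read(uint32_t block) {
    return block == nvg_bad_block ? NVG_ERR_CORRUPT : 0;
}

/* After mount: returns true when the caller should run the repair pass
 * (for littlefs, this is where lfs_fs_mkconsistent would be called). */
static bool nvg_boot_check(uint32_t gcrc_after_mount) {
    if (nvg.block == NVG_NONE) return false;         /* no interrupted write */
    if (nvg.gcrc == gcrc_after_mount) return false;  /* checksums match */
    nvg_bad_block = nvg.block; /* make reads of that block fail */
    return true;
}

/* Called once the repair pass has moved the data to a new block. */
static void nvg_repair_done(void) {
    nvg_bad_block = NVG_NONE;
    nvg.block = NVG_NONE;
}
```

The user-side flow would then be: mount, read the checksum, call `nvg_boot_check`, run the repair pass if it returns true, and finally call `nvg_repair_done`.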

Though I don't know if this is an acceptable solution, since it moves a lot of the work to the user.

But my current thoughts are if we can treat metastability the same as other forms of data-loss/bitrot, that's a win for a simpler filesystem.

littlefs-project / littlefs

Does littlefs handle hardware metastability on interrupted write/erase? #671