
Questions about SPI NAND with ECC #727


Christian-Sander commented 1 year ago

Hello,

I'm currently looking into writing an implementation for a Winbond W25N01GV NAND flash for ST MCUs over QSPI. As I've seen there have been some others which have also done this (any source available?). Originally I wanted to use AzureRTOS FileX+LevelX but it seems that LevelX has architectural problems with supporting ECC.

I would like to use the hardware ECC feature of the memory, so I have some questions around that topic:

  1. How do you treat read errors where the ECC can correct the bits? In my opinion the block should be marked as bad and be rewritten somewhere else before uncorrectable data loss occurs. Is this handled somehow? I think I have read in another issue that returning LFS_ERR_CORRUPT would just be forwarded to the user? Should the user rewrite the whole file then? This seems a bit wasteful if the file spans multiple blocks.
  2. The Winbond flash supports partial programming of four 512-byte sectors inside a 2 kB page. The hardware ECC is calculated separately for these sectors. I'm afraid the bad block marking would then only affect part of the page? Are there any issues to be expected around this topic? Should I prefer to use the whole 2 kB page as the programming size?
  3. In another issue I read concerns regarding the validity of the superblocks. Are they stored in block 0 and 1? As far as I know Winbond only guarantees the availability of the first block. What about using more than one file system on the flash? The memory has an internal look up table for bad block management and remapping. Maybe that could be used somehow? Any other ideas?
geky commented 1 year ago

Hi @Christian-Sander, I don't have all the answers but I can give some hopefully helpful comments:

  1. How do you treat read errors where the ECC can correct the bits? [...]

This is not handled well by LittleFS at the moment. LittleFS tries to completely isolate reading and writing operations, so reading operations can not write. This avoids complicated surprises such as ENOMEM on reading a file, but creates problems for how to respond to read errors. At the moment if the block device returns LFS_ERR_CORRUPT on a read, LittleFS more-or-less just gives up and reports the error when it receives it.

The best option right now, if possible, is to wait and report the error on the next erase operation, either by tracking the last read error in RAM or, if the error is reproducible, by checking for read errors before erasing. Errors during erasing/programming force the data to be relocated to a better block. (It's also worth noting that LittleFS doesn't currently persist the knowledge of this failure, though the path to improve that is clearer.)

Unfortunately this wouldn't cover cases where read errors need to be fixed ASAP due to decreased persistence.

This is something I'm open to ideas on. Maybe a separate routine to explicitly check for read errors would help and could complement explicit garbage-collection routines.
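
To make the deferral idea above concrete, here's a rough sketch. The W25N_* result codes and w25n_* functions are hypothetical driver pieces, not littlefs APIs, and the 1024-block count is an assumption for this part:

```c
// Sketch only: defer corrected-ECC read errors until the next erase, so
// littlefs relocates the data instead of failing the read. The W25N_* codes
// and w25n_* functions are hypothetical driver pieces, not littlefs APIs.
#include <stdbool.h>
#include "lfs.h"

#define BLOCK_COUNT 1024  // assumed device geometry

enum { W25N_ECC_OK = 0, W25N_ECC_CORRECTED, W25N_ECC_UNCORRECTABLE };
extern int w25n_read(lfs_block_t block, lfs_off_t off, void *buf, lfs_size_t size);
extern int w25n_erase(lfs_block_t block);

// blocks that returned corrected ECC errors; RAM only, lost on reboot
static bool ecc_warned[BLOCK_COUNT];

int bd_read(const struct lfs_config *c, lfs_block_t block,
            lfs_off_t off, void *buffer, lfs_size_t size) {
    int res = w25n_read(block, off, buffer, size);
    if (res == W25N_ECC_CORRECTED) {
        ecc_warned[block] = true;  // data is still good, remember for later
        return 0;
    }
    return (res == W25N_ECC_UNCORRECTABLE) ? LFS_ERR_CORRUPT : res;
}

int bd_erase(const struct lfs_config *c, lfs_block_t block) {
    if (ecc_warned[block]) {
        // failing the erase makes littlefs pick a different block; littlefs
        // won't remember this, so the block may be tried again later
        return LFS_ERR_CORRUPT;
    }
    return w25n_erase(block);
}
```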

  2. The Winbond flash supports partial programming of four 512-byte sectors inside a 2 kB page. [...]

I would try to use the smaller program size as that would give you a smaller write granularity and likely improve the overall lifetime of the device.

If LittleFS runs into any program errors it will evict the whole erase block. This is usually what you want, since erasing is the destructive operation; programming does no damage on its own. So if one sub-page starts reporting errors, the others are likely close to their end-of-life.
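
For reference, here's roughly what the configuration could look like with the smaller program size, assuming the usual W25N01GV geometry (2 KiB pages with four 512-byte ECC sectors, 64 pages per 128 KiB block, 1024 blocks) and that the bd_* callbacks are your QSPI glue. Treat the numbers as a starting point, not a recommendation:

```c
#include "lfs.h"

// block device callbacks assumed to be implemented against the QSPI driver
extern int bd_read(const struct lfs_config *c, lfs_block_t block,
                   lfs_off_t off, void *buffer, lfs_size_t size);
extern int bd_prog(const struct lfs_config *c, lfs_block_t block,
                   lfs_off_t off, const void *buffer, lfs_size_t size);
extern int bd_erase(const struct lfs_config *c, lfs_block_t block);
extern int bd_sync(const struct lfs_config *c);

const struct lfs_config cfg = {
    .read  = bd_read,
    .prog  = bd_prog,
    .erase = bd_erase,
    .sync  = bd_sync,

    .read_size      = 512,        // one ECC sector
    .prog_size      = 512,        // partial page program, per-sector ECC
    .block_size     = 64 * 2048,  // 64 pages of 2 KiB = 128 KiB erase block
    .block_count    = 1024,       // W25N01GV: 1 Gbit total
    .cache_size     = 2048,       // multiple of prog_size, here one full page
    .lookahead_size = 128,        // multiple of 8 bytes
    .block_cycles   = 100,        // wear-leveling interval
};
```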

  3. In another issue I read concerns regarding the validity of the superblocks. Are they stored in block 0 and 1?

Yes, LittleFS uses blocks 0 and 1 as a sort of "anchor" to find the rest of the filesystem.

As far as I know Winbond only guarantees the availability of the first block.

Ah, this is a really fun question if I'm understanding it right. It's actually not possible to persist data through power-loss with only a single erase block. At some point you need to erase that block to rewrite the data, and if you lose power while doing so the entire filesystem will be lost. You need at minimum 2 blocks to survive any power-loss, and this is true for any other updatable filesystem/kvstore out there.

Winbond is likely being conservative with their specs, but let me know if you find a 1-block NAND chip because I think that would be hilariously impractical.

That being said a 1-block disk could be used as write-once memory. In that case you could prepare a LittleFS image offline using littlefs-fuse or mklittlefs or use a RAM block device as an intermediary to write the disk.

If you configure the filesystem's block size to half of the single erase block, it would be possible to mount read-only with LittleFS, as read-only LittleFS never calls erase. This is a fringe use-case, but could be useful if you already use LittleFS elsewhere in your codebase.
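
A rough sketch of that read-only corner case, assuming the image was created offline with block_size set to half the physical erase block and that the build defines LFS_READONLY so littlefs never programs or erases; the sizes are illustrative:

```c
// Sketch of the read-only corner case above. Assumes the image was created
// offline with block_size = half the physical erase block, and that the build
// defines LFS_READONLY so the prog/erase stubs below are never reached.
#include "lfs.h"

#define ERASE_BLOCK (128 * 1024)  // physical erase block, example value

extern int bd_read(const struct lfs_config *c, lfs_block_t block,
                   lfs_off_t off, void *buffer, lfs_size_t size);

// stubs that should never be called in a read-only build
static int bd_noprog(const struct lfs_config *c, lfs_block_t block,
                     lfs_off_t off, const void *buffer, lfs_size_t size) {
    return LFS_ERR_IO;
}
static int bd_noerase(const struct lfs_config *c, lfs_block_t block) {
    return LFS_ERR_IO;
}
static int bd_nosync(const struct lfs_config *c) { return 0; }

static const struct lfs_config ro_cfg = {
    .read  = bd_read,
    .prog  = bd_noprog,
    .erase = bd_noerase,
    .sync  = bd_nosync,

    .read_size      = 512,
    .prog_size      = 512,
    .block_size     = ERASE_BLOCK / 2,  // two littlefs blocks per erase block
    .block_count    = 2,
    .cache_size     = 512,
    .lookahead_size = 8,
    .block_cycles   = -1,               // no wear leveling needed read-only
};

int mount_image_readonly(lfs_t *lfs) {
    return lfs_mount(lfs, &ro_cfg);     // read-only littlefs never calls erase
}
```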

What about using more than one file system on the flash?

This is not uncommon, but LittleFS doesn't handle this itself, as it's not the system in control in this configuration. I think the most common solution is adding an MBR to the block device. This sacrifices the first block as a table that describes the partitions on the block device.

Mbed-OS contains an implementation which might be a useful reference: https://github.com/ARMmbed/mbed-os/blob/master/storage/blockdevice/source/MBRBlockDevice.cpp

Or you might be able to implement your own just based on the Wikipedia description: https://en.wikipedia.org/wiki/Master_boot_record

MBR is nice in that it's relatively simple and extremely common. Any PC OS has utilities to manipulate MBR partitions.

If you want to sneak partition information into the device's bad block table, that could also work; you would just need to make sure the information is passed to the block device in such a way that LittleFS sees block "0" as the first block it should use.
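
Not a full MBR parser, but a minimal sketch of the partitioning idea: a wrapper block device that adds a fixed block offset, so each littlefs instance sees its partition starting at block 0. The raw_* functions and the partition struct are assumptions here, with the partition table read from the MBR (or elsewhere) at startup:

```c
// Sketch of a partitioned block device: littlefs sees blocks 0..count-1,
// the wrapper translates them to absolute blocks on the raw device.
// raw_read/raw_prog/raw_erase are assumed lower-level driver functions.
#include "lfs.h"

struct partition {
    lfs_block_t start;  // first physical block of this partition
    lfs_block_t count;  // number of blocks in this partition
};

extern int raw_read(lfs_block_t block, lfs_off_t off, void *buffer, lfs_size_t size);
extern int raw_prog(lfs_block_t block, lfs_off_t off, const void *buffer, lfs_size_t size);
extern int raw_erase(lfs_block_t block);

static int part_read(const struct lfs_config *c, lfs_block_t block,
                     lfs_off_t off, void *buffer, lfs_size_t size) {
    const struct partition *p = c->context;
    return raw_read(p->start + block, off, buffer, size);
}

static int part_prog(const struct lfs_config *c, lfs_block_t block,
                     lfs_off_t off, const void *buffer, lfs_size_t size) {
    const struct partition *p = c->context;
    return raw_prog(p->start + block, off, buffer, size);
}

static int part_erase(const struct lfs_config *c, lfs_block_t block) {
    const struct partition *p = c->context;
    return raw_erase(p->start + block);
}

// each filesystem gets its own lfs_config with .context pointing at its
// partition and .block_count set to that partition's count
```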

Christian-Sander commented 1 year ago

Thanks for your detailed response. I will look into this in the next few weeks after I finish some other tasks. So far I have a running NAND filesystem implementation with FileX/LevelX but it is too slow for my requirements and ECC support is problematic. Maybe there are problems with my code. I will revisit that once the next update with better NAND support is released. In the meantime I would like to try littlefs.

This is something I'm open to ideas on. Maybe a separate routine to explicitly check for read errors would help and could complement explicit garbage-collection routines.

Maybe it could be possible to report the first ECC error back to the application in the read call? Then the application could decide to call a function to fix it?

Ah, this is a really fun question if I'm understanding it right. It's actually not possible to persist data through power-loss with only a single erase block. At some point you need to erase that block to rewrite the data, and if you lose power while doing so the entire filesystem will be lost. You need at minimum 2 blocks to survive any power-loss, and this is true for any other updatable filesystem/kvstore out there.

Winbond is likely being conservative with their specs, but let me know if you find a 1-block NAND chip because I think that would be hilariously impractical.

That being said a 1-block disk could be used as write-once memory. In that case you could prepare a LittleFS image offline using littlefs-fuse or mklittlefs or use a RAM block device as an intermediary to write the disk.

If you configure the filesystem's block size to half of the single erase block, it would be possible to mount read-only with LittleFS, as read-only LittleFS never calls erase. This is a fringe use-case, but could be useful if you already use LittleFS elsewhere in your codebase.

I think you may have misunderstood me somewhat. I meant that Winbond guarantees that the first block is definitely a good block, not that only one block in total is good. They have the usual 2% error rate afaik, but guarantee that the first block is a good one. So the question was related to what would happen if the second block (block 1) is a bad one. I could handle that in the driver if needed, as the flash has a lookup table to remap bad blocks to known good blocks.

geky commented 1 year ago

Maybe it could be possible to report the first ECC error back to the application in the read call? Then the application could decide to call a function to fix it?

This is sort of possible: if your block device implementation reports an error, e.g. LFS_ERR_CORRUPT, LittleFS halts and returns the error unmodified through whichever function is currently executing.

The unfortunate thing is this doesn't give you access to the data that is there, so the options for recovering would be a bit limited.

I suppose you could keep track of the most recent n read errors out-of-band of LittleFS. That might be one way to recover from a read error without losing data.
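
As a sketch of what I mean by out-of-band tracking (the names here are made up, nothing in littlefs provides this):

```c
// Sketch: remember the last few read errors out-of-band, so the application
// can inspect them after a littlefs call fails with LFS_ERR_CORRUPT. Nothing
// here is a littlefs API; it lives entirely in the block device glue.
#include "lfs.h"

#define ERR_LOG_LEN 4

struct read_err { lfs_block_t block; lfs_off_t off; };

static struct read_err err_log[ERR_LOG_LEN];
static unsigned err_count;

// call from the bd read callback just before returning LFS_ERR_CORRUPT
void log_read_error(lfs_block_t block, lfs_off_t off) {
    err_log[err_count % ERR_LOG_LEN] = (struct read_err){block, off};
    err_count++;
}
```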

I think you may have misunderstood me somewhat. I meant that Winbond guarantees that the first block is definitely a good block, not that only one block in total is good. They have the usual 2% error rate afaik, but guarantee that the first block is a good one. So the question was related to what would happen if the second block (block 1) is a bad one. I could handle that in the driver if needed, as the flash has a lookup table to remap bad blocks to known good blocks.

Ah, that makes sense. I was wondering if I misunderstood while writing the comment.

Hmm, that's a tricky problem. The current approach in LittleFS is to exponentially back off writes to blocks {0,1}. If one of these blocks is bad there's not really a way to recover, as it's difficult to find the filesystem during mount.

I was looking into making lfs_mount more flexible recently (#349), I suppose it could be possible to allow the superblocks to live in multiple places that could all be searched during mount. I'll have to think about this.

For now the best option is either:

  1. Remap block 1 to a good block as you mentioned.
  2. Find the first 2 sequentially good blocks and use that as the start of littlefs (see the sketch below).
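
A sketch of option 2, assuming a hypothetical is_factory_bad() helper that checks the manufacturer's bad-block marker for a block:

```c
// Sketch of option 2: scan for the first two adjacent factory-good blocks and
// start the littlefs region there. is_factory_bad() is a hypothetical helper
// that checks the manufacturer's bad-block marker for a block.
#include <stdbool.h>
#include "lfs.h"

extern bool is_factory_bad(lfs_block_t block);

// returns the first block b such that b and b+1 are both good, or -1 if none
int find_anchor_blocks(lfs_block_t total_blocks) {
    for (lfs_block_t b = 0; b + 1 < total_blocks; b++) {
        if (!is_factory_bad(b) && !is_factory_bad(b + 1)) {
            return (int)b;
        }
    }
    return -1;
}

// the result can be used as a fixed block offset in the block device glue, so
// that littlefs's blocks {0,1} land on two known-good physical blocks
```
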
Christian-Sander commented 1 year ago

I'm currently implementing the glue logic between lfs and my device driver. In a very first attempt without ECC or write cache I saw a 6x speedup compared to FileX/LevelX when writing two ~1.7 MB files to two different partitions (about 5 seconds). I haven't tested performance with writing smaller files or multiple writes to a file yet but I'm pretty happy with the performance so far.

Regarding the ECC during read operations, I now reserve 20 blocks (=bad block lookup table size) at the end of the flash as replacement blocks that are not used by any partition directly. On correctable ECC errors the block is copied to one of the replacement blocks and then an entry to the bad block lookup table is added, so further reads will read the replacement block instead. I think this solves the issue discussed above.
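
Roughly, the remapping in my driver looks like this (a simplified sketch; in the real driver the table is also written into the chip's bad block LUT):

```c
// Simplified sketch of the remapping described above: a small table redirects
// accesses from a retired block to one of the 20 reserved replacement blocks.
#include "lfs.h"

#define LUT_SIZE 20

struct remap_entry { lfs_block_t bad; lfs_block_t replacement; };

static struct remap_entry lut[LUT_SIZE];
static unsigned lut_used;

static lfs_block_t remap(lfs_block_t block) {
    for (unsigned i = 0; i < lut_used; i++) {
        if (lut[i].bad == block) {
            return lut[i].replacement;
        }
    }
    return block;
}

// every bd callback translates the block number first, e.g.:
//   return raw_read(remap(block), off, buffer, size);
```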

However, I'm still a bit unsure about the handling of the superblocks and the reliability. I could try to check for bad superblocks before calling lfs_mount and add them to the lookup table if either superblock is bad. Would that be a solution for this? What happens when the superblock becomes bad during operation? Right now I simply return LFS_ERR_CORRUPT and don't handle this through the lookup table.

Also for my understanding: Do you need to erase/write the superblock on each file system write? Wouldn't you get much more wear on these blocks than on the other blocks then? I think I'm probably missing something here.

Christian-Sander commented 1 year ago

Regarding the last question, I've seen that this doesn't happen, so please ignore that question.

I've seen that there are many reads from the first four bytes of the superblocks. Could those be cached by littlefs? Might that be a good optimization?

I'm also seeing that iterating over files in a directory and getting their sizes involves a lot of reads and is thus relatively slow. I suppose it's a design limitation of a wear-leveling filesystem that you need to read all pages of a file to get its size. Would it be a good idea to cache directory contents and sizes at the application level then?

Christian-Sander commented 1 year ago

And one more: littlefs validates the data written to the flash by reading it back and comparing it. If the flash can do this internally, would it be reasonable to disable this validation? I think this may also give further speedups. Or are you doing this to detect transmission errors on the signal lines to the flash?

geky commented 1 year ago

What happens when the superblock becomes bad during operation? Right now I simply return LFS_ERR_CORRUPT and don't handle this through the lookup table.

Unfortunately this is a difficult situation for littlefs. If this happens littlefs locks up and becomes read-only. littlefs just tries its best to avoid this situation by writing to the superblocks as little as possible, though it may not appear this way at first.

Also for my understanding: Do you need to erase/write the superblock on each file system write? Wouldn't you get much more wear on these blocks than on the other blocks then? I think I'm probably missing something here.

It may be worth understanding how the superblock anchor works (re-skimming DESIGN.md, it looks like this is not well documented; also, sorry if the names are confusing, as at some point the name "superblock" sort of lost its original meaning).

littlefs actually has a linked-list of superblocks, each containing a copy of some read-only info, and a pointer to the next superblock. The last superblock also contains the root directory, which is the root of the directory tree. Writes anywhere in the filesystem can propagate up the tree + linked-list as a part of wear-leveling, at a rate controlled by the block_cycles configuration.

To keep writes to the anchor superblock as low as possible, littlefs increases the length of the superblock linked-list every block_cycles writes, which is when littlefs would normally try to wear-level the superblock metadata-pair. This results in exponentially fewer writes to the superblock anchor as it is written to.

So say you set block_cycles=100. Initially the superblock linked-list looks like this:

.--------.  1 root write = 1 anchor write
| super  |  100 writes until expansion
| block  |
| + root |
'--------'

After 100 anchor writes, the superblock linked-list gets expanded:

.--------.  .--------.  100 root writes = 1 anchor write
| super  |->| super  |  100*100 = 10,000 writes until expansion
| block  |  | block  |
|        |  | + root |
'--------'  '--------'

And then after another 100 anchor writes (10,000 writes to the root directory):

.--------.  .--------.  .--------.  10,000 root writes = 1 anchor write
| super  |->| super  |->| super  |  100*100*100 = 1,000,000 writes until expansion
| block  |  | block  |  | block  |
|        |  |        |  | + root |
'--------'  '--------'  '--------'

And so on. This grows very quickly.

Assuming each block lasts a certain number of erase cycles, say conservatively 1000, before going bad (which isn't quite how it works), you would need $100^{\frac{1000}{100}} = 100^{10} = 10^{20}$ or 100,000,000,000,000,000,000 writes to the root directory before the anchor goes bad. Since the root directory is also being worn down during this, we'd need at least $\frac{10^{20}}{1000} = 10^{17}$ blocks to even accomplish this. Assuming 4 KiB blocks, that's a 355 EiB disk. Exponential growth is fun!


I've seen that there are many reads from the first four bytes of the superblocks. Could those be cached by littlefs? Might that be a good optimization?

Ah yes, every path lookup goes through the root directory, which needs to read the 4-byte revision count to know which of the two blocks in the metadata-pair is the most recent. That would be good to cache.

littlefs is still being developed, but the priority has been on performance without RAM, so there's probably a lot of low-hanging fruit for smarter caching. I've been thinking a general-purpose block-pair => metadata cache that avoids metadata fetches (which require this revision-count read and CRC calculation) could be quite powerful; it just hasn't been a priority to investigate yet.

Also, being able to cache more than a block of memory would be good if you have the RAM/code size for it.
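
Until something like that exists inside littlefs, one workaround is to cache at the block device layer. Here's a sketch that caches the head of each block (offset 0), where the revision count lives; the sizes are assumptions (1024 blocks, 512-byte read_size, so 512 KiB of cache leaning on your external SDRAM), and the prog/erase wrappers must invalidate the entry for the block they touch:

```c
// Sketch: cache the head of each block (offset 0) at the block device layer,
// since the 4-byte revision count there is re-read on every metadata fetch.
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include "lfs.h"

#define BLOCK_COUNT     1024
#define HEAD_CACHE_SIZE 512   // a multiple of read_size, covering offset-0 reads

// assumed underlying (uncached) driver read
extern int raw_read(lfs_block_t block, lfs_off_t off, void *buffer, lfs_size_t size);

static uint8_t head_cache[BLOCK_COUNT][HEAD_CACHE_SIZE];
static bool    head_valid[BLOCK_COUNT];

int cached_read(const struct lfs_config *c, lfs_block_t block,
                lfs_off_t off, void *buffer, lfs_size_t size) {
    if (off == 0 && size <= HEAD_CACHE_SIZE) {
        if (!head_valid[block]) {
            int err = raw_read(block, 0, head_cache[block], HEAD_CACHE_SIZE);
            if (err) {
                return err;
            }
            head_valid[block] = true;
        }
        memcpy(buffer, head_cache[block], size);
        return 0;
    }
    return raw_read(block, off, buffer, size);
}

// the prog and erase wrappers must clear head_valid[block] for the block they touch
```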

I'm also seeing that iterating over files in a directory and getting their sizes involves a lot of reads and is thus relatively slow. I suppose it's a design limitation of a wear-leveling filesystem that you need to read all pages of a file to get its size.

The file size is stored with the other metadata, so that's not quite it. You're likely seeing the scanning that is part of the block allocator. Blocks are garbage-collected on the first write after mount and kept track of in a lookahead bitmap in RAM. Building this lookahead bitmap is expensive, as we need to traverse the full filesystem tree.

I'm hoping to improve this by moving to an on-disk free list of sorts, but that will take some time to get right.

littlefs validates the data written to the flash by reading it back and comparing it. If the flash can do this internally, would it be reasonable to disable this validation? I think this may also give further speedups. Or are you doing this to detect transmission errors on the signal lines to the flash?

To be honest the biggest value of this check is to catch integration mistakes. It's strictly unnecessary if you trust your drivers, bus, and storage to write correctly or report an error.

Unfortunately, integration can be tricky to get right, and hard to debug when it goes wrong since you don't know whether the write or the read is the problem. I think the original bug that led to this check was an SPI bus running at a higher frequency than the capacitance of the traces allowed. That was tricky to track down until you put the bus on a scope and see it going wiggle wiggle wiggle.

Maybe we should make this check optional, but when you compare the overhead of reading/checksumming those bytes versus erasing and programming, it hasn't been worth removing.


Hopefully this also frames the current state of littlefs well: it can be useful depending on the application, but is still improving.

Christian-Sander commented 1 year ago

Thanks for the replies! It's definitely useful; I'm just looking for some ways to speed things up a little. Maybe I'll try to cache the last accessed block and the first 4 or 8 bytes of each page. I have 32 MB of SDRAM, so I may have a few bytes to spare.

Regarding the superblocks, I think one could use the block remapping feature of the W25N01GV flash. I attempted that last week but had issues implementing that specific command. I'm waiting for a reply from Winbond right now.

Christian-Sander commented 1 year ago

I'm sometimes getting stuck in an infinite loop in lfs, in lfs_dir_traverse (see the attached screenshot). My driver logged calls to these actions (screenshot attached): p=program, e=erase, b=block (128 kB), p=page (64 pages/block with 2 kB size), o=offset, s=size. Any ideas? Is the filesystem damaged somehow, or could there be a bug in my driver?

Christian-Sander commented 1 year ago

Now I also got an error in the log when it happened again (screenshot attached):

lfs e b0
../src/ext/littlefs/lfs.c:1228:error: Corrupted dir pair at {0x1, 0x0}
../src/ext/littlefs/lfs.c:1228:error: Corrupted dir pair at {0x1, 0x0}

I know that I still have thread safety to take care of, but I don't do many write accesses right now so I don't think that's the reason.

I will try to disable my caching to see if it's caused by that.

geky commented 1 year ago

I do have improved cycle-detection over here, though it only detects cycles faster if that is the issue: https://github.com/littlefs-project/littlefs/pull/746

The stack trace looks like it's the first call to the garbage collector, which scans the filesystem to find free blocks to allocate. It's an expensive operation depending on how many blocks are in your filesystem.

But the corrupted dir {1,0} errors are strange. That shouldn't happen in normal operation unless the filesystem is unformatted. Maybe writes aren't going through to disk for some reason?

Christian-Sander commented 1 year ago

I had it running for a few minutes. I have 1024 blocks in total, with 64 pages of 2 kB each. These blocks are spread across 3 filesystems and 20 spare blocks. I think the filesystem in question is approximately 900 blocks. As mentioned, I will try to disable my cache now and see if I can reproduce this problem. I'll also check how the new cycle detection algorithm behaves when this problem occurs and report back.

Christian-Sander commented 1 year ago

I can reproduce the problem with the Brent algorithm, but I don't get any infinite loop detection.

Here are some stack contents, maybe that helps? (screenshots attached)

disk.block stays at 1, and disk.off stays below 100000 as far as I can tell (maybe it stays below the block size of 131072?). tag changes as well, with its value switching between 4294967295 and 2147483648 on each iteration at this line: tag = (lfs_frombe32(tag) ^ ptag) | 0x80000000; Is it reading the tag from an erased flash region? My understanding of lfs is not good enough to interpret these results. The value in stack[1].disk.block also seems wrong.

Does the flash need to be erased when you try to mount it the first time? My code calls format if mount fails. Does format erase all blocks? Or should I also do this manually to be sure? The NAND flash contains bad block markers in the blocks that are bad. I haven't managed to use the bad block remapping feature of the flash yet and rely on lfs right now.

My cache implementation right now simply caches the content of the superblocks and returns the cached values on reads. Writes and erases are performed immediately and also applied to the cached data. The cache is invalidated on write or erase errors to the superblocks and reloaded from flash. In my understanding, the sync function can remain empty then?
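
For reference, the mount-or-format logic I use, roughly following the littlefs README example, together with the empty sync in question (a sketch; cfg stands for my block device configuration):

```c
// The mount-or-format pattern in question, roughly as in the littlefs README;
// with a write-through cache and no buffering below littlefs, the sync
// callback can indeed just return 0.
#include "lfs.h"

extern const struct lfs_config cfg;  // block device configuration

int bd_sync(const struct lfs_config *c) {
    return 0;  // nothing is buffered below littlefs, so nothing to flush
}

int mount_or_format(lfs_t *lfs) {
    int err = lfs_mount(lfs, &cfg);
    if (err) {
        // first boot or unformatted flash: format, then mount again
        err = lfs_format(lfs, &cfg);
        if (err) {
            return err;
        }
        err = lfs_mount(lfs, &cfg);
    }
    return err;
}
```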

Christian-Sander commented 1 year ago

It was a bug in my caching code, and it's fixed now. Erasing block 0 corrupted the contents of the block 1 cache. I'll keep an eye on this to see if the loop occurs again.

kyrreaa commented 4 months ago

It would have been nice for corrected ECC errors to "bubble up" so they could be handled somewhere, but I guess this would have to happen at the NAND flash level and not in littlefs. I am wondering, though: when littlefs looks for a new block to expand metadata into, or to add to a file, it checks whether the block needs erasing or is already used by another file (i.e. the free space map). Is it possible to write something to the weak blocks during formatting to mark them as unavailable? Or is the lookahead actually populated by having an "address range" or "scope" and then built by traversing the directory tree? In that case it would reuse any blocks belonging to a file lost due to power loss before sync or file close.

geky commented 4 months ago

Good question, it's currently the latter. littlefs doesn't keep track of what blocks have had an error, so it will eventually try to reuse them.

BUT, in future work that is designed but not yet started, the plan is to add an optional on-disk block-map for tracking in-use/unused/erased/bad blocks.

This should also speed up block allocation, and allow for pre-erasing blocks in lfs_fs_gc. So three birds with one stone hopefully.

I'm also considering adding more advanced bd errors (hypothetical, names TBD), which you could decide between based on the number of bit errors. A single bit-flip may not be a bad sign yet, but multiple may mean the block needs to be retired, for example. This is likely highly hardware dependent.

kyrreaa commented 4 months ago

I am using lfs in Zephyr on a Nordic nRF53 (using third-party added code, not the one in the Zephyr branch of the Nordic SDK, as it lags behind). Since the API in Zephyr does not allow these bit errors to bubble up, I log them separately. I have noticed 1-3 bit errors now and then on quite new chips and usually it does not re-occur on same block upon next re-use. I do however see a few erase faults on some blocks on a few chips, and these appear persistent. My thought there was to mark these blocks somehow, so an initial check from lfs could determine the block to be not worth using.

geky commented 4 months ago

This will need fleshing out when the hypothetical block map actually starts to take shape, but there we will probably also want something like lfs_fs_bad/unbad(lfs_t *lfs, lfs_block_t block) (naming is hard), to allow users to proactively mark blocks as bad outside of bd operations.

I have noticed 1-3 bit errors now and then on quite new chips and usually it does not re-occur on same block upon next re-use. I do however see a few erase faults on some blocks on a few chips, and these appear persistent.

My understanding is there are a couple of different sources of bit errors in NAND:

  1. Read/write disturbs or gamma rays, where the density of NAND just leads to some electrons leaking sometimes

    These aren't a sign of a failing block, but you may need error correction to prevent filesystem corruption.

  2. Manufacturing errors, where it's just not possible/financially reasonable to build these things perfectly

    I think most NAND usually guarantees less than some percentage of bad blocks after manufacturing. Some also mark the bad blocks with a special value coming out of the factory. It may be worth checking the datasheet to see if it says anything about this.

To handle the factory-bad-blocks, littlefs really needs this hypothetical block-map.


A possible workaround for now, though it is a bit involved:

You could reserve the first page of each block to store your own bookkeeping, and mark bad blocks somehow (either by relying on the factory marks, or by storing some big unique value in every good block).

Then you could report LFS_ERR_CORRUPT if littlefs tries to erase such a block. This won't stop littlefs from trying to erase the block again, but on each erase failure littlefs will just move on to the next available block.
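
A sketch of just the erase-refusal half of that workaround, where bad_block_marked() is a stand-in for however you end up checking the marker and raw_erase() is the underlying driver erase:

```c
// Sketch of the erase-refusal idea: refuse to erase blocks marked bad, so
// littlefs moves on to another block. bad_block_marked() is hypothetical
// (factory mark or your own value in the reserved first page).
#include <stdbool.h>
#include "lfs.h"

extern bool bad_block_marked(lfs_block_t block);
extern int  raw_erase(lfs_block_t block);

int bd_erase(const struct lfs_config *c, lfs_block_t block) {
    if (bad_block_marked(block)) {
        // littlefs treats this like any other erase failure and moves on to
        // the next available block, though it may retry this block later
        return LFS_ERR_CORRUPT;
    }
    return raw_erase(block);
}
```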

Though long-term the hypothetical block-map is the correct solution to this.