littlefs-project / littlefs

A little fail-safe filesystem designed for microcontrollers
BSD 3-Clause "New" or "Revised" License

Is lfs_config.block_cycles the endurance of the device? #660

Open AdvelC opened 2 years ago

AdvelC commented 2 years ago

1) "_block_cycles is the number of erase cycles before littlefs evicts metadata logs as a part of wear leveling. Suggested values are in the range of 100-1000, or set blockcycles to -1 to disable block-level wear-leveling.".

Is block_cycles the same as the endurance of an NVM device? The online simulation seems to show an area no longer being used once block_cycles is reached. A serial NOR flash device may have an endurance of 100k program/erase cycles, so I wondered why the littlefs comment suggests a range of 100 to 1000. If it is not the endurance of a block, how does littlefs know when a block on the device, or an area in a block, is worn (there is no other related item in lfs_config)? The erase and program times of an area on a device tend to go up as the area becomes more worn. If we design the erase and write interface functions to return an error indicating that the operation failed because the NVM manufacturer's maximum operation time was exceeded, can LFS use that to tag a block as worn?

2) Is it possible to steer a logical file into a physical area of the device, such as allocating a 3KB file to reside in a single physical 4KB block on a device rather than it spanning more than 1 physical block or being fragmented across multiple blocks?

3) Is there an active user forum that discusses LFS other than the ticket system on github?

Thanks.

thrasher8390 commented 2 years ago
1. I'm not aware of any other forum, sadly.

I'm just a user of LFS, and my recollection is that LFS uses read verification to start marking blocks as corrupted. It doesn't store this information in flash, so it would need to relearn it after each power cycle. To my knowledge, LFS doesn't use any static information for block wear; it is all dynamic.

geky commented 2 years ago

The online simulation seems to show an area being no longer used when block_cycles is reached.

Ah, that simulation is unfortunately quite out of date. It's still using v1, which relied on bad-block reporting but didn't provide true dynamic wear-leveling. v2 now has dynamic wear-leveling (which introduced the block_cycles configuration option).

The first thing to note is that you can use "bad-block" information such as the NVM's operation time! (This is the first time I've heard of this failure condition; is this a flash device, if you don't mind me asking?) If you return LFS_ERR_CORRUPT from the block device erase or prog functions, littlefs will assume the block is now bad and move any data to a new block.
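
As a rough sketch of what that could look like (the SPI-NOR helper functions and the timeout value here are hypothetical placeholders, not littlefs or vendor APIs), an erase hook that reports a worn block based on its erase time might be:

```c
#include <stdbool.h>
#include <stdint.h>
#include "lfs.h"

// Hypothetical SPI-NOR driver helpers -- placeholder names, not littlefs or
// vendor APIs.
extern void     nor_erase_sector(uint32_t addr);  // issue a sector-erase command
extern bool     nor_is_busy(void);                // poll the flash status register
extern uint32_t board_millis(void);               // monotonic millisecond tick

#define NOR_SECTOR_SIZE   4096u
#define NOR_ERASE_MAX_MS  400u   // datasheet maximum sector-erase time (example value)

// Erase hook for struct lfs_config.erase: if the erase takes longer than the
// manufacturer's specified maximum, report LFS_ERR_CORRUPT so littlefs treats
// the block as bad and relocates its data.
static int bd_erase(const struct lfs_config *c, lfs_block_t block) {
    (void)c;
    uint32_t start = board_millis();

    nor_erase_sector(block * NOR_SECTOR_SIZE);
    while (nor_is_busy()) {
        if (board_millis() - start > NOR_ERASE_MAX_MS) {
            return LFS_ERR_CORRUPT;  // over the max erase time -> treat as worn/bad
        }
    }
    return 0;
}
```

A prog hook could apply the same kind of check against the maximum page-program time.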

However, as @thrasher8390 mentioned, it doesn't store this information anywhere, so the allocator may attempt to use the block again later for new data. This may change with a rework of the allocator (https://github.com/littlefs-project/littlefs/issues/75), but it's not clear yet what that will look like.


So, the reason this changed in v2 is that relying only on bad-block information turned out not to be a good general-purpose strategy for deteriorating storage. One issue is that not all storage devices can detect bad blocks, but even on those that can detect them with a heuristic, intentionally letting blocks accumulate excessive wear can lead to unexpected behavior and data loss. One of the bigger problems is that blocks with excessive wear retain data for a shorter period of time, physically leaking electrons because the insulation has worn down.

The way dynamic wear-leveling works in littlefs is by keeping a rough idea of how many times each block has been erased, and after a certain number of cycles relocating the data to a new block. Unfortunately the act of relocating the data is relatively expensive, so we don't want to relocate it on every single write. This relatively arbitrary number is what the block_cycles configuration variable controls, honestly just for lack of a better name.

Higher numbers mean data is relocated less often, but wear is less evenly distributed. You could turn it off and rely only on bad-block information to move data, but that would risk the problems above.

That's why the suggested numbers are ~100-1000; if you raised it to 100k, that would effectively be the same as disabling it completely.
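
As a rough illustration, here's what block_cycles might look like in a configuration for a hypothetical 16 MiB NOR part with 4 KiB sectors and 100k P/E endurance (the bd_* hooks and the geometry values are assumptions, not taken from a real driver):

```c
#include "lfs.h"

// Assumed block-device hooks for the NOR part (bd_erase could be implemented
// as in the earlier sketch).
extern int bd_read(const struct lfs_config *c, lfs_block_t block,
                   lfs_off_t off, void *buffer, lfs_size_t size);
extern int bd_prog(const struct lfs_config *c, lfs_block_t block,
                   lfs_off_t off, const void *buffer, lfs_size_t size);
extern int bd_erase(const struct lfs_config *c, lfs_block_t block);
extern int bd_sync(const struct lfs_config *c);

static const struct lfs_config cfg = {
    .read  = bd_read,
    .prog  = bd_prog,
    .erase = bd_erase,
    .sync  = bd_sync,

    .read_size      = 16,
    .prog_size      = 256,    // NOR page size
    .block_size     = 4096,   // physical sector size
    .block_count    = 4096,   // 16 MiB part
    .cache_size     = 256,
    .lookahead_size = 16,

    // Not the device endurance: relocate metadata logs after ~500 erases per
    // allocation. A 100k P/E part keeps its full endurance; setting this to
    // 100k would effectively disable block-level wear-leveling.
    .block_cycles   = 500,
};
```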

Is it possible to steer a logical file into a physical area of the device, such as allocating a 3KB file to reside in a single physical 4KB block on a device rather than it spanning more than 1 physical block or being fragmented across multiple blocks?

Currently, no. One option is increasing the configured block size to 4KiB; it only needs to be a multiple of the physical erase size. But the block allocator isn't smart enough to find sequential blocks.
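
As a sketch of that knob (the geometry here is illustrative, not from a real part): on a device with 4 KiB physical sectors, the logical block size could just as well be any larger multiple, such as 8 KiB, as long as the erase hook then erases the whole logical block:

```c
#include "lfs.h"

#define PHYS_ERASE_SIZE  4096u                   // physical sector size of the NOR part
#define DEVICE_SIZE      (16u * 1024u * 1024u)   // hypothetical 16 MiB part

// Same assumed block-device hooks as in the earlier sketch.
extern int bd_read(const struct lfs_config *c, lfs_block_t block,
                   lfs_off_t off, void *buffer, lfs_size_t size);
extern int bd_prog(const struct lfs_config *c, lfs_block_t block,
                   lfs_off_t off, const void *buffer, lfs_size_t size);
extern int bd_erase(const struct lfs_config *c, lfs_block_t block);
extern int bd_sync(const struct lfs_config *c);

static const struct lfs_config cfg = {
    .read = bd_read, .prog = bd_prog, .erase = bd_erase, .sync = bd_sync,

    .read_size      = 16,
    .prog_size      = 256,
    // Two physical sectors per logical block; bd_erase must then erase both
    // sectors that make up a given lfs_block_t.
    .block_size     = 2 * PHYS_ERASE_SIZE,
    .block_count    = DEVICE_SIZE / (2 * PHYS_ERASE_SIZE),
    .cache_size     = 256,
    .lookahead_size = 16,
    .block_cycles   = 500,
};
```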

Consider that such a file would be copied-on-write and wear-leveled, so its location would change whenever you write to it.

Requirements for this can often be solved using a partition table such as MBR, with one partition for the location-explicit data, and one partition for littlefs containing other data in the system.

Is there an active user forum that discusses LFS other than the ticket system on github?

No, though I'll take these comments as feedback that this is wanted. Thoughts on what type of forum would be best?

AdvelC commented 2 years ago

Thanks for the input @geky.

The first thing to note is that you can use "bad-block" information such as the NVM's operation time! (This is the first time I've heard of this failure condition; is this a flash device, if you don't mind me asking?) If you return LFS_ERR_CORRUPT from the block device erase or prog functions, littlefs will assume the block is now bad and move any data to a new block.

It's serial NOR flash. For example, https://www.infineon.com/dgdl/Infineon-AN202731_Understanding_Typical_and_Maximum_Program_Erase_Performance-ApplicationNotes-v03_00-EN.PDF?fileId=8ac78c8c7cdc391c017d0cf9df6c576b which says:

When flash memory cells are manufactured, the individual cells in the array program and erase at slightly different rates following a Gaussian-like distribution. A very high percentage of cells program and erase around the typical value. Each time a cell is programmed or erased, the measured timing difference is very slight (on the order of picoseconds). Sometimes the cell programs faster and sometimes it programs slower, trending toward a higher probability of programming more slowly, the more times it is erased. The maximum program/erase times listed in Table 1 specify the slowest-performing cell in the device, after the listed number of erase cycles and under worst-case conditions.

Also see figure 17 in "Introduction to flash memory - Proceedings of the IEEE.pdf"

So, as I understand it, an area on a device taking longer than the specified maximum to erase or program is an indicator of it being "worn".

geky commented 2 years ago

That's quite interesting, thanks for sharing. It certainly makes sense that an exceptional program/erase time would indicate a block is unreliable.

AdvelC commented 2 years ago

@geky. If the usage of a block has reached the block_cycles count and the allocator then decides to move the data in that block to a different block to implement wear leveling, can the original block be reused in the future? If we have block_cycles set to 500, then it has only partially lived its 100k P/E-cycle lifetime. So, assuming a block can keep getting freed up and subsequently reused, how does the FS keep track of the total cumulative use of that or any block? As you mention, the retention time of a block decreases as the number of P/E cycles increases, so a simple initial write and read-verify check is not sufficient to ensure the data is retained for the required design lifetime. So how would one ensure that the FS is used in a way that meets a design's data-retention requirement?

_There is information in the Cypress/Infineon app note AN217979 (https://www.infineon.com/dgdl/Infineon-AN217979_Endurance_and_Data_Retention_Characterization_of_Infineon_Flash_Memory-ApplicationNotes-v03_00-EN.pdf?fileId=8ac78c8c7cdc391c017d0d30d6b064f5) that covers retention as a function of temperature and P/E cycles._

geky commented 2 years ago

Hi @AdvelC, sorry about the late response.

Yes, blocks can be reused. block_cycles just determines how many erase operations we allow on a block per allocation, trading better performance and fewer metadata updates for less evenly distributed wear. After this and other discussions I'll probably change the name to "aggressiveness" or something similar the next time we make breaking changes to the API.

On top of this, LittleFS provides a form of statistical wear leveling by allocating blocks in a uniform distribution. This is done by allocating blocks linearly from a random starting position chosen at boot. This won't perfectly level wear, but LittleFS can only provide dynamic wear leveling anyway, and the behavior of flash as it approaches end-of-life is already a probabilistic system.
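
For illustration only, the allocation pattern described above looks roughly like the following (this is not the actual littlefs allocator; block_in_use() is a hypothetical stand-in for the real lookahead bitmap of reachable blocks):

```c
#include <stdint.h>

// Illustrative sketch of "linear scan from a random boot offset".
#define BLOCK_COUNT 4096u

extern int block_in_use(uint32_t block);  // hypothetical "is this block reachable?" check

static uint32_t alloc_off;  // random starting offset, chosen once at boot
static uint32_t alloc_pos;  // how far the linear scan has advanced

void allocator_init(uint32_t boot_entropy) {
    alloc_off = boot_entropy % BLOCK_COUNT;
    alloc_pos = 0;
}

// Returns the next free block, or -1 once the scan has wrapped all the way
// around (filesystem full).
int32_t allocator_next(void) {
    while (alloc_pos < BLOCK_COUNT) {
        uint32_t block = (alloc_off + alloc_pos) % BLOCK_COUNT;
        alloc_pos++;
        if (!block_in_use(block)) {
            return (int32_t)block;
        }
    }
    return -1;
}
```

Because the scan always advances linearly and only the starting offset is random, allocations tend toward a uniform spread over the device without storing any per-block wear counts.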

If you need a tighter level of wear-leveling, LittleFS may not fit your use case. You could put LittleFS on top of an FTL layer like Dhara FTL, but this could come with its own complexity.

In theory LittleFS could also store the wear for each block and choose the optimal block, but this would have both code-size and runtime costs without providing static wear-leveling. It could be interesting to explore though.

An interesting side note: this is similar to how most log-based filesystems/FTLs work. By allocating/writing all blocks in a linear cycle, you know all blocks are within ±1 erase of each other without needing to store any other metadata.


It's also worth noting this scheme is more important for storage types such as SD/eMMC, which don't support partial block writes. Without the ability to write multiple updates to metadata blocks, performance degrades catastrophically, and LittleFS behaves no differently from a traditional CoW filesystem, where all wear ends up duplicated into the root block.