Top root block(s) wear leveling

MaJerle commented 3 months ago

Reading the design guide, specs and some of the topics, it is not fully clear how littlefs manages the root block on a physical disk with wear leveling. My understanding of the COW and backward linked-list + CTZ-lists is that if we append data to the file, we are effectively increasing number of blocks, therefore the root must also be updated to point to the latest entry in the list.

Some questions I'm unable to answer:

How does wear leveling work for root blocks?
Will littlefs perform a scan of full disk (all blocks) and find most recent root information, before doing anything else?

EternityForest commented 2 months ago

As far as I can tell, the root block points to the root directory.

Every time we change the root dir N times, we have to move it, which would require rewriting the root block.

The spec IIRC says the superblock is only rarely written, when the root dir needs to be moved.

But N is usually like 1000, so it usually works just fine?

Seems like if you regularly changed stuff in the root dir, you could have an issue after hundreds of millions of changes, so perhaps the documentation should tell us to keep frequently changed stuff in a subdir?

MaJerle commented 2 months ago

For me, you always have to know where "end of" linkedlist is, otherwise how do you know where to start? I may not fully understand this part of littlefs.

And if you modify subdir file, then you also need to inform parent to point to the new end of file, which needs to know to point to the new one. Then you have to modify root again, no?

EternityForest commented 2 months ago

I think they are modifying the parent pointer in place, so the pointer to it remains valid, it's the same exact physical block.

They only move it every 1000 updates or so, so each level up in the heirarchy gets 1000 times fewer write cycles than the one before it.

On Wed, Sep 25, 2024, 8:17 AM Tilen Majerle @.***> wrote:

For me, you always have to know where "end of" linkedlist is, otherwise how do you know where to start? I may not fully understand this part of littlefs.

And if you modify subdir file, then you also need to inform parent to point to the new end of file, which needs to know to point to the new one. Then you have to modify root again, no?

— Reply to this email directly, view it on GitHub https://github.com/littlefs-project/littlefs/issues/1020#issuecomment-2374222112, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAFZCH6FZIGSLQAMZS4EU2DZYLAWBAVCNFSM6AAAAABNBYTKBCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGNZUGIZDEMJRGI . You are receiving this because you commented.Message ID: @.***>

MaJerle commented 2 months ago

I'm kinda confused. If littlefs performs update "in place", then this destroys the whole point of wear leveling in my view.

Or am I wrong?

EternityForest commented 2 months ago

If I understand it right, it only does the in place update 1000 times, then it moves on to another pair of blocks, the only thing that permanently stays in the same place is the superblock which is only very rarely written to

On Wed, Sep 25, 2024, 1:30 PM Tilen Majerle @.***> wrote:

I'm kinda confused. If littlefs performs update "in place", then this destroys the whole point of wear leveling in my view.

Or am I wrong?

— Reply to this email directly, view it on GitHub https://github.com/littlefs-project/littlefs/issues/1020#issuecomment-2375075018, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAFZCHYNKC2N4U2BQQOXTY3ZYMFMTAVCNFSM6AAAAABNBYTKBCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGNZVGA3TKMBRHA . You are receiving this because you commented.Message ID: @.***>

geky commented 2 months ago

Hi @MaJerle, @EternityForest, thanks for creating an issue, sorry about the late response.

Forgetting to include superblock expansion in DESIGN.md was an oversight. There's a terse description in SPEC.md, but that's sort of the wrong place for it. I thought for sure I wrote more on the feature, but I think it was only this commit: 7c70068, and never made it made its way into DESIGN.md...

Which is a shame because I think it's quite neat.

TLDR

To keep writes to the superblock anchor as low as possible, littlefs increases the length of the superblock chain every block_cycles erases to the anchor, which is when littlefs would normally try to wear-level the mdir. This results in exponentially fewer writes to the superblock anchor over time.

High-level

The problem is interesting: How do you find the superblock/root of the filesystem? You can scan the disk for it, but this gets expensive. You can store it at a fixed location, but then it wouldn't participate in wear-leveling and risk an early death of the filesystem.

The solution in littlefs is sort of a form of exponential backoff. We store a superblock at a fixed location (blocks {0,1}), but write to it exponentially less often.

The way this works is that we actually have a linked-list of superblocks, with the head being the "anchor" superblock, and the tail being the "active" superblock. The anchor superblock is always at blocks {0,1}, which makes finding it easy, while the other superblocks in the chain participate in wear-leveling and move around the disk:

blocks {0,1}
     |
     v
 .--------.  .--------.  .--------.  .--------.  .--------.
.| anchor |->| super  |->| super  |->| super  |->| active |
|| super  | || block  | || block  | || block  | || super  |
|| block  | ||        | ||        | ||        | || block  |
|'--------' |'--------' |'--------' |'--------' |'---|----'
'--------'  '--------'  '--------'  '--------'  '----v---'
                                              rest of the filesystem

For the most part, we only write to the active superblock. However, moving the superblock as a part of wear-leveling requires updating the pointer in its parent. This causes writes to slowly propagate up the superblock chain, and eventually reach the superblock anchor.

Here's the main trick: Everytime we accumulate enough writes to want to move the superblock anchor, we instead increase the length of the superblock chain by 1.

Example

This is all controlled by the block_cycles config option.

At first, after formatting, the filesystem contains a single superblock:

blocks {0,1}  total erases: 0
     |        anchor erases: 0
     v
 .--------.
.| anchor |
|| super  |
|| block  |
|'---|----'
'----v---'
rest of the filesystem

After block_cycles erases to the superblock, let's say 1000, we add a new active superblock:

blocks {0,1}  total erases: 1000
     |        anchor erases: 1000
     v
 .--------.  .--------.
.| anchor |->| active |
|| super  | || super  |
|| block  | || block  |
|'--------' |'---|----'
'--------'  '----v---'
          rest of the filesystem

This new superblock doesn't need to move until another block_cycles erases, at which point we need to update the superblock anchor:

blocks {0,1}  total writes: 2000 
     |        anchor erases: 1001
     v
 .--------.  .--------.
.| anchor |->| active |
|| super  | || super  |
|| block  | || block  |
|'--------' |'---|----'
'--------'  '----v---'
          rest of the filesystem

This continues until the superblock anchor reaches block_cycles erases, at which point we extend our superblock chain again:

blocks {0,1}  total writes: 1001000
     |        anchor erases: 2000  
     v
 .--------.  .--------.  .--------.
.| anchor |->| super  |->| active |
|| super  | || block  | || super  |
|| block  | ||        | || block  |
|'--------' |'--------' |'---|----'
'--------'  '--------'  '----v---'
                      rest of the filesystem

This goes on forever:

blocks {0,1}  total writes: 1001001000
     |        anchor erases: 3000 
     v
 .--------.  .--------.  .--------.  .--------.
.| anchor |->| super  |->| super  |->| active |
|| super  | || block  | || block  | || super  |
|| block  | ||        | ||        | || block  |
|'--------' |'--------' |'--------' |'---|----'
'--------'  '--------'  '--------'  '----v---'
                                  rest of the filesystem

Math

The funny thing about exponential growth is that it's exponential.

With this scheme, after $x$ writes to the root, we should end up with $n$ superblocks:

$$ n = 1 + \log_c x $$

Where $c$ is the configured block_cycles.

And since the each superblock requires writing to the anchor $c$ times, we can assume an upper bound on the number of erases to the superblock anchor, $e$:

$$ e = c \cdot n $$

Putting these two together we can solve for how many writes, $x$, are needed to erase the anchor $e$ times:

$$ e = c \cdot \left(1+\log_c x\right) $$

$$ \log_c x = \frac{e}{c}-1 $$

$$ x = c^{\frac{e}{c}-1} $$

So lets say we have configured block_cycles=1000 on a disk with 10K erase cycles. How many writes would be needed to make the superblock anchor to go bad?

$$ x = 1000^{\frac{10000}{1000}-1} $$

$$ x = 1000^{9} $$

$$ x = 10^{27} $$

So $10^{27}$ or ~1000,000,000,000,000,000,000,000,000 total writes.

But each write is also wearing out the other blocks on the disk. So to even get to this point, you'd need $\frac{10^{27}}{10000}$ or $10^{23}$ blocks. With 4KiB blocks, that's a ~338 YiB disk.

Notes

Sorry if I'm repeating some of what @EternityForest commented, I started writing this yesterday but putting it all together was a bit complicated.

Some other things to note:

Some of the motivation for this design was the observation that many flash devices offer hardened/better tested flash for the first couple blocks. I think since chip makers know filesystems usually put superblocks/partition tables/important stuff there.
One design goal was for littlefs to work with only two blocks. This is common in really small internal flash, and is why littlefs starts with only a single merged superblock+root mdir.
There are a couple extra checks to prevent superblock expansion if the disk gets too full.
I'm being a bit handwavey with erases vs writes. Our metadata blocks are actually logs, and can usually be committed to multiple times before needing an erase.

Fortunately you can model multiple commits-per-erase by saying $c$ is equal to commits*block_cycles. If you include commits-per-erase, the chance of wearing out the superblock anchor gets vanishingly small.
As @EternityForest mentioned, the filesystem layout affects how writes propagate up the tree. It's possible that the superblock will never expand if you have things in subdirectories.

That being said, the intention of superblock expansion is to grow based on your write pattern. So I wouldn't proactively move things into subdirectories unless you have measureable issues (performance is a different story).

geky commented 2 months ago

Some other TLDR because the above turned into a text wall

Will littlefs perform a scan of full disk (all blocks) and find most recent root information, before doing anything else?

No, the superblock anchor is always at blocks {0,1}.

Seems like if you regularly changed stuff in the root dir, you could have an issue after hundreds of millions of changes, so perhaps the documentation should tell us to keep frequently changed stuff in a subdir?

The superblock expansion scheme should adapt to your directory layout. After one expansion (block_cycles erases), writing to /a.txt is equivalent to writing to /a/b.txt.

MaJerle commented 2 months ago

Thanks for extensive explanation, few things are more clear. So we use the "normal" linked list (not backwards one) for superblocks.

What I struggle to understand for some reason (please point me where I can read more) is how do we achieve logarithmic part and not linear. Why?

If you have, let's say, 1k erase cycle of the underlying flash, and we put 1000 as a cycling of the block, then, in my understanding:

Every time we write data to flash to our file, we append the entry to the backward file link list.
When we do this, we have to update the parent now, so that parent knows where last entry exists to do backward linkedlist search
When we do that, we have to update parent again
And finally we have to update the super block. Every update of the block involves the erase. So we can only perform 1k erases before we wear out and have to find new block, no?

The log mathematics, in my view, explicitly says that when we modify one block 1000 times, we will modify another block 1 time, and we will AGAIN modify previous block 1000 times. But the block already being erased 1000 times will wear out, no?

Don't we update the superblock every time we write the file? If an answer is NO, where do we store the pointer to the last block allocated for the file that we just wrote to? Because how otherwise root dir knows where the file is to search for with backwards linkedlist? There must be a chain somewhere, somehow?

geky commented 2 months ago

Sorry if I'm misunderstanding the question.

If you have, let's say, 1k erase cycle of the underlying flash, and we put 1000 as a cycling of the block, then, in my understanding:

I think this might be the confusion.

block_cycles should always be at least an order of magnitude less that the supported erase cycles of the flash. So if your flash supports 1K erase cycles, block_cycles=10 or block_cycles=50 is more reasonable.

Most NOR flash I've seen supports ~100K, NAND flash usually ~10K, so block_cycles=1000 is reasonable in that context. Though you should always reference your device's datasheet.

You're right that block_cycles=1000 on flash with 1K would not work and probably break the device early (though with multiple commits per erase due to logging, and the fact that manufacturers advertise conservative minimums, you would probably be fine).

Don't we update the superblock every time we write the file? If an answer is NO, where do we store the pointer to the last block allocated for the file that we just wrote to? Because how otherwise root dir knows where the file is to search for with backwards linkedlist? There must be a chain somewhere, somehow?

You're correct. If you store a file in the root, every write to the file adds a commit to the root mdir/superblock.

geky commented 2 months ago

I do think block_cycles is a bad name, and plan to change it if/when there is a major API change.

Current ideas are block_recycles or wl_aggression, though I'm open to suggestions.

MaJerle commented 2 months ago

You're correct. If you store a file in the root, every write to the file adds a commit to the root mdir/superblock.

What I'm missing in the story, how, even if you store the file in the subdirectory, can littlefs find last file information? If it starts from root search that has forward linked list, then there must be an update to this list everytime when you update any file in the system.

ROOT -> SUBDIR1 -> FILE1

File1 gets updated, new entry is added to the end of backward list (which now becomes entry 1st).

ROOT -> (must point to new SUBDIR1 end) SUBDIR1 -> (must point to new end of FILE1) FILE1

So how do you update the chain from new file entry up to the top of chain to be able to find entries on a read/write next time

geky commented 2 months ago

ROOT -> (must point to new SUBDIR1 end) SUBDIR1 -> (must point to new end of FILE1) FILE1

Ah, so directories in littlefs are not represented the same as files (DESIGN.md#directories).

Files are stored as backwards skip-lists, but directories are stored as "forward" linked-lists similar to the superblock chain. This is possible because the directory blocks are also metadata logs, and can be updated in-place (up to block_cycles for wear-leveling).

The tradeoff is filename lookup is $O(n)$, which isn't great if you have many files in a single directory (there are some plans to improve this).

MaJerle commented 2 months ago

OK thanks - that clue was helpful.

From flash standpoint, if I cite your comment you linked above:

From here:

.--------.  1 root write = 1 anchor write
| super  |  100 writes until expansion
| block  |
| + root |
'--------'

To here

.--------.  .--------.  100 root writes = 1 anchor write
| super  |->| super  |  100*100 = 10,000 writes until expansion
| block  |  | block  |
|        |  | + root |
'--------'  '--------'

and later here:

.--------.  .--------.  .--------.  10,000 root writes = 1 anchor write
| super  |->| super  |->| super  |  100*100*100 = 1,000,000 writes until expansion
| block  |  | block  |  | block  |
|        |  |        |  | + root |
'--------'  '--------'  '--------'

If we compare this logic to the counter in a bit, let's say increasing length of number by byte every time we want to increase size:

1 byte:  0-255
2 bytes: 0 - 65535
3 bytes: 0 - xxx
4 bytes: 0 - 4294967295

Numbers are growing fast. We need log2 bits to represent them. But if we look at the level of bits, and for the sake of above example, let's say that we have 1 bit for 1 block, so we have now 3 blocks:

000 = first write
001 = second write, we have to expand

// second bit comes to game
// but LSB bit is again being toggled
010 = third write
011 = fourth write

//
100 = 5th write
101 = 6th write

As you can see to apply the logic of 2^blocks, LSB bit is changing every time, aka every bit change == flash erase. This means that bit 1 will wear extremely quickly out. So we cannot change the bit every time we increase writes.

As you can see, to achieve the 2^3 bits, LSB bit has to change the value 2^3 times in this count.

Essentially, I do not understand why writes grow log(x) in the root, and not linearly as we see above in my bit example?

When first block performs block_cycles writes, it (let's consider factor 10 less) can be rewritten again later only 10x again. So we have to move to the next block and NOT touch previous one anymore. We repeat the same for second block, we can only modify it 10x times full write. This is to me 2*10x because when we are modifying second block, we cannot modify first one again as it will wear out. This is opposite to the bit counting.

Don't know, maybe I just need to sleep more.

geky commented 2 months ago

If we compare this logic to the counter in a bit, let's say increasing length of number by byte every time we want to increase size:

Viewing block cycles as digits in a number is a good way to think about.

As you can see to apply the logic of 2^blocks, LSB bit is changing every time, aka every bit change == flash erase. This means that bit 1 will wear extremely quickly out. So we cannot change the bit every time we increase writes.

Ah, but after the first expansion, bit 1 is now free to move around the disk. Every time it "flips" we move it to a new block as a part of wear-leveling. If this wears out to the point of failure, which is possible, most of the disk should be reaching its end of life.

So we're really only worried about the blocks that are not participating in wear-leveling, which is only the superblock anchor. This is where the log comes from:

12 -> 34 -> 56 -> 78
 ^    '------.-----'
 |    moving around disk
 |
fixed location, but written c*(1+log_c(x))

Don't know, maybe I just need to sleep more.

This is why my initial response took so long, it can get confusing :)

littlefs-project / littlefs