lfs_dir_getslice(): why do we subtract gdiff from gtag?

andriyndev commented 6 months ago

I cannot understand this exact detail. From what I understand, we are looking for a tag that matches masked gtag (pattern), why not then we just use gtag instead of gtag - gdiff? And what is actually gdiff (don't understand how and why it's calculated too)?

andriyndev commented 5 months ago

@geky please, help!

geky commented 5 months ago

Hi @andriyndev, sorry for the late response, this is a good question.

As an aside, why are these prefixed with 'g'? I honestly don't remember...

gdiff tracks the relative difference between the current id and id in the metadata log's history.

This is probably best explained with an example. Consider what happens if we create files a, b, c, and then delete a:

id        0 1 2
---------------
create a  .
          |
create b  | .
          | |
create c  | | .
          | | |
delete a  ' / /
           / /
          b c

If we want to lookup some tag on, say, b, we would start with id=0. Keep in mind we lookup backwards. When we see the "delete a", we need to increment giff so gtag+gdiff shows the current b's id (now id=1).

Eventually, when we reach "create b", we can report the tag we found, but we need to report the current id, not the one we found in the past. By tracking gdiff in a separate variable, we can keep track of what the id should be in the past, while also knowing what the current id should be.

Hopefully this info helps.

andriyndev commented 5 months ago

Thank you @geky ! Do I understand correctly, that when we delete a file, the id of each file, created after it, decreased by 1? If yes, then what exactly the id field of a tag contains? For example, we create a file, then delete it, and then create another file, what will be the id of the second file - 0 or 1 (actual and written in the corresponding tag field)?

andriyndev commented 5 months ago

By the way

0x401 LFS_TYPE_CREATE

Creates a new file with this id. Note that files in a metadata block don't necessarily need a create tag. All a create does is move over any files using this id. In this sense a create is similar to insertion into an imaginary array of files.

The create and delete tags allow littlefs to keep files in a directory ordered alphabetically by filename.

Do I understand correctly that it might contain an already used id, and in this case ids of already present files which bigger or equal to the "new" id, are incremented by 1, and the actual ids are calculated in runtime based on the whole data of the metadata blocks related to the folder, while the ids of older metadata entries remain unchanged on the disk? As an example, we can have 10 LFS_TYPE_CREATE metadata entries with id == 0 while the actual id of these files are 9, 8, 7, 6, 5, 4, 3, 2, 1, 0 respectively, and if we add one more LFS_TYPE_CREATE metadata entry with id == 5, the actual ids will become 10, 9, 8, 7, 6, 4, 3, 2, 1, 0, 5?

geky commented 5 months ago

Do I understand correctly, that when we delete a file, the id of each file, created after it, decreased by 1?

Yep!

If yes, then what exactly the id field of a tag contains? For example, we create a file, then delete it, and then create another file, what will be the id of the second file - 0 or 1 (actual and written in the corresponding tag field)?

The second file will have id=0:

id             0 1
--------------------
create a id=0  .
               |
delete a id=0  '

create b id=0  .
               b

I probably should have annotated what ids were written to disk in the above example. Here's some more operations:

id        0 1 2
---------------
create a id=0  .
               |
create b id=1  | .
               | |
create d id=2  | | .
               | | |
delete a id=0  ' / /
                / /
create c id=1  | .\
               | | \
create a id=0  .\ \ \
               | | | |
delete a id=0  ' / / /
                / / /
               b c d

getslice(id=0) => file b
getslice(id=1) => file c
getslice(id=2) => file d

The key thing here is that the metadata block is a log, it's append-only and we can't modify any data that's already been writte.

So if we already wrote that file b was id=0, but we want to insert a file a at id=0 to maintain sorted order, we need to pretend that file b was id=1. That's what this scan/gdiff does.

This is also one reason all metadata accesses need to scan the log, we need to see if the id has changed since the metadata was written. But we need to scan the log to find the metadata in the first place, so this isn't too much of a penalty.

As an example, we can have 10 LFS_TYPE_CREATE metadata entries with id == 0 while the actual id of these files are 9, 8, 7, 6, 5, 4, 3, 2, 1, 0 respectively, and if we add one more LFS_TYPE_CREATE metadata entry with id == 5, the actual ids will become 10, 9, 8, 7, 6, 4, 3, 2, 1, 0, 5?

I believe so:

id        0 1 2
---------------
create j id=0  .
               |
create i id=0  .\
               | \
create h id=0  .\ \
               | \ \
create g id=0  .\ \ \
               | \ \ \
create f id=0  .\ \ \ \
               | \ \ \ \
create e id=0  .\ \ \ \ \
               | \ \ \ \ \ 
create d id=0  .\ \ \ \ \ \
               | \ \ \ \ \ \
create c id=0  .\ \ \ \ \ \ \
               | \ \ \ \ \ \ \
create b id=0  .\ \ \ \ \ \ \ \
               | \ \ \ \ \ \ \ \
create a id=0  .\ \ \ \ \ \ \ \ \
               | | | | | \ \ \ \ \
create f id=5  | | | | | .\ \ \ \ \
               | | | | | | | | | | |
               a b c d e f g h i j k

getslice(id=0) => file a
getslice(id=1) => file b
getslice(id=2) => file c
getslice(id=3) => file d
getslice(id=4) => file e
getslice(id=5) => file f
getslice(id=6) => file g
getslice(id=7) => file h
getslice(id=8) => file i
getslice(id=9) => file j
getslice(id=10) => file k

These little diagrams help me understand what's going on.

and the actual ids are calculated in runtime based on the whole data of the metadata blocks related to the folder

Just a small note: The actual ids are calculated from only a single metadata block (sometimes refered to as mdir in the code). A folder is actually a linked-list of these metadata blocks, and each block is independent. This keeps operations on metadata bounded to the block size.

So to access a file, you need to know the file's mdir and its id in the mdir. Which can change due to unrelated filesystem operations.

andriyndev commented 5 months ago

Thank you! It's now clear how the id's are formed. By the way, it'd be great to add the information to DESIGN.md or SPEC.md, because it's non-trivial, and the description of LFS_TYPE_CREATE is unclear (at least it was for me).

Just a small note: The actual ids are calculated from only a single metadata block (sometimes refered to as mdir in the code). A folder is actually a linked-list of these metadata blocks, and each block is independent. This keeps operations on metadata bounded to the block size.

So to access a file, you need to know the file's mdir and its id in the mdir. Which can change due to unrelated filesystem operations.

Do I understand correctly that lfs_mdir is related to a single metadata block of a directory while lfs_dir - to the whole directory, and different lfs_mdir of the same directory can have different files with the same id? If yes, what if we want to delete a file from a previous lfs_mdir which is full and doesn't have space for the commit?

geky commented 5 months ago

Do I understand correctly that lfs_mdir is related to a single metadata block of a directory while lfs_dir - to the whole directory, and different lfs_mdir of the same directory can have different files with the same id?

Yep! The id is specific to the mdir. This makes the number of files in a directory effectively unbounded.

If yes, what if we want to delete a file from a previous lfs_mdir which is full and doesn't have space for the commit?

This is an excellent question because it touches on a problem that shows up in all copy-on-write filesystems.

Somewhat-simple answer:

When an mdir runs out of space and can no longer append a commit, it is compacted. This involves allocating a new mdir, and copying over the tags from the original mdir. At this point we can drop any deletes, pending or committed, and reclaim the space.

More complicated answer:

To prevent runaway compaction performance, we consider a compaction to fail if it doesn't reduce the mdir to 1/2 the block size. Though I guess another perspective is that all mdirs use 1/2 a block, and can overallocate by appending additional commits.

If compaction fails, we attempt to split the mdir. In this case we allocate 2 mdirs, and copy roughly 1/2 of the current tags into each. This is where a directory can expand into multiple mdirs.

Note that since an mdir can contain at most 1 block of tags, the split always results in mdirs that are roughly 1/2 full. So the split can't fail. Unless.

Even more complicated answer:

We can run out of blocks during an mdir split that is attempting to commit a delete tag. This rare, but possible situation results in a filesystem that is unrecoverable.

It's interesting to note that this issue is common in copy-on-write filesystems. To workaround this, all of the major CoW filesystems reserve some space for filesystem-specific bookmarking, ZFS, btrfs, etc.

littlefs does not do this currently, so it is currently possible to fill up a filesystem and make it unrecoverable. There are some plans to improve this with a ~88% full cutoff for "non-urgent" commits (https://github.com/littlefs-project/littlefs/issues/901 may be interesting).

andriyndev commented 5 months ago

Thank you! One more question...

if (lfs_gstate_hasmovehere(&lfs->gdisk, dir->pair) &&
            lfs_tag_id(gmask) != 0 &&
            lfs_tag_id(lfs->gdisk.tag) <= lfs_tag_id(gtag)) {
        // synthetic moves
        gdiff -= LFS_MKTAG(0, 1, 0);
    }

From what I understand, we check whether the metadata pair contains an entry which is in the process of moving, and if yes, treat it as deleted. But I cannot imagine the situation when we need to call lfs_dir_getslice() when in the process of moving, even after a power loss. Moreover, it's called only when the id of gtag is greater or equal than the id of lfs->gdisk.tag, and in this scenario I have doubts regarding validity of the id of gtag at all since the id of the original file/dir is in the process of change. At the same time, I didn't research the whole source code, so I'm not familiar with all the scenarios. By the way, if both gtag's and lfs->gdisk.tag's id can be 0, it looks like we can get an underflow when subtracting gdiff (id == 1) from gtag (id == 0).

geky commented 5 months ago

Two key things:

Moving between mdirs is not atomic, so we can lose power and be stuck "between a move".
We fix this situation as late as possible, on the next write operation. This gives us the nice invariant that readonly operations, and mount, will never write to disk.

The downside is we have to pretend the entry is deleted in all readonly operations.

lfs_stat, for example, indirectly call lfs_dir_getinfo, which calls lfs_dir_get, which calls lfs_dir_getslice (quite a few layers).

The other things is that, because of how our delete tags work, we need to adjust the ids of unrelated files:

id                       0 1
----------------------------
create a id=0            .
                         |
create b id=1            | .
                         | |
delete a id=0 (pending)  ' | <- not committed yet, need to pretend ids are shifted by one
                          /
                         b

By the way, if both gtag's and lfs->gdisk.tag's id can be 0, it looks like we can get an underflow when subtracting gdiff (id == 1) from gtag (id == 0).

Ah, it looks like you're right, this would break the code. There should probably be an assert that lfs_tag_id(lfs->gdisk.tag) != lfs_tag_id(gtag). Feel free to create a PR if you'd like.

Fortunately we just never happen to call lfs_dir_getslice when lfs_tag_id(lfs->gdisk.tag) == lfs_tag_id(gtag). The reason is we lookup filenames => id in a different function, lfs_dir_fetchmatch, and this function has a different bit logic that marks the sign bit to indicate the id was "not found":

https://github.com/littlefs-project/littlefs/blob/3513ff1afc1d67adb2e6f492f0b9bc0d798fcb0d/lfs.c#L1322-L1330

Upper layers return LFS_ERR_NOENT, or do some other action, so we should never actually call lfs_get_getslice on an id that has been removed.

The reason for doing this lookup in lfs_dir_fetchmatch is that it's cheaper in terms of both code and runtime ($O(n)$ vs $O(n \log n)$ binary search).

andriyndev commented 5 months ago

Thank you!

andriyndev commented 5 months ago

@geky One more question... In lfs_dir_fetchmatch():

            // directory modification tags?
            if (lfs_tag_type1(tag) == LFS_TYPE_NAME) {
                // increase count of files if necessary
                if (lfs_tag_id(tag) >= tempcount) {
                    tempcount = lfs_tag_id(tag) + 1;
                }
            } else if (lfs_tag_type1(tag) == LFS_TYPE_SPLICE) {

I wonder if it's OK if lfs_tag_id(tag) is larger than tempcount? As far as I understand, in this situation we'll get ids which are not related to any file so far between the old tempcount and the lfs_tag_id(tag), and this situation seems invalid.

andriyndev commented 5 months ago

And also...

        // found tag? or found best id?
        if (id) {
            *id = lfs_min(lfs_tag_id(besttag), dir->count);
        }

In which situations lfs_tag_id(besttag) can be larger than dir->count?

andriyndev commented 5 months ago

Fortunately we just never happen to call lfs_dir_getslice when lfs_tag_id(lfs->gdisk.tag) == lfs_tag_id(gtag)

What about lfs_fs_traverse_()?

        // iterate through ids in directory
        int err = lfs_dir_fetch(lfs, &dir, dir.tail);
        if (err) {
            return err;
        }

        for (uint16_t id = 0; id < dir.count; id++) {
            struct lfs_ctz ctz;
            lfs_stag_t tag = lfs_dir_get(lfs, &dir, LFS_MKTAG(0x700, 0x3ff, 0),
                    LFS_MKTAG(LFS_TYPE_STRUCT, id, sizeof(ctz)), &ctz);

andriyndev commented 5 months ago

By the way, what does dir->erased mean?

geky commented 5 months ago

Fortunately we just never happen to call lfs_dir_getslice when lfs_tag_id(lfs->gdisk.tag) == lfs_tag_id(gtag)

What about lfs_fs_traverse_()?

Oh! I think you're right. This is reachable.

I did some more investigation on the current behavior, and if this needs an urgent fix. I don't think it's urgent, but I went ahead and put up a PR so this will be fixed on the next patch release: https://github.com/littlefs-project/littlefs/pull/937

Feel free to comment on https://github.com/littlefs-project/littlefs/pull/937 if I missed anything related to this bug.

Will respond to the other questions in a separate comment.

geky commented 5 months ago

I wonder if it's OK if lfs_tag_id(tag) is larger than tempcount? As far as I understand, in this situation we'll get ids which are not related to any file so far between the old tempcount and the lfs_tag_id(tag), and this situation seems invalid.

Ah, so key thing is that LFS_TYPE_CREATE is optional. If a tag's id happens to be beyond the mdir's current size, the mdir is implicitly expanded.

The motivation for this is to simplify mdir compaction. mdir compaction (lfs_dir_compact) writes out tags in the order they are found in the log, which is arbitrary. It would be difficult to model this with CREATE/DELETE tags, since you would need to know what ids have already been written out, which would require unbounded memory.

That or increasing mdir compaction from $O(n^2) \rightarrow O(n^3)$, but $O(n^2)$ is already a problem...

Implicitly creating ids simplifies things, at the cost of more quirky behavior.

       // found tag? or found best id?
       if (id) {
           *id = lfs_min(lfs_tag_id(besttag), dir->count);
       }
In which situations lfs_tag_id(besttag) can be larger than dir->count?

I think this is just to map the default id, 0x3ff, to a reasonable value if we find no ids >= the requested tag.

besttag = -1 is used here to set all bits to one, lfs_tag_id(-1) => 0x3ff since all bits are set, and then lfs_min(0x3ff, dir->count) => dir->count if no other ids are found.

By the way, what does dir->erased mean?

dir->erased indicates if the remainder of the mdir is erased.

Our model of flash requires two steps:

A very large erase operation (block_size).
Much smaller prog operations (prog_size).

The metadata logs work with this model by incrementally writing to the smaller prog chunks inside a large erase block. But just because we have a valid log that only uses part of an erase block, it doesn't mean the rest of the block is necessarily erased.

The main problem is failed commits. If we try to commit to a log, but power is lost halfway through, we fallback to the previous valid state thanks to the checksums in our commits. But this leaves the remainder of the log partially written. Attempting to append a new commit will likely result in data corruption/retention issues.

So during lfs_dir_fetch littlefs needs to prove that the remainder of the log is erased before it can use it. This is where the FCRC tag comes in, and whether or not it's safe to continue writing to the log is stored in dir->erased.

andriyndev commented 5 months ago

Thank you!

littlefs-project / littlefs

lfs_dir_getslice(): why do we subtract gdiff from gtag? #920