littlefs-project / littlefs

A little fail-safe filesystem designed for microcontrollers
BSD 3-Clause "New" or "Revised" License

v2.7 : long delay in open/close for spi nor #965

Open matthieu-c-tagheuer opened 7 months ago

matthieu-c-tagheuer commented 7 months ago

Hi,

I started doing some tests on littlefs. I have a simple test that does mkdir("flash"), then for i = 0; i < 100; i++: remove("flash/test%d", i), open("flash/test%d", i), write(4k), close().
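
In littlefs API terms, the test is roughly this (a sketch, error handling omitted):

#include <stdio.h>
#include <stdint.h>
#include "lfs.h"

// recreate and rewrite 100 files of 4 KiB in one directory
// (lfs is an already-mounted lfs_t)
void run_test(lfs_t *lfs) {
    uint8_t buf[4096] = {0};
    lfs_mkdir(lfs, "flash");
    for (int i = 0; i < 100; i++) {
        char path[32];
        snprintf(path, sizeof(path), "flash/test%d", i);
        lfs_remove(lfs, path);

        lfs_file_t file;
        lfs_file_open(lfs, &file, path, LFS_O_WRONLY | LFS_O_CREAT);
        lfs_file_write(lfs, &file, buf, sizeof(buf));
        lfs_file_close(lfs, &file);
    }
}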

For some open/close calls, I see delays of several seconds. I enabled tracing in the read/write/erase callbacks (see attached trace: log4.txt).

In the cases with long delays, there are lots of 16 B reads: around 26780 reads (see attached long_close.txt)!

For example

Flash Write size 16 loc 9875456
Flash1 Read size 16 loc 9875456
Flash1 Read size 16 loc 9502720
Flash1 Read size 16 loc 9502736
Flash1 Read size 16 loc 9502752
Flash1 Read size 16 loc 9502768
Flash1 Read size 16 loc 9502784
Flash1 Read size 16 loc 9502800
Flash1 Read size 16 loc 9502816
Flash1 Read size 16 loc 9502832
Flash1 Read size 16 loc 9502848
Flash1 Read size 16 loc 9502864

Where do these reads come from? Is it because there are too many files in a directory?

Config :

const struct lfs_config lfs_cfg = {
    // block device operations
    .read  = lfs_spi_flash_read,
    .prog  = lfs_spi_flash_prog,
    .erase = lfs_spi_flash_erase,
    .sync  = lfs_spi_flash_sync,

    // block device configuration
    .read_size = 16,
    .prog_size = 16,
    .block_size = 4096,
    .block_count = (8*1024),
    .cache_size = 1024,
    .lookahead_size = 1024,
    .block_cycles = 500,

    .lock = lfs_lock,
    .unlock = lfs_unlock,
};

This is a SPI NOR flash, and the 26780 reads are not exactly fast. Would it be possible to merge contiguous reads into one SPI transaction?
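
For reference, each call to our read callback is already a single SPI transaction, so merging could only come from littlefs asking for more bytes per call. Roughly (a simplified sketch; spi_transfer and the 0x13 read command stand in for our actual driver):

#include <stddef.h>
#include <stdint.h>
#include "lfs.h"

// assumed HAL: clocks cmd out, then reads rx_len bytes, one chip-select
extern int spi_transfer(const uint8_t *cmd, size_t cmd_len,
        void *rx, size_t rx_len);

int lfs_spi_flash_read(const struct lfs_config *cfg,
        lfs_block_t block, lfs_off_t off, void *buffer, lfs_size_t size) {
    uint32_t addr = block * cfg->block_size + off;
    // 0x13 = read with a 4-byte address (the part is 32 MiB); however
    // many bytes littlefs asks for, it is one bus transaction
    uint8_t cmd[5] = { 0x13, (uint8_t)(addr >> 24), (uint8_t)(addr >> 16),
            (uint8_t)(addr >> 8), (uint8_t)addr };
    return spi_transfer(cmd, sizeof(cmd), buffer, size) == 0
            ? LFS_ERR_OK : LFS_ERR_IO;
}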

geky commented 7 months ago

Hi @matthieu-c-tagheuer, thanks for creating an issue.

All of these reads look like they're in the same block, so this is probably related to https://github.com/littlefs-project/littlefs/issues/783.

Long story short, littlefs's metadata compaction algorithm is $O(n^2)$, which is not good. Fixing this requires some invasive low-level changes. The work is ongoing, but it will take some time before it is usable.

There are some workarounds you can try:

  1. Increasing cache_size such that cache_size = block_size may allow littlefs to keep the mdirs in RAM after fetch.

    Or it may not, due to cache contention. This hasn't been explored all that well, but may be worth a try (see the config sketch after this list).

  2. If the number of separate bus transactions is a problem, increasing read_size may reduce the number. Though this will make each transaction larger.

    Internally, littlefs has a cache hint system that tries to maximize transactions, but it's not perfect. A perfect transaction maximizer quickly turns into a full-on branch predictor.

  3. Worst case, you can set metadata_max to force littlefs to not use the full block for metadata:

    https://github.com/littlefs-project/littlefs/blob/6a53d76e90af33f0656333c1db09bd337fa75d23/lfs.h#L261-L265

    It's a heavy-handed solution, but it can be useful if the performance is worth the wasted storage.
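
For example, relative to the config in this issue, the knobs would look something like this (values illustrative, not tuned):

const struct lfs_config lfs_cfg = {
    // block device operations as before...

    // block device configuration
    .read_size = 16,        // workaround 2: raising this (e.g. to 256)
                            // batches more bytes per bus transaction
    .prog_size = 16,
    .block_size = 4096,
    .block_count = (8*1024),
    .cache_size = 4096,     // workaround 1: cache_size = block_size so
                            // fetched mdirs may stay in RAM
    .lookahead_size = 1024,
    .block_cycles = 500,
    .metadata_max = 1024,   // workaround 3: compact mdirs at 1 KiB
                            // instead of the full 4 KiB block
};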

matthieu-c-tagheuer commented 7 months ago

Hi @geky

I will try your suggestion.

I managed to reproduce it on a PC with littlefs-2.9.1 and rambd.

What I see is that we have lots of calls with hint=4. These don't use the cache.

#1  0x0000555555555b0d in lfs_rambd_read (cfg=0x555555562d60 <lfs_cfg>, block=8097, off=0, buffer=0x555555565ed0, size=16) at pp/lfs_rambd.c:87
#2  0x0000555555556240 in lfs_bd_read (lfs=lfs@entry=0x5555555630c0 <lfs>, pcache=pcache@entry=0x0, rcache=rcache@entry=0x5555555630c0 <lfs>, hint=hint@entry=4, 
    block=8097, off=<optimized out>, off@entry=4, buffer=<optimized out>, size=<optimized out>) at lfs.c:117
#3  0x000055555555671a in lfs_dir_traverse (lfs=lfs@entry=0x5555555630c0 <lfs>, dir=dir@entry=0x7fffffffdd40, off=4, off@entry=0, ptag=ptag@entry=4294967295, 
    attrs=attrs@entry=0x7fffffffddf0, attrcount=attrcount@entry=2, tmask=1074789376, begin=0, end=51, diff=0, cb=0x555555555f6f <lfs_dir_commit_size>, data=0x7fffffffdc98, 
    ttag=0) at lfs.c:922
#4  0x0000555555559884 in lfs_dir_splittingcompact (begin=0, end=51, source=0x7fffffffdd40, attrcount=2, attrs=0x7fffffffddf0, dir=0x7fffffffdd40, lfs=0x5555555630c0 <lfs>)
    at lfs.c:2112
#5  lfs_dir_relocatingcommit (lfs=lfs@entry=0x5555555630c0 <lfs>, dir=dir@entry=0x7fffffffdd40, pair=pair@entry=0x7fffffffde9c, attrs=attrs@entry=0x7fffffffddf0, 
    attrcount=attrcount@entry=2, pdir=pdir@entry=0x7fffffffdd60) at lfs.c:2319
#6  0x0000555555559d70 in lfs_dir_orphaningcommit (lfs=lfs@entry=0x5555555630c0 <lfs>, dir=0x7fffffffde9c, attrs=0x7fffffffddf0, attrcount=2) at lfs.c:2400
#7  0x000055555555a524 in lfs_dir_commit (lfs=0x5555555630c0 <lfs>, dir=<optimized out>, attrs=<optimized out>, attrcount=<optimized out>) at lfs.c:2572
#8  0x000055555555a80d in lfs_file_sync_ (lfs=lfs@entry=0x5555555630c0 <lfs>, file=file@entry=0x7fffffffde90) at lfs.c:3439
#9  0x000055555555a83e in lfs_file_close_ (lfs=lfs@entry=0x5555555630c0 <lfs>, file=0x7fffffffde90) at lfs.c:3211
#10 0x000055555555be33 in lfs_file_close (lfs=0x5555555630c0 <lfs>, file=0x7fffffffde90) at lfs.c:6096

There are also some cases where we parse the flash in reverse order (the hint is also small):

#0  lfs_bd_read (lfs=lfs@entry=0x5555555630c0 <lfs>, pcache=pcache@entry=0x0, rcache=rcache@entry=0x5555555630c0 <lfs>, hint=hint@entry=4, block=8097, off=off@entry=4050, 
    buffer=0x7fffffffdb0c, size=4) at lfs.c:112
#1  0x00005555555563c2 in lfs_dir_getslice (lfs=lfs@entry=0x5555555630c0 <lfs>, dir=dir@entry=0x7fffffffdd40, gmask=gmask@entry=2146435072, gtag=gtag@entry=2146435084, 
    goff=goff@entry=0, gbuffer=gbuffer@entry=0x7fffffffdb74, gsize=12) at lfs.c:729
#2  0x00005555555564dc in lfs_dir_get (lfs=lfs@entry=0x5555555630c0 <lfs>, dir=dir@entry=0x7fffffffdd40, gmask=gmask@entry=2146435072, gtag=gtag@entry=2146435084, 
    buffer=buffer@entry=0x7fffffffdb74) at lfs.c:775
#3  0x0000555555556501 in lfs_dir_getgstate (lfs=lfs@entry=0x5555555630c0 <lfs>, dir=dir@entry=0x7fffffffdd40, gstate=gstate@entry=0x7fffffffdbcc) at lfs.c:1387
#4  0x0000555555559380 in lfs_dir_compact (lfs=lfs@entry=0x5555555630c0 <lfs>, dir=dir@entry=0x7fffffffdd40, attrs=attrs@entry=0x7fffffffddf0, attrcount=attrcount@entry=2, 
    source=source@entry=0x7fffffffdd40, begin=begin@entry=0, end=51) at lfs.c:2028
#5  0x0000555555559aa3 in lfs_dir_splittingcompact (begin=0, end=51, source=0x7fffffffdd40, attrcount=2, attrs=0x7fffffffddf0, dir=0x7fffffffdd40, lfs=0x5555555630c0 <lfs>)
    at lfs.c:2201
#6  lfs_dir_relocatingcommit (lfs=lfs@entry=0x5555555630c0 <lfs>, dir=dir@entry=0x7fffffffdd40, pair=pair@entry=0x7fffffffde9c, attrs=attrs@entry=0x7fffffffddf0, 
    attrcount=attrcount@entry=2, pdir=pdir@entry=0x7fffffffdd60) at lfs.c:2319
#7  0x0000555555559db3 in lfs_dir_orphaningcommit (lfs=lfs@entry=0x5555555630c0 <lfs>, dir=0x7fffffffde9c, attrs=0x7fffffffddf0, attrcount=2) at lfs.c:2400
#8  0x000055555555a567 in lfs_dir_commit (lfs=0x5555555630c0 <lfs>, dir=<optimized out>, attrs=<optimized out>, attrcount=<optimized out>) at lfs.c:2572
#9  0x000055555555a850 in lfs_file_sync_ (lfs=lfs@entry=0x5555555630c0 <lfs>, file=file@entry=0x7fffffffde90) at lfs.c:3439
#10 0x000055555555a881 in lfs_file_close_ (lfs=lfs@entry=0x5555555630c0 <lfs>, file=0x7fffffffde90) at lfs.c:3211
#11 0x000055555555be76 in lfs_file_close (lfs=0x5555555630c0 <lfs>, file=0x7fffffffde90) at lfs.c:6096

The code doing this (in lfs_dir_getslice) iterates over the dir block backwards:

    // iterate over dir block backwards (for faster lookups)
    while (off >= sizeof(lfs_tag_t) + lfs_tag_dsize(ntag)) {
        off -= lfs_tag_dsize(ntag);
        lfs_tag_t tag = ntag;
        int err = lfs_bd_read(lfs,
                NULL, &lfs->rcache, sizeof(ntag),
                dir->pair[0], off, &ntag, sizeof(ntag));

When iterating over the tags, are big tags (tags with attached data) and small tags mixed together? If we know that we mostly have small tags, could we increase the hint size when iterating?

And when reading in reverse order, we could also use the cache. For example, pass a negative hint that tells the cache not to fill from block:offset to block:offset + hint, but instead from block:offset - hint + size to block:offset + size, so that the next call hits the cache.
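
A sketch of the window computation (hypothetical, just to illustrate the idea; this is not littlefs's actual cache code):

#include <stdint.h>

// where the read-cache fill window would start for a read of `size`
// bytes at `off`, given a signed hint. hint >= 0 caches [off, off+hint)
// as today; hint < 0 caches the bytes *behind* the read,
// [off + size + hint, off + size), so a backwards scan hits the cache
uint32_t cache_fill_start(uint32_t off, uint32_t size, int32_t hint) {
    if (hint >= 0) {
        return off;
    }
    uint32_t behind = (uint32_t)(-(int64_t)hint);
    return (off + size > behind) ? (off + size - behind) : 0;
}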

geky commented 7 months ago

When iterating over the tags, are big tags (tags with attached data) and small tags mixed together?

Sorry if I don't quite understand. I think you're asking if our tags are very different in size? (oh that is hard to phrase).

The problem is we don't really know; it depends on exactly what metadata gets stored.

Worst case, files can be inlined, which results in rather large tags:

00000000: tt tt tt tt 6c 6f 67 2e 74 78 74 tt tt tt tt 54  ....log.txt....T
00000010: 68 65 20 63 6f 6e 74 65 6e 74 73 20 6f 66 20 79  he contents of y
00000020: 6f 75 72 20 6c 6f 67 20 66 69 6c 65 0a           our log file.

This leads to a tricky tradeoff when it comes to how much to read. On one hand, hint=4 means we won't read data we won't end up using, which is optimal if bus transactions are cheap. On the other hand, as you note, we miss opportunities to combine multiple tag reads into a single transaction.

Part of the problem is: if we don't read with hint=4, what do we set hint to? hint=cache_size would mean increasing cache_size can potentially harm performance by reading more than we need, which may confuse users.

But this is an interesting area to explore; maybe exposing a hint_size in the lfs_config could be useful. Though optimizing at the caching level has been low priority compared to other levels.

If we know that we mostly have small tags, could we increase the hint size when iterating?

This can be controlled by read_size. read_size effectively controls the minimum read transaction.

The hint just tells littlefs to increase the read transaction if we know we will need more data eventually.
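
A rough illustration with the numbers from this thread (the exact rounding in lfs_bd_read is a bit more involved):

// a 4-byte tag read through the read cache:
// lfs_bd_read(lfs, NULL, &lfs->rcache, hint, block, off, &tag, 4);
//
// hint=4,    read_size=16 -> one 16-byte transaction: the request is
//                            rounded up to the minimum read granularity
// hint=1024, read_size=16 -> one transaction of up to cache_size bytes,
//                            so nearby tag reads hit the cache afterwards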

Actually, one question about the hypothetical hint_size: what would it add that can't be accomplished by increasing read_size? I think read_size may be sufficient for tuning storage with high bus-transaction costs.

And when reading in reverse order, we could also use the cache. For example, pass a negative hint that tells the cache not to fill from block:offset to block:offset + hint, but instead from block:offset - hint + size to block:offset + size, so that the next call hits the cache.

This is quite a neat idea.

Unfortunately there's quite a bit of work in progress on this data structure in experimental branches. The good news is hopefully we will eventually be able to reduce the $O(n^2)$ compaction down to $O(n \log n)$. The bad news is this will become much less cache friendly...

Instead of unconditionally traversing backwards, the planned design involves conditional branches backwards. This is where a perfect hint system starts to look like a perfect branch predictor...

matthieu-c-tagheuer commented 7 months ago

Hi,

Thanks,

I did some experiments.

First, the metadata should fit in the cache, to avoid too many cache misses.

--- a/lfs.c
+++ b/lfs.c
@@ -934,7 +934,7 @@ static int lfs_dir_traverse(lfs_t *lfs,
             if (off+lfs_tag_dsize(ptag) < dir->off) {
                 off += lfs_tag_dsize(ptag);
                 int err = lfs_bd_read(lfs,
-                        NULL, &lfs->rcache, sizeof(tag),
+                        NULL, &lfs->rcache, lfs->cfg->block_size,
                         dir->pair[0], off, &tag, sizeof(tag));
                 if (err) {
                     return err;

This improved caching.

I implemented the reverse caching; it works for some cases.

Also, read_size should be small (4) to avoid trashing the cache: there are cases where, if read_size is not 4 bytes, we use the cache buffer as an intermediate buffer and trash it. For systems with a bigger read-granularity constraint, a small intermediate buffer separate from the cache could be good.

I will provide more info and patches tomorrow.

Part of the problem is: if we don't read with hint=4, what do we set hint to? hint=cache_size would mean increasing cache_size can potentially harm performance by reading more than we need, which may confuse users.

We already have the same problem in other places in the code. I have a patch to print some cache stats. In lfs_dir_fetchmatch [1] we force the hint to cache_size, but my scenario uses only a small part of this cache.

But this is an interesting area to explore; maybe exposing a hint_size in the lfs_config could be useful. Though optimizing at the caching level has been low priority compared to other levels.

Yes, it could be interesting to have both cache_size and cache_fill_size. And it could be possible to fill the cache in several steps.

[1]

    // extract next tag
    lfs_tag_t tag;
    off += lfs_tag_dsize(ptag);
    int err = lfs_bd_read(lfs,
            NULL, &lfs->rcache, lfs->cfg->block_size,
            dir->pair[0], off, &tag, sizeof(tag));
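
To illustrate the cache_size/cache_fill_size idea above (a sketch with assumed names; this is not an existing littlefs option):

#include <stdint.h>

// how many bytes a cache miss would pull in; 0 keeps the old behaviour
// of filling up to the whole cache
uint32_t cache_fill(uint32_t hint, uint32_t read_size,
        uint32_t cache_size, uint32_t cache_fill_size) {
    // round the hint up to the minimum read transaction
    uint32_t fill = (hint + read_size - 1) / read_size * read_size;
    uint32_t limit = cache_fill_size ? cache_fill_size : cache_size;
    return fill < limit ? fill : limit;
}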

geky commented 7 months ago

Other question: for big writes, why does the data go through the cache instead of being written directly? I have a patch for that.

It should be bypassing the cache if the incoming buffer is larger than the cache. Though there may be opportunities where this is missed. Feel free to fork and create a pull request.

Also, why is validation not optional? Do we really want to always read back and compare the data we write?

Really just because the config system needs work. Validating is a slightly better default because it allows us to detect/recover from write errors.

There's nothing stopping us from making validation optional, but it would require either an additional #ifdef or a bool/flags in lfs_config.

There are also plans for additional validation options, such as metadata/data checksumming, so making write validation optional has been low-priority since it will probably be easier to introduce this with other related API changes.
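
A rough sketch of what the #ifdef route could look like (LFS_NO_VALIDATE and the helper are hypothetical names, not part of littlefs):

#include <stdint.h>
#include <string.h>
#include "lfs.h"

// program, then (by default) read back and compare; littlefs guarantees
// prog sizes are multiples of prog_size
static int prog_and_maybe_validate(const struct lfs_config *cfg,
        lfs_block_t block, lfs_off_t off,
        const uint8_t *buffer, lfs_size_t size) {
    int err = cfg->prog(cfg, block, off, buffer, size);
    if (err) {
        return err;
    }
#ifndef LFS_NO_VALIDATE
    for (lfs_off_t i = 0; i < size; i += cfg->prog_size) {
        uint8_t check[64]; // assumes prog_size <= 64 for this sketch
        err = cfg->read(cfg, block, off + i, check, cfg->prog_size);
        if (err) {
            return err;
        }
        if (memcmp(check, &buffer[i], cfg->prog_size) != 0) {
            return LFS_ERR_CORRUPT; // lets the caller relocate the block
        }
    }
#endif
    return 0;
}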

Why don't we have separate cache sizes for the read/write/file caches? The write cache could be smaller than the read cache.

This has been requested a couple times. It makes sense to separate these cache sizes.

The main issue is backwards compatibility. It would be tricky to implement separate cache sizes without breaking the current API. There are also other API-breaking changes in the works right now, and breaking the API once is better than breaking the API twice.

In lfs_dir_fetchmatch [1] we force the hint to cache_size

lfs_dir_fetchmatch is notably different from lfs_dir_traverse in that we need to checksum the metadata, so we will end up needing to read everything. It is interesting that this goes unused if the mdir is mostly empty, but I'm not sure what an alternative strategy would look like.

Yes, it could be interesting to have both cache_size and cache_fill_size.

Is this significantly better than increasing read_size? There may be some cases where you can benefit from more flexible alignment, but I'm not sure that will really be measurable...


Something else that should be mentioned: these caching tradeoffs are all being made with code/RAM size in mind. Caching logic can get very complicated, and even if it results in performance benefits, it might not be the right tradeoff for littlefs.

Just something to keep in mind. make summary can show the current costs quickly, and make size, make stack, and make structs have function-level info.

The cost can be ignored if we introduce additional caching logic behind ifdefs, but we should probably limit these options to documentable caching strategies that users can understand, if that makes sense.

matthieu-c-tagheuer commented 7 months ago

Hi,

Here is my PR with my current patches: https://github.com/littlefs-project/littlefs/pull/977

Some may need some cleaning and improvement; others may be specialized for our use case.

before patches

                            code    data   stack structs
TOTAL                      26212       -    2192     948

after patches (LFS_NO_CHECK_DATA enabled)

                            code    data   stack structs
TOTAL                      26283       -    2160     964

Is this significantly better than increasing read_size? There may be some cases where you can benefit from more flexible alignment, but I'm not sure that will really be measurable...

read_size = 8 vs read_size = 16

total read=14740616 write=4287328
total num read=40923 write=4729

vs

total read=15094160 write=4287328
total num read=40965 write=4729

With read_size = 16, the cache helps less (40965 direct flash reads vs 40923) and the volume of raw flash reads is bigger (15094160 vs 14740616). But the difference is small.

With read_size = 256 and the hint not modified in the original code

total read=42766592 write=4287328
total num read=111973 write=7732

With read_size = 4 and the [lfs_dir_traverse: better try to use cache] / [lfs_dir_getslice: reverse caching] patches

total read=18852728 write=4287328
total num read=45678 write=7732

matthieu-c-tagheuer commented 7 months ago

Hi, I have updated my PR, and the tests now pass.

I removed the patch to bypass validation. Even as an option, it failed the badblock tests when enabled: silent write/erase failures are not handled well.

Read errors also fail the tests. I think this is because the tests' read errors were triggered during writes (during write validation). Now they are triggered in other places in the code and break the tests.

I also removed the patch to have different sizes for the caches; it breaks test_compat. I'm not sure I understand how this test works. Does it check binary compatibility of the structs?

The patches add 2 new entries to lfs_config, but if they are 0, they default to the old behaviour.

These 2 patches are on https://github.com/matthieu-c-tagheuer/littlefs/commits/performance2/

For the size comparison, are ASSERTs used?

geky commented 6 months ago

Hi @matthieu-c-tagheuer, sorry for the late response.

With read_size = 256 and the hint not modified in the original code

Is this still with cache_size = 1024? To be comparable you probably need read_size = cache_size. Though it may not actually matter.

I'm also curious about lfs_dir_traverse changes vs reverse-caching changes. I don't think reverse-caching changes will survive other ongoing work unfortunately...

I removed the patch to bypass validation. Even as an option, it failed the badblock tests when enabled: silent write/erase failures are not handled well.

The validation step is how we detect write errors, which is what test_badblocks tests, so that makes sense.

This test could be made optional if validation is disabled. It would just be up to users to decide if the performance-vs-write-error-detection tradeoff is worth it.

I also removed the patch to have different sizes for the caches; it breaks test_compat. I'm not sure I understand how this test works. Does it check binary compatibility of the structs?

It sounds like this may be an actual bug. Currently littlefs assumes all caches are the same size, and there is some hackiness where it relies on this (mainly here).

The test_compat tests check for backwards compatibility with the previous version. It's possible that, because of the difference in cache sizes between the two versions, different code is being executed.

For the size comparison, are ASSERTs used?

Sorry I'm not sure I understood this question. Is there an assert triggering that is confusing?


I was thinking about how best to add additional caching features to littlefs. One thought is, could this be introduced as an optional block device?

So if a user wants to use cache_fill_size, they could wrap their block device in a sort of lfs_cachingbd_t:

my_nor_flash_t my_nor_bd;

lfs_cachingbd_t cachingbd;
struct lfs_cachingbd_config cachingbd_cfg = {
    .context = &my_nor_bd,
    .read = my_nor_flash_read,
    .prog = my_nor_flash_prog,
    ...
};
lfs_cachingbd_create(&cachingbd, &cachingbd_cfg) => 0;

// littlefs now talks to the caching layer, which talks to the NOR flash
lfs_t lfs;
struct lfs_config lfs_cfg = {
    .context = &cachingbd,
    .read = lfs_cachingbd_read,
    .prog = lfs_cachingbd_prog,
    ...
};
lfs_mount(&lfs, &lfs_cfg) => 0;

I realize this would add quite a bit of work, but the result would be quite flexible, allowing users to decide if they want this level of caching, or to implement their own caching strategies.

Unfortunately the current block device layer will probably make this more annoying than it needs to be. The bd API needs to be revisited...

Thoughts?