littlefs-project / littlefs

A little fail-safe filesystem designed for microcontrollers
BSD 3-Clause "New" or "Revised" License

Bypassing cache is a problem for cases which require aligned buffers #158

Open FreddieChopin opened 5 years ago

FreddieChopin commented 5 years ago

In littlefs-v1 there are two cases where cache buffers can be bypassed: https://github.com/ARMmbed/littlefs/blob/d3a2cf48d449f259bc15a3a1058323132f3eeef7/lfs.c#L48 https://github.com/ARMmbed/littlefs/blob/d3a2cf48d449f259bc15a3a1058323132f3eeef7/lfs.c#L187

In littlefs-v2 only one is left: https://github.com/ARMmbed/littlefs/blob/4ad09d6c4ec95c126909b905900237cdc829c6b0/lfs.c#L85

This is not an issue if the underlying device doesn't care much about alignment, but it is fatal for devices which actually do. For example, on STM32, if the SDMMC peripheral is used with DMA, it requires buffers aligned to 16 bytes. If the chip is an STM32F7 or another chip with caches, the required alignment increases to 32 bytes. My littlefs port tries to take this into account by manually providing properly aligned read and program buffers in the filesystem object, and also a properly aligned per-file buffer (I'm opening files with lfs_file_opencfg()). However, in the above-mentioned cases of buffer bypassing, littlefs passes the buffer it got "externally" directly to the underlying device. That buffer may be properly aligned, but only by pure coincidence.
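For illustration, a minimal sketch of this approach (the sizes are placeholders and the field names are those of the littlefs v2 lfs_config / lfs_file_config API, not my actual port):

#include "lfs.h"

// 32-byte aligned buffers, matching the STM32F7 cache-line requirement above;
// all sizes here are placeholders
__attribute__((aligned(32))) static uint8_t read_buffer[512];
__attribute__((aligned(32))) static uint8_t prog_buffer[512];
__attribute__((aligned(32))) static uint8_t lookahead_buffer[64];
__attribute__((aligned(32))) static uint8_t file_buffer[512];

static lfs_t lfs;
static const struct lfs_config cfg = {
    // ... block device callbacks (.read, .prog, .erase, .sync) and geometry ...
    .read_size = 512,
    .prog_size = 512,
    .cache_size = 512,
    .lookahead_size = 64,
    .read_buffer = read_buffer,
    .prog_buffer = prog_buffer,
    .lookahead_buffer = lookahead_buffer,
};

// the per-file cache is passed through lfs_file_opencfg()
static const struct lfs_file_config file_cfg = {
    .buffer = file_buffer,
};

int open_with_aligned_buffer(lfs_file_t *file, const char *path) {
    return lfs_file_opencfg(&lfs, file, path, LFS_O_RDWR | LFS_O_CREAT, &file_cfg);
}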

I think that requiring the user to deal with alignment is not an option when the filesystem already provides caching/buffering (unless the checks for bypassing it also include the required buffer alignment). It may not be such a big issue if you use littlefs directly, but after wrapping it in a stdio FILE you lose most control over the internal FILE buffering (you could try setvbuf(), but this is not very convenient when dealing with something as high-level as FILE). Therefore I think that all cases of cache bypassing should be removed. It is hard to deal with this issue in the underlying device, because it would require yet another level of buffering/caching, which seems like a waste of RAM, code and speed.

rbphilip commented 5 years ago

I wondered about this cache bypass myself just yesterday. My underlying hardware doesn't care, but I was surprised to get a programming request for 4 KB when .prog_size was set to 128. Fortunately my underlying I/O breaks writes and reads up into multiple appropriately sized physical I/Os.

I'm guessing that the cache bypass is for improving performance & reducing copying of data and that's sometimes valid.

Did your second commit resolve your lfs_cache_prog assertion?

FreddieChopin commented 5 years ago

Did your second commit resolve your lfs_cache_prog assertion?

No, both of them are the same and just remove cache bypassing. I'm not sure about the nature of the assert; it seems to me that removing an optional cache bypass should not alter the behaviour of the code in a way that would trigger an assert, so it may be a completely unrelated issue. Then again, it may be related to #161 - I did not investigate that much. It looks as if the code just does not deal properly with the case where the cache holds some data which was not written (because there was an underlying I/O problem) and the user tries to write some more, to a different block. It's best for @geky to share his opinion here, but he's probably extremely busy with something else, as there has not been much going on with littlefs for the last ~5-6 months.

rbphilip commented 5 years ago

OK. Thanks. You've obviously got a lot more history with LittleFS. I have it working well enough on my board, but I'm a little concerned with it being a piece of open source software transitioning to a new version without any obvious development going on. It seems to work, but the developer is very quiet. V2 is in "alpha" but the few entries I've seen asking about it go unanswered.

Trying to figure out if it's ultimately better for my customer to pay $7K for a commercial product with a development team actively supporting it.

FreddieChopin commented 5 years ago

You've obviously got a lot more history with LittleFS.

Risky statement (; I wouldn't describe myself that way (;

being a piece of open source software transitioning to a new version without any obvious development going on. It seems to work, but the developer is very quiet. V2 is in "alpha" but the few entries I've seen asking about it go unanswered.

Unfortunately, in my private opinion the whole v2 pull-request is an example of "how not to develop software". A branch which brings ~150 commits to a repo with ~170 commits (an almost 100% increase), introduces dozens of new features (most of them not very related to each other), and has been developed and rebased for over a year - this just cannot end well, because of the amount of work required to maintain such an enormous branch... Possibly this is because littlefs was declared "stable" too early and now @geky is struggling with backwards compatibility and a migration path - otherwise it would be v2.0 already, while the system is not really usable on anything with more than a couple of MB of capacity.

rbphilip commented 5 years ago

I'm wanting something that will be robust against power failures and deal with 128MB of SPI-connected NOR flash.

LittleFS v1 seems to work well enough, but I've not been able to exercise it extensively. Is v1 actually stable and reasonably bug-free, or not?

FreddieChopin commented 5 years ago

128MB of SPI-connected NOR flash

If you fill this chip so that it has something like 10% of its capacity used, then any scan for free blocks by the allocator (at least the first scan after mount, even if your lookahead covers the whole chip) will take almost forever, as littlefs will actually read EVERYTHING that you have stored in this memory... That is why I'm nagging @geky so much (in many different issue reports, for example #75, but not only) to implement a different allocator - the current one (which scans when mounted and when out of free blocks) just does not scale to anything bigger than a couple of MB.

Apart from that, littlefs works fine for me, but I have not tested it that much either - mostly because:

rbphilip commented 5 years ago

That's good to know. Thanks for the heads-up. I guess it's time to look at commercial software.

geky commented 5 years ago

Hello! I guess this is a good issue to end my drought on :)

he's probably extremely busy with something else, as there has not been much going on with littlefs for the last ~5-6 months.

Sorry about that. I had to step away for a bit. Honestly I was getting a bit overwhelmed.

Unfortunately, in my private opinion the whole v2 pull-request is an example of "how not to develop software". A branch which brings ~150 commits to a repo with ~170 commits (an almost 100% increase), introduces dozens of new features (most of them not very related to each other), and has been developed and rebased for over a year - this just cannot end well, because of the amount of work required to maintain such an enormous branch...

I'm taking this as a challenge :)

Everything is almost good to go. Way late, but it is what it is. Also v1 is currently frozen, which may be turning other people away, but at least maintaining the split between v1 and v2 is tractable.

Possibly because littlefs was declared "stable" too early and now @geky is struggling for backwards compatibility and migration path

You're probably right. But, I don't think I'd be here otherwise.


That's good to know. Thanks for the heads-up. I guess it's time to look at commercial software.

I'm curious, what is the prevailing commercial filesystem out there?

You mentioned it up there somewhere: support is absolutely the reason to pay for commercial software. I'm personally a big fan of the free-software, paid-support model, though unfortunately I can't provide it by myself. Mbed does have paid support, but it looks pricier than what you have.


geky commented 5 years ago

If you fill this chip so that it has something like 10% of its capacity used, then any scan for free blocks by the allocator (at least the first scan after mount, even if your lookahead covers the whole chip) will take almost forever, as littlefs will actually read EVERYTHING that you have stored in this memory... That is why I'm nagging @geky so much (in many different issue reports, for example #75, but not only) to implement a different allocator - the current one (which scans when mounted and when out of free blocks) just does not scale to anything bigger than a couple of MB.

The curious thing for me is that so far the allocator issues raised haven't been that bad. I don't know why, maybe it's because most users are on NOR/NAND/internal and have a relatively small amount of storage?

First thing. Don't get me wrong, the allocator needs a lot of work. But, this has been the course of events:

  1. v1 = proof of concept
  2. v2 = small-scale improvements
  3. v3 = large-scale improvements

There has been a lot of pressure for better support at the small scale. This primarily impacts internal flash, though there have also been a lot of improvements for NAND flash.

Unfortunately this has pushed large-scale improvements back quite a bit (allocator and COW data structure).


For NOR the allocator is perfectly fine. SD/eMMC has an unfortunate limitation in that block reads are very large and slow. But for NOR and NAND, traversing the tree is very efficient.

I do have plans to improve this:

  1. allocator improvements: use a segment list, persist the lookahead
  2. file COW structure changes

Unfortunately, all improvements are limited by how fast I can respond to everything.


FreddieChopin commented 5 years ago

Good to have you back and good to hear that v2 is close to being released. Bad that this issue was completely sidetracked (I partly blame myself...) (;

geky commented 5 years ago

Bypassing cache is a problem for cases which require aligned buffers #158

Right! The original reason for this issue...

I wondered about this cache bypass myself just yesterday. My underlying hardware doesn't care, but I was surprised to get a programming request for 4 KB when .prog_size was set to 128. Fortunately my underlying I/O breaks writes and reads up into multiple appropriately sized physical I/Os.

I'm guessing that the cache bypass is for improving performance & reducing copying of data and that's sometimes valid.

Right, so the purpose of the cache bypass is performance.

If a block device can write 4 KB efficiently, why not ask it to? littlefs will ask the block device to program any multiple of the prog size, though always in exact multiples. This can lead to performance improvements if the block device supports burst writes of large blocks. If it doesn't, then in the worst case the block device can split a large write into multiple small writes.
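For illustration, a minimal sketch of that worst case (flash_page_prog() is a hypothetical driver routine, not part of littlefs):

#include <stdint.h>
#include "lfs.h"

// a prog callback that chops a multi-prog_size request back down into
// prog_size chunks for a driver that can only program one page at a time
static int bd_prog_split(const struct lfs_config *c, lfs_block_t block,
        lfs_off_t off, const void *buffer, lfs_size_t size) {
    const uint8_t *data = buffer;
    while (size > 0) {
        // littlefs requests are multiples of prog_size, so this divides evenly
        int err = flash_page_prog(block, off, data, c->prog_size);
        if (err) {
            return err;
        }
        off += c->prog_size;
        data += c->prog_size;
        size -= c->prog_size;
    }
    return 0;
}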


Alignment is a different, and a bit concerning, issue.

I need to dig into this more, but what's to say that the caches are aligned? Caches provided by malloc will hopefully be aligned to roughly 64 bits, but if the user provides the caches, then what?

We may need something different to ensure alignment.

geky commented 5 years ago

Good to have you back and good to hear that v2 is close to being released. Bad that this issue was completely sidetracked (I partly blame myself...) (;

Good to be back :) Thanks for the responses to issues in my absence.

FreddieChopin commented 5 years ago

I need to dig into this more, but what's to say that the caches are aligned? Caches provided by malloc will hopefully be aligned to roughly 64 bits, but if the user provides the caches, then what?

Well, for my case: littlefs is completely wrapped in another class and the caches are provided by this class internally. They are allocated with malloc() and aligned manually. When I need 512 bytes of cache which should be aligned to 16 bytes, but I know that malloc() provides data aligned to 8 bytes, I allocate 512 + margin = 512 + (required-alignment - malloc-alignment) = 512 + (16 - 8) = 520 bytes. With such a larger buffer I can then properly align it manually.

https://github.com/DISTORTEC/distortos/blob/master/source/FileSystem/littlefs/LittlefsFileSystem.cpp#L252
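In code, that trick looks roughly like this (a sketch, not the linked implementation; the constants reflect the 8-byte malloc alignment and 16-byte DMA requirement from the example above):

#include <stdint.h>
#include <stdlib.h>

#define DMA_ALIGNMENT    16
#define MALLOC_ALIGNMENT 8

// over-allocate by the difference, then round the pointer up; for 512 bytes
// this allocates 512 + (16 - 8) = 520 bytes, as in the example above
// (the raw pointer must also be kept somewhere so it can be free()d later)
static void *alloc_dma_aligned(size_t size) {
    uint8_t *raw = malloc(size + (DMA_ALIGNMENT - MALLOC_ALIGNMENT));
    if (raw == NULL) {
        return NULL;
    }
    uintptr_t addr = (uintptr_t)raw;
    return (void *)((addr + DMA_ALIGNMENT - 1) & ~(uintptr_t)(DMA_ALIGNMENT - 1));
}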

The option to leave this special case to the user is the simplest one, because the issue most likely does not affect that many users. Solving it within littlefs is a bit more complex, but possible - it would require the following steps:

A variant of the above: alignment is configured globally (in a #define), which allows the special code paths and additional fields in lfs_t and lfs_file_t to be compiled in only optionally. This requires some sort of __BIGGEST_ALIGNMENT__ to be available; sizeof(double) can no longer be used for this purpose.

Let me know if you prefer this solution, I can try to provide a pull-request for that too.
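A rough sketch of how the compile-time variant might look (LFS_CACHE_ALIGNMENT is hypothetical, not an existing littlefs option; __BIGGEST_ALIGNMENT__ is the GCC/Clang built-in):

#include <stdint.h>
#include <stdbool.h>

// hypothetical compile-time alignment option - not part of littlefs today
#ifndef LFS_CACHE_ALIGNMENT
#define LFS_CACHE_ALIGNMENT __BIGGEST_ALIGNMENT__
#endif

// the cache-bypass condition would additionally require the user's buffer to
// meet this alignment before handing it straight to the block device
static inline bool lfs_buffer_is_aligned(const void *buffer) {
    return ((uintptr_t)buffer % LFS_CACHE_ALIGNMENT) == 0;
}

A port that needs DMA-safe buffers would then build with, for example, -DLFS_CACHE_ALIGNMENT=32.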

rbphilip commented 5 years ago

I'm curious, what is the prevailing commercial filesystem out there?

You mentioned it up there somewhere: support is absolutely the reason to pay for commercial software. I'm personally a big fan of the free-software, paid-support model, though unfortunately I can't provide it by myself. Mbed does have paid support, but it looks pricier than what you have.

I'm unsure if it's the prevailing software, but HCC (in Hungary, I believe) sell a "safe file system". I know people at an RTOS company that sell it to their customers and have had no issues. I'm in the process of porting it to my customer's hardware for testing.

MBED isn't very interesting for a couple of reasons. Cost, for one, but also we're already using FreeRTOS with good success.

apmorton commented 5 years ago

Just the off-topic comments of a passer-by.

I have a fair amount of experience using the HCC USB host stack. Our company paid somewhere north of $12k for the stack, ports, and one or two class drivers.

If their USB stack is anything to judge their other products from, I would personally avoid them.

At least for the USB hardware we are using (based on the synopsys usb ip core) their implementation has several glaring issues that caused us to spend way more time than I would like to admit fixing their own code.

10/10 would not buy again

geky commented 5 years ago

The option to leave this special case to the user is the simplest one, because the issue most likely does not affect that many users.

Agreed. I see your points, there's also the fact that each config option has a cost.

This is convenient timing, as this also cropped up for another user outside of GitHub. It turns out QSPI on the NRF5840 needs buffers that are RAM-backed and 4-byte aligned, both of which were being invalidated by the pass-through.

I'm hesitant, as this sort of thing seems out of bounds for the block device API. This can be handled in the block device with a small stack-allocated RAM buffer, albeit at a performance penalty. But it is so much easier for users to just remove the pass through for these cases. And this is an MCU library after all.

@FreddieChopin Let's go with your PR, thanks for putting it up!

geky commented 5 years ago

@j3hill, FYI

FreddieChopin commented 5 years ago

This can be handled in the block device with a small stack-allocated RAM buffer, albeit at a performance penalty.

Depending on the driver this buffer may not be so small (; My main issue with such an additional buffer - just for this specific case - is that it would be at least a second layer of buffering, or even a third one (if you wrap everything in a FILE*). As I'm already providing buffers manually to littlefs - properly sized and aligned to avoid this problem - the pass-through code just defeats my efforts completely (; If RAM were not such a big concern on MCUs, I would just provide internal buffering in the driver and be done with it, but throwing large stack buffers everywhere, when it's enough to simply align another buffer differently, seems like a waste of memory.

Also, stack-based buffers increase the RAM requirements of all threads which may potentially hit that code path, which may make a single driver-owned buffer a better approach. But then you waste this RAM no matter what, even if it is never used.

Maybe a more flexible - but still simple - approach would be better for you? The bypassing code could be guarded either by a #define or by a configuration flag in lfs_config. This way the user could decide which option is better.

Or go with a solution that deals with required alignment and still has the ability to bypass caches? It is the most complex solution to this particular problem, but this doesn't mean it is itself very complex - deleting ~10 lines of code from one place is simpler than adding a few variables and additional lines in multiple places (;

Generally I'm as hesitant as you - there's no one solution that fixes all issues and is significantly better than other options - each one has some trade-offs /; I'm open to discussing the matter, as maybe I'm overthinking this and removing the pass-through is not the best approach here, I don't know...

geky commented 5 years ago

Maybe a more flexible - but still simple - approach would be better for you? The bypassing code could be guarded either by a #define or by a configuration flag in lfs_config. This way the user could decide which option is better.

I don't know if that's worth it; each #define multiplies the number of configurations that we should be testing and introduces a set of potential bugs.

Or go with a solution that deals with required alignment and still has the ability to bypass caches? It is the most complex solution to this particular problem, but this doesn't mean it is itself very complex - deleting ~10 lines of code from one place is simpler than adding a few variables and additional lines in multiple places (;

Agreed, I think just removing the passthrough is the best solution for now. And we can always add it back later. :)

geky commented 5 years ago

Merged https://github.com/ARMmbed/littlefs/pull/160 just now, let us know if there continues to be an issue. 👍

apmorton commented 5 years ago

Would be interesting to see the performance difference pre/post #160

How difficult is it to rerun the measurements and produce graphs like you did in the v2 alpha PR?

I only skimmed the code, but if I am reading this correctly there could be a non-trivial performance difference for large sequential reads.

In particular, if your hardware doesn't require aligned DMA and your flash device has a command set that can read at arbitrary positions and stream arbitrary lengths of data (SPI NOR flash, for example), then with the read pass-through I will see read calls down to my block device driver that are a full block in length, meaning my DMA hardware can move reads in large chunks directly into the application buffer.

Removing the pass-through changes this scenario significantly for the worse. Reads passed down to the block driver are now always limited to the cache size, and large sequential reads get split into cache-size chunks that must be copied by the CPU, using memcpy, from the cache to the application buffer.

For small files or small reads I would imagine the performance won't change much, since the bypass conditions would never be met anyway.

For reads that are at least block size in length the performance will be significantly worse. Depending on the discrepancy between your cache size and block size, maybe even several times worse.

For example, in my application I have a block size of 4k and a cache size of 512. One of my primary operations when reading from the filesystem is loading large files into an aligned memory buffer (files range between 4k and 64k). I basically open the file, call size, and then read the entire file in one go. With the change in #160 I will see 8 times as many reads to the block device, with dead time between each SPI transaction while the CPU copies the data from the cache before starting the next block device read.

FreddieChopin commented 5 years ago

@apmorton, I share some of your concerns, but some of your assumptions are wrong (;

  1. First of all, the scenario which you describe (reading a 4-64 kB file at once into a buffer) means you have a lot of RAM. Why not just set the cache size to the program size (block size) - or even bigger (does that make sense?) - and solve 99% of the problems described (see the sketch at the end of this comment)? The data will still have to be copied from the cache to your buffers, but memcpy() is pretty quick anyway.
  2. You would get sequential reads only if the payload data for files is actually written sequentially. I think (I may be wrong here) that anything above one block in size will never be sequential, due to the way littlefs stores data (there are back pointers which form the linked list of blocks). This actually means that even one block is not 100% sequential, as there are at least 4 bytes of metadata there.
  3. Even if I'm wrong in item 2, in v2 of littlefs there are no sequential writes anyway, which means that your chance of getting data written sequentially into consecutive blocks (assuming this is possible and that there is no metadata) is just pure luck.

As I said, I know that removing the pass-through has performance implications, but I would say that with a large-enough cache these are only the cost of memcpy() - you can get longer reads by just increasing the cache size. On the other hand, manually dealing with alignment of buffers is a real PITA and sometimes really hard and inconvenient (when littlefs is hidden under a FILE*).
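A rough sketch of the tweak from point 1, using apmorton's 4 KiB block size (the other values are illustrative placeholders, not recommendations):

#include "lfs.h"

static const struct lfs_config cfg = {
    // ... block device callbacks, buffers and the rest of the geometry ...
    .read_size = 16,
    .prog_size = 16,
    .block_size = 4096,
    .cache_size = 4096,   // one full block per cache instead of 512 bytes
    .lookahead_size = 64,
};

Note that every cache (read, prog, and one per open file) grows to this size, so the RAM cost scales accordingly.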

apmorton commented 5 years ago

  1. This is a poor assumption. I have a lot of RAM for the contents of this file (which is a virtual machine bytecode of sorts). It is by far the largest chunk of RAM in my application, and it actually makes the rest of the system rather resource-constrained. Additionally, increasing the lfs cache size is a really poor return on investment: RAM usage for the lfs caches is (2 + open_file_count) * cache_size, so even a 512-byte cache effectively consumes 1536 bytes.

  2. You are partially correct. The first data block in a file has no metadata header (because it has no blocks to point to). As I mentioned, you could previously read up to a block_size amount of file data in one block device operation (although this appears to have been broken, or just changed, at some point in the v2 cycle - this was definitely how it worked in v1).

I actually found a bug in lfs_bd_read while looking into this further: #167. Before #160, this bug was mostly hidden by the cache bypass path if your read_size was 1, 2, or 4.

After looking through the code more, I am pretty convinced that removing the read cache bypass is going to change performance characteristics.

There are a whole lot of cases where metadata operations that read a single tag at a time (following the ctz list, for example) would previously bypass the read cache but now will not. Those operations using the read cache is not really the problem, though - it's that those operations effectively clear the read cache.

You can actually see the effects of that here. On the left is pre-#160; on the right-hand side is after #160 (with the fix in #167).

The extra reads on the right occur because the small reads right before them clobbered the read cache. On the left, all the small reads after the 512-byte read at offset 516 skip the read cache, leaving it intact.

@geky it would definitely be interesting to see those performance graphs re-run after #160

FreddieChopin commented 5 years ago

@apmorton as I wrote previously (here or somewhere else [; ) these are the options to bring the cache bypass back AND deal with drivers which require alignment:

  1. make the cache bypass conditional on a #define, configurable globally
  2. make the cache bypass conditional on a flag from lfs_config, configurable per-instance
  3. implement an alignment-aware cache bypass and other code in littlefs, with alignment configurable via a new field in lfs_config
  4. as in 3, but with alignment configurable globally via a #define

I'm willing to implement any of these in a PR if @geky agrees and chooses one.

Maybe there are some other options (other than "ignore those who need aligned buffers" [; )? Maybe cache buffers with individual sizes (different for the read cache, different for the program cache, different for each file) would be a good option to make your use-case more optimal? For example, you could then set only one cache to 4 kB while leaving all the others small. Or maybe it would be possible to improve the caching somehow to avoid the problem you described (where small reads invalidate a cache holding a lot of data)?

As for the idea of caches with individual sizes, this may be quite tempting. I'm not 100% sure what each cache is for, but if the so-called "read" and "prog" caches only deal with metadata or directories, while the per-file cache is where reads/writes of actual data happen, it would be pretty convenient to have a large per-file cache while keeping the other two pretty small. As a large cache makes no sense for some files, it would be convenient to have this size configurable for each file too. This solution however has the problem that it is not so useful when littlefs is wrapped in something else (like a FILE*), as then it would not be possible to configure the per-file cache size easily...

It would also be nice if you could tell us what the time difference from these extra reads is now. It's clearly visible that these reads are there, but maybe this whole thing is now just 1% slower than before? Or maybe it's 50% slower - I don't know... All I see are 9 extra reads listed among ~170 others (I assume that in the left pane you have the cursor on the last line and this line is 169). This would be about 5% more reads, but I still have no idea whether it translates to a 1%, 5% or 50% time difference. I suppose that a case where you can read just 1 byte from the device is pretty special too (;

Generally I understand that without the cache bypass littlefs is slower, but at least it still works correctly - which, with the bypass, is not the case for drivers which really do require buffer alignment (;

apmorton commented 5 years ago

The image I showed in my previous post is just mounting the filesystem and calling mkdir on a directory that already exists to ensure it is there - the actual files I showed go on for thousands of lines, but the diff gets really hard to follow later on.

I agree this should likely be a configurable option - either at compile time or in lfs_config (although compile time is almost always preferable). Just an assumption, but if I had to wager I would say the most common use case is overwhelmingly a single lfs instance in an application. And in the cases where that isn't true, if the instances have different alignment constraints you can always use the larger of the two for both. Making alignment configurable in lfs_config would probably complicate the implementation more than it's worth.

I haven't had time to do any in-depth comparisons of my whole application so far - but for sure, without the fixes I posted in #167, the state of things after #160 is significantly less efficient in some cases. The cache bypass path hid the majority of the cases where the bug in #167 would be triggered.

Any read that happens with a small hint (almost all reads of metadata during directory/ctz traversal happen with a small hint) will now have a size determined by the smaller of:

This results in unnecessarily large reads that get larger the further into a block you go - for metadata blocks this ends up being super wasteful. You can see this behavior here (again, just mounting the fs and calling mkdir on an existing directory):

With the fix in #167 I would expect the main performance difference to be caused by dead time between SPI transfers. The extra memcpy is not ideal, and it's certainly a bummer to waste more memory bandwidth (especially when you already know your CPU is frequently stalling due to memory contention with the DMA hardware), but I don't think it's going to be the source of an orders-of-magnitude difference.

geky commented 5 years ago

Sorry about my absence again, life got in the way, hopefully a month wasn't too long...

@apmorton you're right a proper performance comparison would have been valuable. I mostly wanted to unblock @FreddieChopin before I went missing for an unknown amount of time.

My first priority is to work through the bugs that have been reported, so sorry if I'm slow to move this one forward.


  3. implement an alignment-aware cache bypass and other code in littlefs, with alignment configurable via a new field in lfs_config

@FreddieChopin, if you're still interested in this, this one would be my preference (could I suggest "cache_alignment"?). Though do we have a guess at how much code size it would cost?

I've been avoiding compile time configuration because of the risk of having too many combinations to test effectively. Also being able to run multiple littlefs instances on a single device is a plus.

Making alignment configurable in lfs_config would probably complicate the implementation more than it's worth.

Would this be more complicated than #defines?

This really depends on the code cost vs runtime cost.


Also a quick note, the priority of this repo is not to be the most efficient filesystem, not even the most efficient littlefs filesystem. The main priority is usability.

Compared to disk-level design, the software driver can change very easily.

If we see forks that are slimmed down for maximum size/speed, you can color me happy.


As for the idea of caches with individual sizes, this may be quite tempting. I'm not 100% sure what each cache is for, but if the so-called "read" and "prog" caches only deal with metadata or directories

I did look into cache-specific sizes, but it became more complicated than I thought it would be worth.

The read/prog caches are general purpose and used for more than just metadata. For instance, file flushing uses the read cache as a temporary file cache to scan the branch of the file being written out: https://github.com/ARMmbed/littlefs/blob/f35fb8c14866a4a4677756f6dbeca78f8a9b4001/lfs.c#L2497-L2504

But may be worth another look, not sure.

apmorton commented 5 years ago

At first glance it seems like you could balance the compile-time and runtime concerns with something like this in lfs_util.h

#define LFS_ALIGNMENT(lfs) ((lfs)->cfg->cache_alignment)

and similar for many other configurables.

You wouldn't be able to entirely get rid of the runtime cost using this mechanism, but for people who don't need these features, replacing the macros to return constants would potentially allow constant folding to collapse some of the work.
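A standalone sketch of that idea (cache_alignment is a hypothetical field, and the stub structs stand in for littlefs's own types):

#include <stdint.h>
#include <stdbool.h>

// stand-ins for the hypothetical cache_alignment config field
struct cfg_stub { uint32_t cache_alignment; };
struct lfs_stub { const struct cfg_stub *cfg; };

#ifndef LFS_ALIGNMENT
// default: runtime-configurable, read from the config struct
#define LFS_ALIGNMENT(lfs) ((lfs)->cfg->cache_alignment)
#endif
// a port with a single known alignment can instead build with e.g.
//   -D'LFS_ALIGNMENT(lfs)=32'
// so the modulo below folds to a constant mask at compile time

static inline bool buffer_is_aligned(const struct lfs_stub *lfs, const void *buffer) {
    return ((uintptr_t)buffer % LFS_ALIGNMENT(lfs)) == 0;
}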

geky commented 5 years ago

Oh that is a good idea. Though unfortunately there would still be the RAM cost for the config struct. Would be interesting how much code size that saves.

josesimoes commented 2 months ago

Same concern here! For .NET nanoFramework I'm migrating from SPIFFS to littlefs. The ChibiOS SPI drivers for the STM32 port use DMA, which requires buffers with 32-byte alignment. I'm assigning the read_buffer and prog_buffer to statically allocated (and aligned) buffers.

Storage is on various SPI serial flash memory chips (the AT25SF641 and W25Q128, for example), some with regular SPI, others with QSPI.

I'm getting corruption errors and wrong data reads... I suppose this would be a side effect of this misalignment.

geky commented 2 months ago

A quick workaround would be an additional buffer in your driver that you copy data to/from before sending it to DMA. This buffer could at least be shared between read and prog operations.

You could also check the alignment on reads/progs to opportunistically avoid the extra copy.


We could eventually add some sort of read_align/prog_align config to limit reads/progs to when the buffer is aligned, but this wouldn't cover other niche requirements such as limited DMA addressability, MMU-bypassing DMA, etc. It would also be highly likely to confuse users about the alignment on-disk vs in-RAM.

josesimoes commented 2 months ago

Thanks for the quick reply and suggestion! So... you're suggesting that I use only one buffer (properly aligned for DMA access) to deal with this, instead of the two I'm using right now. I can certainly try that and I'll report back.

Now, about checking the alignment on read/prog, how do you suggest I do that?

geky commented 2 months ago

Well, maybe three. littlefs will still need buffers for its internal pcache/rcache logic. Though you can leave those up to malloc if you have malloc in your system.

Now, about checking the alignment on read/prog, how do you suggest I do that?

If you have a pointer, you can check whether it's aligned by casting it to a uintptr_t:

__attribute__((aligned(32)))
uint8_t aligned_buffer[1*32]; // this can be increased for a RAM/performance tradeoff

int my_bd_prog(const struct lfs_config *c, lfs_block_t block,
        lfs_off_t off, const void *buffer, lfs_size_t size) {
    // 32-byte aligned buffer?
    if (((uintptr_t)buffer) % 32 == 0) {
        // pass directly to the underlying prog function
        return bd_prog(c, block, off, buffer, size);
    } else {
        // copy into the aligned buffer in chunks
        const uint8_t *data = buffer;
        while (size > 0) {
            lfs_size_t d = lfs_min(size, sizeof(aligned_buffer));
            memcpy(aligned_buffer, data, d);
            int err = bd_prog(c, block, off, aligned_buffer, d);
            if (err) {
                return err;
            }

            off += d;
            data += d;
            size -= d;
        }
        return 0;
    }
}

You may want a larger buffer so more is sent in a single transaction, but this would come at a RAM tradeoff.

You can also still provide explicit pcache/rcache buffers that are aligned so at least you know non-cache-bypassing reads/progs will avoid the extra copy.

Unfortunately, since the storage read/prog disk addresses probably also need to be aligned, you can't quite do the slow head/tail + fast body trick common in other alignment-required situations (SIMD)...

rbphilip commented 2 months ago

Amusing how this just popped into my email literally years after my last use of littlefs. Still being used by the customer as far as I know.

geky commented 2 months ago

@rbphilip good to hear!