superfly / litefs

FUSE-based file system for replicating SQLite databases across a cluster of machines
Apache License 2.0

File size changes cause OS page cache invalidation #149

Closed benbjohnson closed 1 year ago

benbjohnson commented 1 year ago

When running the repro program from @dangra in this comment, I'm seeing multi-second latency for read queries on the replica.

Update: The read in the repro program fetches every page in the database. With the current FUSE library, a change in file size invalidates the OS page cache for the whole file, so every page must be refetched, which is slow.

benbjohnson commented 1 year ago

Update: It looks like the FUSE layer invalidates the whole database file from the cache when it applies an LTX file, so a SELECT COUNT(*) query ends up re-reading the entire file from LiteFS, which is slow.

I'm still working on a small, reproducible script. LiteFS was missing the fuse.OpenKeepCache flag on Create() and Open(), but the issue persists even after adding it.


@tv42 Do you have any ideas off the top of your head why this might be occurring? I'm seeing InvalidateNode calls in the debug output like this:

74FD7445AF470FB062707E8A [r]: => InvalidateNode 0x2 Off:0 Size:4096
74FD7445AF470FB062707E8A [r]: => InvalidateNode 0x2 Off:497082368 Size:4096
74FD7445AF470FB062707E8A [r]: => InvalidateNode 0x2 Off:497086464 Size:4096

Node 0x2 is the database file. I'm also invalidating a different file (the shared memory, or SHM file) with the node id of 0x4:

74FD7445AF470FB062707E8A [r]: => InvalidateNode 0x4 Off:0 Size:-1

I wouldn't think that would affect the OS page cache of node 0x2 though.
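For readers following the debug output, the offsets for node 0x2 can be mapped to SQLite page numbers with simple arithmetic. This is a minimal sketch that assumes a 4096-byte page size (taken from the Size:4096 field) and 1-based page numbering; the helper name pageNumber is hypothetical. The result is page 1 plus two adjacent pages at a high offset, which would be consistent with a database that is being appended to.

```go
package main

import "fmt"

// pageSize is an assumption taken from the Size:4096 field in the log lines.
const pageSize = 4096

// pageNumber converts a byte offset in the database file into a
// 1-based SQLite page number. Hypothetical helper for illustration.
func pageNumber(off int64) int64 {
	return off/pageSize + 1
}

func main() {
	// Offsets copied from the InvalidateNode lines for node 0x2.
	for _, off := range []int64{0, 497082368, 497086464} {
		fmt.Printf("Off:%d -> page %d\n", off, pageNumber(off))
	}
}
```

The three offsets map to pages 1, 121359, and 121360, so only three pages were explicitly invalidated even though the whole file was being dropped from cache.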

tv42 commented 1 year ago

Does the file change size? That currently triggers data invalidation:

        if (oldsize != attr->size) {
            truncate_pagecache(inode, attr->size);
            if (!fc->explicit_inval_data)
                inval = true;

explicit_inval_data is set via FUSE_EXPLICIT_INVAL_DATA, which the library does not support yet (it was added in protocol v7.30; we're still at 7.17).
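The decision in the quoted branch can be modeled in a few lines of Go. This is only a sketch of that one branch, not the kernel's full attribute-update logic; the conn type and sizeChangeInvalidates function are hypothetical names, and the file sizes are illustrative.

```go
package main

import "fmt"

// conn models the one connection flag relevant here. The field mirrors
// fc->explicit_inval_data from the kernel excerpt; the type is hypothetical.
type conn struct {
	explicitInvalData bool
}

// sizeChangeInvalidates reports whether a size change seen in refreshed
// attributes would mark the file's cached data for invalidation. It models
// only the quoted branch; other kernel paths are out of scope.
func sizeChangeInvalidates(c conn, oldSize, newSize uint64) bool {
	if oldSize == newSize {
		return false // the quoted branch is not taken
	}
	return !c.explicitInvalData
}

func main() {
	oldSize, newSize := uint64(497086464), uint64(497090560) // grew by one 4 KiB page

	fmt.Println(sizeChangeInvalidates(conn{explicitInvalData: false}, oldSize, newSize))
	fmt.Println(sizeChangeInvalidates(conn{explicitInvalData: true}, oldSize, newSize))
	fmt.Println(sizeChangeInvalidates(conn{explicitInvalData: false}, oldSize, oldSize))
}
```

Under this model, a growing file triggers invalidation on every size change unless the explicit-invalidation flag is set, in which case the FUSE server is expected to invalidate ranges itself.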

benbjohnson commented 1 year ago

@tv42 Ah, that would probably be it. It's a script that just inserts data so the database would keep growing and change size.

How difficult is it to support FUSE_EXPLICIT_INVAL_DATA? I was looking at fuse_kernel.h and it shows FUSE_WRITEBACK_CACHE at 7.23 but I think bazil.org/fuse supports that, right? Is support for FUSE_EXPLICIT_INVAL_DATA something that could be added piecemeal or does everything need to be supported between 7.17 and 7.30?

tv42 commented 1 year ago

I'll look into it.

tv42 commented 1 year ago

Status update: I'm at FUSE protocol v7.19 now; FUSE_EXPLICIT_INVAL_DATA is in v7.30. Many of the changes in between tell the kernel to send new kinds of messages toward userspace, so I can't just pick and choose and "skip the queue". (I might be able to choose not to handle them, but I want to understand the consequences and decide case by case.)

FUSE_WRITEBACK_CACHE is not supported at this time; we're in writethrough mode by default. In writeback mode, writes are sent to the FUSE server only lazily. See https://www.kernel.org/doc/Documentation/filesystems/fuse-io.rst
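To illustrate why the library's protocol version matters, here is a rough Go sketch of gating features by the minor version a library implements. The version numbers are the ones mentioned in this thread (7.23 and 7.30); the version type, the introducedIn table, and the available function are hypothetical and are not part of bazil.org/fuse.

```go
package main

import "fmt"

// version is a FUSE protocol version such as 7.19. Hypothetical type.
type version struct{ major, minor uint32 }

// atLeast reports whether v is the same as or newer than o.
func (v version) atLeast(o version) bool {
	return v.major > o.major || (v.major == o.major && v.minor >= o.minor)
}

// introducedIn records the protocol version each feature first appeared in,
// using only the numbers quoted in this thread.
var introducedIn = map[string]version{
	"FUSE_WRITEBACK_CACHE":     {7, 23},
	"FUSE_EXPLICIT_INVAL_DATA": {7, 30},
}

// available reports whether a library implementing protocol version lib
// could negotiate the named feature.
func available(lib version, feature string) bool {
	v, ok := introducedIn[feature]
	return ok && lib.atLeast(v)
}

func main() {
	for _, lib := range []version{{7, 17}, {7, 19}, {7, 30}} {
		fmt.Printf("v%d.%d: explicit_inval_data=%v writeback_cache=%v\n",
			lib.major, lib.minor,
			available(lib, "FUSE_EXPLICIT_INVAL_DATA"),
			available(lib, "FUSE_WRITEBACK_CACHE"))
	}
}
```

The sketch treats versions as a simple threshold. It does not capture the point made above that each intermediate version can also introduce new messages the server has to handle.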

benbjohnson commented 1 year ago

@tv42 Thanks for researching that, Tv. It'll eventually be a higher-priority issue for us, but we have a few other things to get done on LiteFS first.

For anyone reading this in the future, this mainly affects databases that:

  1. Are large-ish
  2. Are frequently growing via INSERT commands
  3. Perform large scans of tables

tv42 commented 1 year ago

Fixed! See MountOption fuse.ExplicitInvalidateData in https://github.com/bazil/fuse/commit/b2cd994c4fa7b3c1c9819cd139a8a46d7af2e175

Caveat: I haven't tested it yet.

benbjohnson commented 1 year ago

🎉 🙌 🎉 🙌 🎉 🙌 🎉 🙌

benbjohnson commented 1 year ago

@tv42 I'll give it a try this week. Thank you!