
FUSE mount cache invalidation does not work with stat-prefetch disabled. #4281

Open · philipspencer opened this issue 6 months ago

philipspencer commented 6 months ago

Description of problem:

FUSE-mount cache invalidation does not work when stat-prefetch is disabled. In particular, it does not work on any volume to which the settings group "db-workload" has been applied, since that group turns stat-prefetch off.

It seems that fuse-bridge, although it carefully creates upcall notifications which get propagated through the translator graph, never actually listens for such notifications. It therefore relies on other, optional xlators such as stat-prefetch (md-cache) to do the actual work of invalidating an inode when a cache-invalidation notification is received.

If one has a volume holding data that is being accessed by several different hosts, and one disables stat-prefetch (such as by applying the "group db-workload" setting to it), the cache on one host is never invalidated when changes are made on a different host.
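For reference, disabling it via the group is a one-liner; the db-workload group ships with gluster and includes stat-prefetch=off among its settings:

gluster volume set testing group db-workload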

The exact commands to reproduce the issue: Create a volume "testing" with default options. Then:

gluster volume set testing features.cache-invalidation on
gluster volume set testing performance.cache-invalidation on
gluster volume set testing stat-prefetch off

Update: See comment https://github.com/gluster/glusterfs/issues/4281#issuecomment-1852503550 for a more concise test case that works using two mounts of the same volume on the same host.

Mount the volume under $MOUNT on two client hosts, with --fopen-keep-cache. (Export $MOUNT in the environment, so that the sh -c subshells in the later test cases can see it.)

On the first host, run

echo OLD > $MOUNT/t
cat $MOUNT/t

On the second host, run

echo NEW > $MOUNT/t

On the first host, run

cat $MOUNT/t

and it displays OLD, not NEW.

Update: it has been pointed out that, contrary to what the mount.glusterfs man page specifies, passing --fopen-keep-cache causes --auto-invalidation to no longer default to "on", so it is necessary to add "--auto-invalidation=on" explicitly to the mount options. In that case, auto-invalidation masks the problem by causing the simple test case above to work.
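Concretely, a mount line of the following shape (using the same -o syntax as the standalone test case later in this thread, with server:testing as a placeholder) keeps both options on:

mount -t glusterfs server:testing -o fopen-keep-cache,auto-invalidation=on $MOUNT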

Here is a revised example that will reproduce the problem even when auto-invalidation is explicitly turned on:

On the first host, run

 touch $MOUNT/lock; touch $MOUNT/lock2; flock $MOUNT/lock sh -c 'echo OLD > $MOUNT/t;  sleep 5; cat $MOUNT/t; stat $MOUNT/t > /dev/null'; flock $MOUNT/lock2 cat $MOUNT/t

On the second host, while the first host's sleep is running, run

flock $MOUNT/lock2 flock $MOUNT/lock sh -c 'echo NEW > $MOUNT/t' 

The full output of the command that failed:

host1$ cat $MOUNT/t
OLD

Or, with the updated test case that demonstrates the problem even when auto-invalidation is turned on:

OLD
OLD

Expected results:

host1$ cat $MOUNT/t
NEW

(sorry ... typos in previous version of this issue report ... fixed now).

Or, in the revised test case that reproduces the problem even when auto-invalidation is turned on:

OLD
NEW

If one does

gluster volume set testing stat-prefetch on

and repeats the tests, the expected result is achieved.

In preliminary testing, the following patch allows cache invalidation to work even when stat-prefetch is disabled:

diff --git a/xlators/mount/fuse/src/fuse-bridge.c b/xlators/mount/fuse/src/fuse-bridge.c
index 2dc9b4f429..4bdd8f30cf 100644
--- a/xlators/mount/fuse/src/fuse-bridge.c
+++ b/xlators/mount/fuse/src/fuse-bridge.c
@@ -18,6 +18,7 @@
 #include <glusterfs/syscall.h>
 #include <glusterfs/timespec.h>
 #include <glusterfs/async.h>
+#include <glusterfs/upcall-utils.h>

 #ifdef __NetBSD__
 #undef open /* in perfuse.h, pulled from mount-gluster-compat.h */
@@ -6505,6 +6506,8 @@ notify(xlator_t *this, int32_t event, void *data, ...)
     gf_boolean_t event_graph = _gf_true;
     glusterfs_graph_t *graph = NULL;
     struct pollfd pfd = {0};
+    struct gf_upcall *up_req = NULL;
+    inode_t *inode = NULL;

     private = this->private;

@@ -6594,6 +6597,26 @@ notify(xlator_t *this, int32_t event, void *data, ...)
             break;
         }

+        case GF_EVENT_UPCALL: {
+            up_req = (struct gf_upcall *) data;
+            if (up_req->event_type == GF_UPCALL_CACHE_INVALIDATION &&
+                     private->active_subvol &&
+                     private->active_subvol->graph &&
+                     private->active_subvol->graph->top) {
+                inode = inode_find(
+                     ((xlator_t *) private->active_subvol->graph->top)->itable,
+                     up_req->gfid);
+                if (inode) {
+                    /* Drop the kernel's cached data/attributes for
+                       this inode, then release the ref taken by
+                       inode_find(). */
+                    inode_invalidate(inode);
+                    inode_unref(inode);
+                }
+            }
+            break;
+        }
+
         default:
             /* Set the event_graph to false so that event
                debug msg would not try to access invalid graph->id

I can reformat and submit as an actual patch / merge request if desired.

mohit84 commented 6 months ago

If you want to use fopen-keep-cache, you also have to enable auto-invalidation; then you would not face this issue.

philipspencer commented 6 months ago

No, the underlying issue is still seen even when auto-invalidation is on.

There's a separate bug in the documentation: according to mount.glusterfs, auto-invalidation is supposed to be on by default, so explicitly enabling it should make no difference. That's why I built a simple test case without explicitly turning it on. But in fact, the default is only "on" if fopen-keep-cache is unspecified, so the glusterfs man page should be fixed.

It is true that explicitly enabling auto-invalidation in the mount command is enough to prevent my simple test case from reproducing the problem: FUSE does a stat call, sees that the file has been modified, and invalidates the cache itself without GlusterFS's help.

However, if you change the test case so that everything happens while the file's attributes are still held in the FUSE cache, then the problem is reproducible even with auto-invalidation turned on. Here is such an example:

On the first host, run

 touch $MOUNT/lock; touch $MOUNT/lock2; flock $MOUNT/lock sh -c 'echo OLD > $MOUNT/t;  sleep 5; cat $MOUNT/t; stat $MOUNT/t > /dev/null'; flock $MOUNT/lock2 cat $MOUNT/t

Then, on the second host, while the sleep command is running on the first host, run

flock $MOUNT/lock2 flock $MOUNT/lock sh -c 'echo NEW > $MOUNT/t' 

The sequencing of the locks causes the following operations to happen:

  1. Host 1 takes lock, writes OLD to the file, and sleeps for 5 seconds.
  2. While host 1 sleeps, host 2 takes lock2 and then blocks waiting for lock.
  3. Host 1 wakes, cats the file (filling the FUSE page cache with OLD) and stats it (filling the attribute cache), then releases lock.
  4. Host 2 acquires lock, writes NEW, and releases both locks.
  5. Host 1 acquires lock2 and cats the file again, well within the attribute-cache timeout, so FUSE serves the stale OLD contents from its cache.

The expected output is

OLD
NEW

but the actual output is

OLD
OLD

If you repeat this after enabling stat-prefetch on the volume, the expected output is correctly produced.

What's happening here is that, even though auto-invalidation is turned on, the FUSE client has no way to observe that the file has been written to on the remote host. The only mechanism for it to see that is to look for stat changes, but stats are cached (for 1 second by default), and turning that cache off (by mounting with attribute-timeout=0) would severely degrade performance.
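For completeness, that attribute cache can be disabled at mount time along the following lines (server:testing is a placeholder), but as noted the performance cost makes this a non-solution:

mount -t glusterfs server:testing -o attribute-timeout=0,entry-timeout=0 $MOUNT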

When stat-prefetch is enabled, it sees the upcall generated by the remote host and invalidates the inode, which causes the fuse mount client to invoke the "on invalidation" callback and tells FUSE to invalidate its cache.

But when stat-prefetch is disabled, and no other translator is enabled which will do that invalidation, the FUSE mount client will not see that the file has been changed. FUSE itself will see it, and invalidate its cache, only when it sees a stat change, but this won't happen until its attribute cache timeout has expired.

mohit84 commented 6 months ago

You are correct, it is a documentation bug. Do you think it makes sense for stat-prefetch (md-cache) to be disabled while cache-invalidation is on? md-cache is the main consumer of the upcall cache event notification; I don't think it is a good idea to enable cache-invalidation but disable md-cache.

mohit84 commented 6 months ago

Can you share the volume info output "gluster v info" ?

philipspencer commented 6 months ago

Can you share the volume info output "gluster v info" ?

It is in the bug report: all default settings, except

Options reconfigured:
features.cache-invalidation: on
performance.cache-invalidation: on
performance.stat-prefetch: off (on for the working case).

Plus, gluster volume info shows three additional settings that it always reports as "reconfigured" even when they are at their defaults:

storage.fips-mode-rchecksum: on
transport.address-family: inet
nfs.disable: on

Besides, as you point out, since md-cache/stat-prefetch is the main consumer of the cache invalidation upcall, logically there is no way for cache invalidation to work if it is disabled. FUSE cache auto-invalidation relies on detecting a change to the file's attributes, which it has no way of detecting while those attributes are cached. And there is nothing else that will invalidate the FUSE cache.
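One way to confirm that nothing else in the client graph consumes the invalidation upcall is to dump the client volfile and look for md-cache; assuming the standard getspec query is available, something like:

gluster system:: getspec testing | grep -n md-cache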

Also, the bug report gives a step-by-step procedure for reproducing the problem on any gluster installation (start by creating a volume with default options, add the settings above, mount it on two hosts with the options specified above, and run the commands specified above), so no further information should be required unless someone tries that procedure on a system and gets different results.

Do you think does it make sense stat-prefetch(md-cache) is disabled but cache-invalidation is on. ... i don;t think it is a good idea to enable cache-invalidation but disable md-cache.

Disabling stat-prefetch is done by gluster volume set $VOLNAME group db-workload, which is a natural thing to reach for when performance-tuning a volume that holds shared database files. And, of course, shared database files are exactly where data consistency is crucial. We ran into this issue because several SQLite database files were getting corrupted when updated from different hosts -- even when the updates were sequentially coordinated by means of a lock file. The operations were only safe if the updates were more than one second apart in time (enough for the FUSE mount's attribute/entry caches to expire).

After undoing all the setting changes made by group db-workload, the databases passed all our concurrency testing even when updated simultaneously from different hosts. Not only that, they also worked flawlessly even when the concurrent updates were not coordinated by any explicit locking mechanism; SQLite's own internal locking was sufficient. And this, clearly, is very desirable.

We still wanted the performance boost provided by group db-workload, so we turned the various options back on until we discovered that "stat-prefetch=off" was the one causing the corruption. We tested applying all of the group db-workload options together except for stat-prefetch=off, and again everything worked flawlessly with no database corruption. As soon as we set stat-prefetch=off to complete the group db-workload options, the corruption occurred again. That's when I dug into the source code and found the issue reported here.

Personally I think that, because the upcall is generated by the FUSE client on one host, it should be responded to by the FUSE client on the other host, not solely by the optional md-cache translator.

If, however, it is decided that md-cache should be considered mandatory for cache invalidation to work, then:

  1. Documentation should be added to the description of features.cache-invalidation/performance.cache-invalidation explaining that they will not work unless stat-prefetch is also enabled, and to performance.stat-prefetch stating that, if it is turned off, FUSE cache invalidation will not work; and
  2. The "stat-prefetch=off" setting should be removed from group db-workload. Or, documentation should be added to group db-workload saying that it is only appropriate for single-host mounts (such as one holding PostgreSQL or MySQL files for a single database server, where there is guaranteed to be a pause before another process takes over serving from those same files, such as during a container restart) and is not appropriate for shared-file database systems like SQLite. Perhaps a separate group shared-db-workload, sketched below, could be created to optimize tuning for a volume holding write-heavy database files that will be updated from several mount points concurrently.
mohit84 commented 6 months ago

I want to check the volume topology. I tried to reproduce the issue on a 2x3 volume after configuring similar volume options, but I was not able to reproduce it.

philipspencer commented 6 months ago

It was a simple 2-replica, 1-thin-arbiter volume. The topology is irrelevant though, unless you've got something else in the client graph that listens for cache invalidation upcalls.
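For reference, a volume of that shape is created along these lines (host names and brick paths are placeholders):

gluster volume create testing replica 2 thin-arbiter 1 host1:/bricks/b1 host2:/bricks/b2 arbiter:/bricks/ta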

Also, my volume server is still running gluster 9.x, so for greater clarity:

Here is a test case that can be run on a standalone machine with the latest glusterfs 11.1 (or even a development snapshot, if desired).

This test case uses a single brick with no replicas and two mounts on the same machine, so there is no need to mess with lock files to get the timing right. Just run the following commands on a spare machine:

  1. Install glusterfs server and client.
  2. Start glusterd.
  3. Create a partition or logical volume and mount it on /gluster-test
  4. Run the following commands as root or under sudo:
    $ MYADDRESS=`hostname -I | sed 's/ //'`  # Or any other way of getting IP address or hostname for the test machine.
    $ mkdir /gluster-test/brick
    $ gluster volume create testing $MYADDRESS:/gluster-test/brick
    $ gluster volume set testing features.cache-invalidation on
    $ gluster volume set testing performance.cache-invalidation on
    $ gluster volume start testing
    $ mkdir /tmp/mnt1
    $ mkdir /tmp/mnt2
    $ mount -t glusterfs $MYADDRESS:testing -o fopen-keep-cache,auto-invalidation=on /tmp/mnt1
    $ mount -t glusterfs $MYADDRESS:testing -o fopen-keep-cache,auto-invalidation=on /tmp/mnt2
    $ gluster volume set testing performance.stat-prefetch off
    $ echo OLD > /tmp/mnt1/t; cat /tmp/mnt1/t; stat /tmp/mnt1/t > /dev/null; echo NEW > /tmp/mnt2/t; cat /tmp/mnt1/t
    OLD
    OLD    <--- this is the wrong result
    $ gluster volume set testing performance.stat-prefetch on
    $ echo OLD > /tmp/mnt1/t; cat /tmp/mnt1/t; stat /tmp/mnt1/t > /dev/null; echo NEW > /tmp/mnt2/t; cat /tmp/mnt1/t
    OLD
    NEW    <--- this is the right result

Do you get something different when you run exactly those commands?

philipspencer commented 6 months ago

Here's a more concise reproducer, again assuming the same volume is mounted in two places (/tmp/mnt1 and /tmp/mnt2) on the same host, and explicitly using /bin/stat in case your shell has a built-in stat that doesn't trigger the same behaviour:

# gluster volume set testing stat-prefetch off
volume set: success
# echo WRONG > /tmp/mnt1/t; (cat; /bin/stat -) < /tmp/mnt1/t > /dev/null;  echo RIGHT > /tmp/mnt2/t; cat /tmp/mnt1/t
WRONG
# gluster volume set testing stat-prefetch on 
volume set: success
# echo WRONG > /tmp/mnt1/t; (cat; /bin/stat -) < /tmp/mnt1/t > /dev/null;  echo RIGHT > /tmp/mnt2/t; cat /tmp/mnt1/t
RIGHT
philipspencer commented 6 months ago

And here is a more sophisticated test that really checks whether cache invalidation is working, by keeping the file open so that fopen-keep-cache has no effect (other than issue #4280, where cache invalidation is explicitly disabled when fopen-keep-cache is off):

$ echo WRONG > /tmp/mnt1/t; ( (cat; /bin/stat -)  > /dev/null;  echo RIGHT > /tmp/mnt2/t; cat <&3 ) 3</tmp/mnt1/t < /tmp/mnt1/t 

It should output RIGHT. However, if the filesystems are mounted with fopen-keep-cache=off, it fails and reports WRONG, demonstrating issue #4280; and if the volume does not have stat-prefetch on, it also fails and reports WRONG, demonstrating this issue.

This is true regardless of the mount setting of auto-invalidation.

If stat-prefetch is disabled on the volume, this one-liner outputs WRONG for all combinations of fopen-keep-cache and auto-invalidation mount options.

If stat-prefetch is enabled on the volume, this one-liner outputs RIGHT if mounted with fopen-keep-cache=on or unspecified, and WRONG if mounted with fopen-keep-cache=off, regardless of the auto-invalidation mount option.