Open philipspencer opened 6 months ago
If you want to use fopen-keep-cache, you have to also enable auto-invalidation; then you would not face this issue.
No, the underlying issue is still seen even when auto-invalidation is on.
There's a separate bug in the documentation: according to mount.glusterfs, auto-invalidation is supposed to be on by default, so explicitly enabling it should make no difference. That's why I built a simple test case without explicitly turning it on. In fact, however, the default is only "on" when fopen-keep-cache is unspecified, so the glusterfs man page should be fixed.
It is true that explicitly enabling auto-invalidation in the mount command is enough to prevent my simple test case from reproducing the problem, because FUSE does a stat call and sees that the file has been modified and then invalidates the cache itself without GlusterFS's help.
However, if you change the test case so that everything happens while the file's attributes are still held in the FUSE cache, then the problem is reproducible even with auto-invalidation turned on. Here is such an example:
On the first host, run
touch $MOUNT/lock; touch $MOUNT/lock2; flock $MOUNT/lock sh -c 'echo OLD > $MOUNT/t; sleep 5; cat $MOUNT/t; stat $MOUNT/t > /dev/null'; flock $MOUNT/lock2 cat $MOUNT/t
Then, on the second host, while the sleep command is running on the first host, run
flock $MOUNT/lock2 flock $MOUNT/lock sh -c 'echo NEW > $MOUNT/t'
The sequencing of the locks forces the second host's write to happen between the first host's two reads: the first host writes OLD and reads it (populating the FUSE caches), then the second host writes NEW, and finally the first host reads the file again.
The expected output is
OLD
NEW
but the actual output is
OLD
OLD
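The lock choreography above can be verified locally on an ordinary filesystem, where there is no client-side cache to go stale, so the sequencing itself can be checked independently of GlusterFS (the temp-directory paths are illustrative, and the sleeps are shortened):

```shell
#!/bin/sh
# Simulate the two "hosts" as two local processes coordinated by the same
# flock choreography as above.  With no GlusterFS in the picture the local
# page cache is coherent, so this shows the ordering that *should* produce
# OLD followed by NEW.
D=$(mktemp -d)
touch "$D/lock" "$D/lock2"

# "Second host": grab lock2 first, then wait for lock before writing NEW.
( sleep 0.2; flock "$D/lock2" flock "$D/lock" sh -c "echo NEW > '$D/t'" ) &

# "First host": write OLD and read it while holding lock, then take lock2
# (which forces a wait until NEW has been written) and read again.
OUT=$(
  flock "$D/lock" sh -c "echo OLD > '$D/t'; sleep 1; cat '$D/t'"
  flock "$D/lock2" cat "$D/t"
)
wait
rm -rf "$D"
echo "$OUT"    # OLD, then NEW
```

Running this against a gluster mount instead of a local directory is exactly the test case above; the divergence (OLD, OLD) only appears once the FUSE caches are involved.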
If you repeat this after enabling stat-prefetch on the volume, the expected output is correctly produced.
What's happening here is that, even though auto-invalidation is turned on, the FUSE client has no way to observe that the file has been written to on the remote host. The only mechanism for it to see that is by looking for stat changes, but stats are cached (for 1 second by default), and turning off that cache (by mounting with attribute-timeout=0) would severely degrade performance.
When stat-prefetch is enabled, it sees the upcall generated by the remote host and invalidates the inode, which causes the FUSE mount client to invoke the "on invalidation" callback and tell FUSE to invalidate its cache.
But when stat-prefetch is disabled, and no other translator is enabled which will do that invalidation, the FUSE mount client will not see that the file has been changed. FUSE itself will see it, and invalidate its cache, only when it sees a stat change, but this won't happen until its attribute cache timeout has expired.
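To gather the knobs discussed so far in one place, here is a sketch using the option names that appear in this thread (the volume name, address variable, and mount point are placeholders):

```shell
# Server side: generate upcall invalidation notifications ...
gluster volume set testing features.cache-invalidation on
gluster volume set testing performance.cache-invalidation on
# ... and keep md-cache (the consumer of those upcalls) enabled:
gluster volume set testing performance.stat-prefetch on

# Client side: keep the page cache across opens, with auto-invalidation
# explicitly on (attribute-timeout shown at its 1-second default).
mount -t glusterfs $MYADDRESS:testing \
      -o fopen-keep-cache,auto-invalidation=on,attribute-timeout=1 /tmp/mnt1
```

With stat-prefetch off, the first two settings still generate upcalls, but nothing in the client graph acts on them, which is the bug being reported.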
You are correct, it is a documentation bug. Do you think it makes sense for stat-prefetch (md-cache) to be disabled while cache-invalidation is on? md-cache is the main consumer of the upcall cache-event notifications; I don't think it is a good idea to enable cache-invalidation but disable md-cache.
Can you share the volume info output "gluster v info" ?
It is in the bug report: all default settings, except
Options reconfigured:
features.cache-invalidation: on
performance.cache-invalidation: on
performance.stat-prefetch: off (on for the working case).
Plus gluster volume info shows three additional settings that it always shows as "reconfigured" even when they are the default:
storage.fips-mode-rchecksum: on
transport.address-family: inet
nfs.disable: on
Besides, as you point out, since md-cache/stat-prefetch is the main consumer of the cache invalidation upcall, logically there is no way for cache invalidation to work if it is disabled. FUSE cache auto-invalidation relies on detecting a change to the file's attributes, which it has no way of detecting while those attributes are cached. And there is nothing else that will invalidate the FUSE cache.
Also, the bug report gives a step-by-step procedure for reproducing the problem on any gluster installation (start by creating a volume with default options, add the settings above, mount on two hosts with the options specified above, and run the commands specified above), so no further information should be required unless someone tries that step-by-step procedure on a system and gets different results.
Do you think it makes sense for stat-prefetch (md-cache) to be disabled while cache-invalidation is on? ... I don't think it is a good idea to enable cache-invalidation but disable md-cache.
Disabling stat-prefetch is done by "gluster volume set $VOLNAME group db-workload", which is a natural thing to reach for when performance-tuning a volume that holds shared database files. And, of course, shared database files are where data consistency is crucial. We ran into this issue because several SQLite database files were getting corrupted when updated from different hosts -- even when the updates were sequentially coordinated by means of a lock file. Only if the updates were more than one second apart in time (allowing the FUSE mount's attribute/entry caches to expire) were the operations safe.
After undoing all the setting changes made by group db-workload, the databases passed all our concurrency testing even when updated simultaneously from different hosts. Not only that, they also worked flawlessly even when the concurrent updates were not coordinated by any explicit locking mechanism; SQLite's own internal locking was sufficient. And this, clearly, is very desirable.
We still wanted the performance boost provided by group db-workload, so we turned its options back on one at a time until we discovered that stat-prefetch=off was the one that caused the corruption. We tested applying all the rest of the group db-workload options together, except stat-prefetch=off, and again everything worked flawlessly with no database corruption. As soon as we added stat-prefetch=off to complete the group db-workload options, the corruption occurred again. That's when I dug into the source code and found the issue being reported here.
Personally I think that, because the upcall is being generated by the FUSE client on one host, it should be responded to by the FUSE client on the other host, not solely by the optional md-cache translator.
If, however, it is decided that md-cache should be considered mandatory for cache invalidation to work, then stat-prefetch should no longer be disabled by group db-workload. Or, documentation should be added to group db-workload saying that it is only appropriate for single-host mounts (such as those holding PostgreSQL or MySQL files for a single database server, where there is guaranteed to be a pause before another process takes over serving from those same files, such as during a container restart) and is not appropriate for shared-file database systems like SQLite. Perhaps a separate group shared-db-workload could be created to optimize tuning for a volume holding write-heavy database files that will be updated from several mount points concurrently.

I wanted to check the volume topology. I tried to reproduce the issue on a 2x3 volume after configuring similar volume options, but I was not able to reproduce it.
It was a simple 2-replica, 1-thin-arbiter volume. The topology is irrelevant though, unless you've got something else in the client graph that listens for cache invalidation upcalls.
Also, my volume server is still running gluster 9.x so, for greater clarity:
Here is a test case that can be run on a standalone machine with the latest glusterfs 11.1 (or even a development snapshot if desired).
This test case uses a single brick with no replicas and two mounts on the same machine, so there is no need to mess with lock files to get the timing right. Just run the following commands on a spare machine:
$ MYADDRESS=`hostname -I | sed 's/ //'` # Or any other way of getting IP address or hostname for the test machine.
$ mkdir -p /gluster-test/brick
$ gluster volume create testing $MYADDRESS:/gluster-test/brick
$ gluster volume set testing features.cache-invalidation on
$ gluster volume set testing performance.cache-invalidation on
$ gluster volume start testing
$ mkdir /tmp/mnt1
$ mkdir /tmp/mnt2
$ mount -t glusterfs $MYADDRESS:testing -o fopen-keep-cache,auto-invalidation=on /tmp/mnt1
$ mount -t glusterfs $MYADDRESS:testing -o fopen-keep-cache,auto-invalidation=on /tmp/mnt2
$ gluster volume set testing performance.stat-prefetch off
$ echo OLD > /tmp/mnt1/t; cat /tmp/mnt1/t; stat /tmp/mnt1/t > /dev/null; echo NEW > /tmp/mnt2/t; cat /tmp/mnt1/t
OLD
OLD <--- this is the wrong result
$ gluster volume set testing performance.stat-prefetch on
$ echo OLD > /tmp/mnt1/t; cat /tmp/mnt1/t; stat /tmp/mnt1/t > /dev/null; echo NEW > /tmp/mnt2/t; cat /tmp/mnt1/t
OLD
NEW <--- this is the right result
Do you get something different when you run exactly those commands?
Here's a more concise replicator, again assuming the same volume is mounted in two places (/tmp/mnt1 and /tmp/mnt2) on the same host, and explicitly using /bin/stat in case your shell has a built-in stat that isn't triggering the same behaviour:
# gluster volume set testing stat-prefetch off
volume set: success
# echo WRONG > /tmp/mnt1/t; (cat; /bin/stat -) < /tmp/mnt1/t > /dev/null; echo RIGHT > /tmp/mnt2/t; cat /tmp/mnt1/t
WRONG
# gluster volume set testing stat-prefetch on
volume set: success
# echo WRONG > /tmp/mnt1/t; (cat; /bin/stat -) < /tmp/mnt1/t > /dev/null; echo RIGHT > /tmp/mnt2/t; cat /tmp/mnt1/t
RIGHT
And here is a more sophisticated test that really tests whether or not cache invalidation is working by keeping the file open, so there is no effect from fopen-keep-cache (other than issue #4280 where cache invalidation is explicitly disabled when fopen-keep-cache is off):
$ echo WRONG > /tmp/mnt1/t; ( (cat; /bin/stat -) > /dev/null; echo RIGHT > /tmp/mnt2/t; cat <&3 ) 3</tmp/mnt1/t < /tmp/mnt1/t
It should output RIGHT. However, if the filesystems are mounted with fopen-keep-cache=off, it fails and reports WRONG, thereby demonstrating issue #4280, and, if the volume does not have stat-prefetch on, it also fails and reports WRONG, thereby demonstrating this issue.
This is true regardless of the mount setting of auto-invalidation.
If stat-prefetch is disabled on the volume, this one-liner outputs WRONG for all combinations of fopen-keep-cache and auto-invalidation mount options.
If stat-prefetch is enabled on the volume, this one-liner outputs RIGHT if mounted with fopen-keep-cache=on (or unspecified), and WRONG if mounted with fopen-keep-cache=off, regardless of the auto-invalidation mount option.
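As a sanity check, the same fd-3 pipeline run against a plain local directory (no GlusterFS, so the page cache is coherent) prints RIGHT, confirming that a WRONG result on a gluster mount is caching behaviour rather than a quirk of the redirections:

```shell
#!/bin/sh
# Run the keep-the-file-open test against a plain local file.  fd 3 is a
# separate open of the same file at offset 0, so the final cat re-reads the
# contents written by the intervening echo; on a coherent local filesystem
# the output must be RIGHT.
D=$(mktemp -d)
echo WRONG > "$D/t"
OUT=$( ( (cat; /bin/stat -) > /dev/null; echo RIGHT > "$D/t"; cat <&3 ) \
       3< "$D/t" < "$D/t" )
rm -rf "$D"
echo "$OUT"    # RIGHT
```

Any deviation from RIGHT when the same redirections are pointed at a gluster mount is therefore attributable to the client-side caching described in this issue.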
Description of problem:
FUSE-mount cache invalidation does not work when stat-prefetch is disabled. And, in particular, it does not work on a volume to which the settings group "db-workload" is applied.
It seems that fuse-bridge, although it carefully creates upcall notifications which get propagated through the translator graph, never actually listens for such notifications. Therefore, it is reliant on other optional xlators such as stat-prefetch to do the actual work of invalidating an inode when a cache invalidation notification is received.
If one has a volume holding data that is being accessed by several different hosts, and one disables stat-prefetch (such as by applying the "group db-workload" setting to it), the cache on one host is never invalidated when changes are made on a different host.
The exact command to reproduce the issue: Create a volume "testing" with default options. Then:
Mount the volume under $MOUNT on two client hosts, with --fopen-keep-cache.
On the first host, run
On the second host, run
On the first host, run
and it displays OLD not NEW.
Update: it has been pointed out that, contrary to what is specified in mount.glusterfs, specifying --fopen-keep-cache causes --auto-invalidation to no longer default to "on", so it is necessary to explicitly add "--auto-invalidation=on" to the mount options. In that case, auto-invalidation will mask the problem by causing the above simple test case to work.
Here is a revised example that will reproduce the problem even when auto-invalidation is explicitly turned on:
On the first host, run
On the second host, while the first host's sleep is running, run
The full output of the command that failed:
Or, with the updated test case that demonstrates the problem even when auto-invalidation is turned on:
Expected results:
(sorry ... typos in previous version of this issue report ... fixed now).
Or, in the revised test case that reproduces the problem even when auto-invalidation is turned on:
If one does
and repeats the tests, the expected result is achieved.
The following patch seems to work on preliminary testing to allow cache invalidation to work even when stat-prefetch is disabled:
I can reformat and submit as an actual patch / merge request if desired.