gluster / glusterfs

Gluster Filesystem : Build your distributed storage in minutes
https://www.gluster.org
GNU General Public License v2.0

features/shard: delay unlink of a file that has fd_count > 0 #1563

Closed vh05 closed 3 years ago

vh05 commented 3 years ago

When multiple processes are working on a file and one of them unlinks it, the unlink operation must not harm the other processes that still have the file open. This is POSIX-compliant behavior, and it should also be honored when the shard feature is enabled.
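
To make the expected semantics concrete, here is a minimal standalone C illustration (not part of the patch; the file name used is arbitrary). It unlinks a file while an fd on it is still open and keeps doing I/O through that fd, which is what must keep working on a sharded volume even when the open and the unlink come from different clients.

```c
/*
 * Minimal standalone illustration of the POSIX semantics described above:
 * a file that still has an open fd stays readable/writable after its name
 * is unlinked. Not part of the patch; run it on any POSIX filesystem or
 * on a gluster mount point.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    const char *path = "testfile";   /* arbitrary example name */
    int fd = open(path, O_CREAT | O_RDWR | O_TRUNC, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Plays the role of client C2: the name goes away. */
    if (unlink(path) < 0) {
        perror("unlink");
        return 1;
    }

    /* Plays the role of client C1: I/O through the open fd must keep working. */
    const char msg[] = "still writable after unlink\n";
    if (write(fd, msg, sizeof(msg) - 1) < 0) {
        perror("write");
        return 1;
    }

    char out[64] = {0};
    lseek(fd, 0, SEEK_SET);
    if (read(fd, out, sizeof(out) - 1) < 0) {
        perror("read");
        return 1;
    }
    printf("read back: %s", out);

    close(fd);   /* the storage is reclaimed only on the last close */
    return 0;
}
```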

Problem description: consider 2 clients, C1 and C2, working on a file F1 with 5 shards on a gluster mount, where the gluster server has 4 bricks B1, B2, B3 and B4.

Assume the base file/shard is on B1, the 1st and 2nd shards are on B2, the 3rd and 4th shards are on B3, and the 5th shard is on B4. C1 has opened F1 in append mode and is writing to it. The write FOP goes to the 5th shard in this case, so inode->fd_count = 1 on B1 (base file) and B4 (5th shard).

At the same time, C2 issues an unlink on F1. On the server, the base file has fd_count = 1 (since C1 has the file open), so the base file is renamed under .glusterfs/unlink and the call returns to C2. Unlink is then sent to the shards on all bricks, and the shards on B2 and B3, which have no open reference yet, are deleted. C1 then starts getting errors while accessing the remaining shards even though it holds open references on the file.

This is one such undefined behavior. We will run into many more like it, because there is no single global lock under which all shards are accessed as one unit. Of course, such a global lock would come with a performance hit, since it reduces the window for parallel access to the shards.

Solution: the undefined behavior above can be addressed by delaying the unlink of a file while there are open references on it. File unlink happens in 2 steps (sketched below):

step 1: the client creates a marker file under .shard/remove_me and sends the unlink of the base file to the server.

step 2: on return from the server, the associated shards are cleaned up and finally the marker file is removed.
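
The following C sketch only illustrates the shape of this two-step flow. It is not the shard translator code: the real implementation works through gluster FOPs rather than direct syscalls, and the shard naming scheme (.shard/<gfid>.N) and helper names used here are assumptions made for the example.

```c
/*
 * Illustrative sketch of the two-step unlink flow, not the actual shard
 * xlator code. Paths, shard naming and helpers are assumptions for the
 * example; the real work goes through gluster FOPs.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Step 1: create a marker named after the base file's gfid under
 * .shard/remove_me, then unlink the base file itself. */
static int shard_unlink_step1(const char *gfid, const char *base_path)
{
    char marker[4096];
    snprintf(marker, sizeof(marker), ".shard/remove_me/%s", gfid);

    int fd = open(marker, O_CREAT | O_WRONLY, 0644);
    if (fd < 0)
        return -1;
    close(fd);

    return unlink(base_path);    /* the name disappears for applications */
}

/* Step 2: after the server returns, clean up the shards, then the marker. */
static int shard_unlink_step2(const char *gfid, int shard_count)
{
    char path[4096];

    for (int i = 1; i <= shard_count; i++) {
        snprintf(path, sizeof(path), ".shard/%s.%d", gfid, i);
        unlink(path);            /* best effort per shard */
    }

    snprintf(path, sizeof(path), ".shard/remove_me/%s", gfid);
    return unlink(path);
}

int main(void)
{
    /* Hypothetical gfid and file name, purely for illustration. */
    const char *gfid = "f1d2d2f9-24f1-4c1a-9cfa-0123456789ab";

    shard_unlink_step1(gfid, "file-to-remove");
    shard_unlink_step2(gfid, 5);
    return 0;
}
```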

In step 2, the background deletion process does a nameless lookup, using the marker file name (the marker file is named after the gfid of the base file), in the .glusterfs/unlink dir. If the nameless lookup succeeds, the gfid still has open fds and deletion of the shards has to be delayed. If the nameless lookup fails, the gfid has been unlinked and there are no open fds on that file (the gfid path is unlinked during the final close on the file). Shards whose deletion was delayed are unlinked once all open fds are closed; this is done by a thread that wakes up every 10 minutes.
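
A rough C sketch of that decision and of the periodic retry follows. It is not the patch itself: the real code does a gfid-based nameless lookup through gluster's FOP interface, whereas here a stat() on the .glusterfs/unlink entry stands in for it, and delete_shards() is a hypothetical helper.

```c
/*
 * Sketch of the deferred-deletion check and the 10-minute reaper thread.
 * Not the actual patch: stat() stands in for the gfid-based nameless
 * lookup, and delete_shards() is a hypothetical helper.
 * Build with: cc -pthread sketch.c
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* True while the gfid entry still exists under .glusterfs/unlink, i.e.
 * some client still holds an open fd on the (unlinked) base file. */
static bool gfid_still_open(const char *gfid)
{
    char path[4096];
    struct stat st;

    snprintf(path, sizeof(path), ".glusterfs/unlink/%s", gfid);
    return stat(path, &st) == 0;
}

/* Background worker: retry the deferred shard deletion every 10 minutes. */
static void *deferred_unlink_worker(void *arg)
{
    const char *gfid = arg;               /* one gfid for simplicity */

    for (;;) {
        if (!gfid_still_open(gfid)) {
            /* The last fd was closed and the gfid path is gone, so it is
             * now safe to remove the remaining shards and the marker. */
            printf("deleting deferred shards for %s\n", gfid);
            /* delete_shards(gfid);  -- hypothetical helper */
            break;
        }
        sleep(10 * 60);                   /* wake up every 10 minutes */
    }
    return NULL;
}

int main(void)
{
    pthread_t reaper;
    static char gfid[] = "f1d2d2f9-24f1-4c1a-9cfa-0123456789ab"; /* example */

    pthread_create(&reaper, NULL, deferred_unlink_worker, gfid);
    pthread_join(reaper, NULL);
    return 0;
}
```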

Also removed active_fd_count from the inode structure and now use fd_count wherever active_fd_count was used before.

Fixes: #1358
Change-Id: I8985093386e26215e0b0dce294c534a66f6ca11c
Signed-off-by: Vinayakswami Hariharmath <vharihar@redhat.com>

gluster-ant commented 3 years ago

Can one of the admins verify this patch?

gluster-ant commented 3 years ago

Can one of the admins verify this patch?

vh05 commented 3 years ago

/run regression

vh05 commented 3 years ago

/run regression

vh05 commented 3 years ago

/run regression

vh05 commented 3 years ago

/run regression

vh05 commented 3 years ago

/run regression

vh05 commented 3 years ago

/run regression

vh05 commented 3 years ago

/run regression

vh05 commented 3 years ago

/run regression

vh05 commented 3 years ago

/run regression

vh05 commented 3 years ago

/run regression

vh05 commented 3 years ago

/run regression

vh05 commented 3 years ago

/run regression

vh05 commented 3 years ago

/run regression

vh05 commented 3 years ago

/recheck smoke

vh05 commented 3 years ago

/run regression

vh05 commented 3 years ago

/recheck smoke

vh05 commented 3 years ago

/run regression

vh05 commented 3 years ago

The regression test suite is failing on this Jenkins run: https://build.gluster.org/job/gh_centos7-regression/691/console

Failed tests:

The test below is failing continuously:
./tests/00-geo-rep/georep-basic-tarssh-ec.t

These 2 tests are failing intermittently:
./tests/bugs/glusterd/brick-mux.t
./tests/basic/afr/afr-no-fsync.t

But these tests pass on my local setup.

# prove -rfv tests/00-geo-rep/georep-basic-tarssh-ec.t
..
All tests successful.
Files=1, Tests=81, 139 wallclock secs ( 0.05 usr  0.02 sys +  4.63 cusr  1.81 csys =  6.51 CPU)
Result: PASS

# prove -rfv ./tests/bugs/glusterd/brick-mux.t
..
ok  34 [   5038/    423] <  79> '2 online_brick_count'
ok
All tests successful.
Files=1, Tests=34, 35 wallclock secs ( 0.03 usr  0.01 sys +  4.25 cusr  1.37 csys =  5.66 CPU)
Result: PASS

# prove -rfv tests/basic/afr/afr-no-fsync.t
..
ok  10 [    696/    177] <  18> '! gluster --mode=script --wignore volume profile patchy info incremental | grep FSYNC'
ok
All tests successful.
Files=1, Tests=10, 18 wallclock secs ( 0.02 usr  0.00 sys +  0.59 cusr  0.67 csys =  1.28 CPU)
Result: PASS

None of these failing tests are related to my patch changes.

xhernandez commented 3 years ago

/run regression

xhernandez commented 3 years ago

/run regression

xhernandez commented 3 years ago

tests/features/trash.t seems to be crashing on the unlink path, so it's probably related to this patch. Can you check it?

vh05 commented 3 years ago

/run regression

vh05 commented 3 years ago

/recheck smoke

xhernandez commented 3 years ago

/run regression

vh05 commented 3 years ago

/recheck smoke

vh05 commented 3 years ago

/run regression

vh05 commented 3 years ago

/run regression

vh05 commented 3 years ago

/run regression

vh05 commented 3 years ago

/recheck smoke

vh05 commented 3 years ago

/run regression

vh05 commented 3 years ago

/run regression

vh05 commented 3 years ago

/run regression

vh05 commented 3 years ago

/run regression

vh05 commented 3 years ago

/run regression

deepshikhaaa commented 3 years ago

/run regression

deepshikhaaa commented 3 years ago

/run regression

deepshikhaaa commented 3 years ago

/run regression

deepshikhaaa commented 3 years ago

/run regression

vh05 commented 3 years ago

/run regression

xhernandez commented 3 years ago

I'll wait till tomorrow before merging the patch to allow others to complete their review.