gluster / glusterfs

Gluster Filesystem : Build your distributed storage in minutes
https://www.gluster.org
GNU General Public License v2.0

Assertion failed: !(*take_lock) will cause stale inodelk and block subsequent IO #4425

Open chen1195585098 opened 1 week ago

chen1195585098 commented 1 week ago

Description of problem: Same as #3065.

Many `Assertion failed: !(*take_lock)` messages appear in succession in our test and production environments.

After preliminary investigation, there appears to be a race between `__afr_eager_lock_handle` and `afr_wakeup_same_fd_delayed_op`. In the lifecycle of an eager lock, when transactions complete the fop phase they are moved to the `inode->post_op` list, which is expected to be woken up when the timer expires or when the last transaction on the inode finishes. However, such a wakeup is also triggered every time a FLUSH is issued. When the first fop finishes, `afr_flush` is called and it wakes up the delayed post-op on the same inode via `afr_wakeup_same_fd_delayed_op`. Once this operation succeeds, `lock->delay_timer = NULL` is set, which means all subsequent transactions should wait and be put into the next lifecycle.

When a new write transaction on the same inode arrives at the same time, however, it is incorrectly added to the owners list, because `lock->release == false`, `lock->delay_timer == NULL`, and `list_empty(&lock->owners) == true`; it therefore issues an extra lock request in the current lifecycle, which leaves a stale inodelk.

```
[2024-10-21 06:52:46.969255] I [afr-transaction.c:2776:__afr_eager_lock_handle] 0-LOCK_TEST: TID=4209133248, ctx = 0x7f04f400d690, lock = 0x7f04f400d6e0, owners = 1, waitings = 0, frozens = 0, reason = lock->release = FALSE, lock->delay_timer reset to NLL, add to owners, take_lock = 0, do_pre_op = 1, lock->acquried = 1
[2024-10-21 06:52:47.694238] I [afr-common.c:4199:afr_wakeup_same_fd_delayed_op] 0-LOCK_TEST: TID=4217525952, set lock->delay_timer = NULL  # retrieve post_op here
[2024-10-21 06:52:47.696090] E [afr-transaction.c:2773:__afr_eager_lock_handle] (-->/usr/local/lib/glusterfs/8.2/xlator/cluster/replicate.so(+0x39dab) [0x7f04fa53fdab] -->/usr/local/lib/glusterfs/8.2/xlator/cluster/replicate.so(+0x3984c) [0x7f04fa53f84c] -->/usr/local/lib/glusterfs/8.2/xlator/cluster/replicate.so(+0x39796) [0x7f04fa53f796] ) 0-: Assertion failed: !(*take_lock)
[2024-10-21 06:52:47.696106] I [afr-transaction.c:2776:__afr_eager_lock_handle] 0-LOCK_TEST: TID=4187313856, ctx = 0x7f04f400d690, lock = 0x7f04f400d6e0, owners = 1, waitings = 0, frozens = 0, reason = , take_lock = 1, do_pre_op = 0, lock->acquried = 1
```

From the code, I guess both `lock->release` and `lock->delay_timer` are flags indicating whether a new transaction may be added in the current lifecycle, because the two conditions normally hold at the same time (e.g. `lock->release = _gf_true` and `lock->delay_timer = NULL` are both set in `afr_delayed_changelog_wake_up_cbk` when the current transaction is the last owner of the lock).

After verification, it turns out that setting `lock->release = _gf_true` in `afr_wakeup_same_fd_delayed_op` handles this problem. It is the simplest way to avoid stale inodelks, although the root cause of the race remains unresolved.
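A minimal sketch of that mitigation follows. The hunk is hypothetical: the surrounding lines in `afr_wakeup_same_fd_delayed_op` are reconstructed from the log at afr-common.c:4199 and may not match the real code exactly; only the added line is the point.

```diff
--- a/xlators/cluster/afr/src/afr-common.c
+++ b/xlators/cluster/afr/src/afr-common.c
@@ ... @@ afr_wakeup_same_fd_delayed_op
         lock->delay_timer = NULL;
+        /* also mark the lock for release, so that a concurrent
+         * __afr_eager_lock_handle queues new transactions on
+         * lock->frozen instead of taking a second inodelk */
+        lock->release = _gf_true;
```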

In fact, the root cause of this issue is that `afr_wakeup_same_fd_delayed_op` is triggered too often. Ideally, it should be triggered only once, by the last transaction on the inode in an eager-lock lifecycle, as described in #418. If so, this race would be gone. However, it is hard to determine whether a transaction is the last one without introducing more complexity.

The exact command to reproduce the issue:

1. Modify the glusterfs code to intensify the race:

```diff
diff --git a/xlators/cluster/afr/src/afr-common.c b/xlators/cluster/afr/src/afr-common.c
index 23505b7..0deb933 100644
--- a/xlators/cluster/afr/src/afr-common.c
+++ b/xlators/cluster/afr/src/afr-common.c
@@ -4225,6 +4225,7 @@ afr_delayed_changelog_wake_resume(xlator_t *this, inode_t *inode,
     }
     UNLOCK(&inode->lock);
 
+    usleep(((random() % 20) * 40 + 200) * 1000); // randomly sleep
     if (data_local) {
         data_local->transaction.resume_stub = stub;
     } else if (metadata_local) {
diff --git a/xlators/cluster/afr/src/afr-transaction.c b/xlators/cluster/afr/src/afr-transaction.c
index c29083a..7fa2427 100644
--- a/xlators/cluster/afr/src/afr-transaction.c
+++ b/xlators/cluster/afr/src/afr-transaction.c
@@ -2711,6 +2711,7 @@ __afr_eager_lock_handle(afr_local_t *local, gf_boolean_t *take_lock,
         }
     }
 
+    usleep(((random() % 20) * 40 + 200) * 1000); // randomly sleep
    if (lock->release) {
        list_add_tail(&local->transaction.wait_list, &lock->frozen);
        *take_lock = _gf_false;
```

2. Create and start a normal replicated volume `test`, then mount it (e.g. `mount -t glusterfs localhost:test /mnt`).
3. Write the same file simultaneously from 2 processes:

```python
#!/usr/bin/python
# Reproducer: several writer processes rewrite the same file in a loop,
# so eager-lock transactions on one inode constantly overlap.

import sys
import random
import time
import multiprocessing

def write_files(file_name):
    while True:
        with open(file_name, "w") as f:
            random.seed(time.time())
            result = random.random()
            f.write(str(result) * 1024)

def main(argv):
    proc_num = 2                    # two concurrent writers are enough
    file_name = '/mnt/testfile'     # file on the mounted gluster volume

    for num in range(proc_num):
        proc = multiprocessing.Process(target=write_files, args=(file_name,))
        proc.start()

if __name__ == "__main__":
    main(sys.argv[1:])
    while True:                     # keep the parent process alive
        time.sleep(1)
```

After `Assertion failed: !(*take_lock)` is printed in the mount log, a stale inodelk occurs; it can be confirmed with a statedump.

**The full output of the command that failed:** See the log excerpt above.

**Expected results:** No assertion failures or stale inodelks when the same file is written simultaneously.

**- Is there any crash? Provide the backtrace and coredump:** The assertion failure will cause a crash in DEBUG mode.

**Additional info:**

**- The operating system / glusterfs version:** glusterfs 8.2 (inferred from the xlator paths in the log above).

chen1195585098 commented 1 week ago

Hi @pranithk, can we simply set `lock->release = _gf_true` to avoid this issue? When this issue is triggered, inodelk requests from other mountpoints are blocked, which greatly affects the availability of a replicated volume.

Could such a fix have worse side effects? I noticed you suggested this method in #3065, but no further changes were made.