gluster / glusterfs

Gluster Filesystem : Build your distributed storage in minutes
https://www.gluster.org
GNU General Public License v2.0

glusterfs stale inode number shown in ls -li command #1411

Closed cynthia277 closed 3 years ago

cynthia277 commented 3 years ago

Description of problem: With the following script (test.sh) running simultaneously on multiple glusterfs clients, some files show a different glusterfs inode number on different clients.

The exact command to reproduce the issue:

The full output of the command that failed:

# cat test.sh
#!/bin/bash
while (true)
do
  hostname=`hostname`
  basedir="/mnt/test/testdir/"
  sname=$basedir$hostname$RANDOM
  tname=$basedir$hostname
  touch $sname;echo "test message on $hostname">$sname
  mv $sname $tname
  str1=`ls -li $tname|awk '{gsub(/^\s+|\s+$/, "");print}'|cut -d " " -f1`
  #echo $str1
  str2=`ls -li $basedir/*|grep $hostname|awk '{gsub(/^\s+|\s+$/, "");print}'|cut -d " " -f1`
  #echo $str2
  if [[ $str1 != $str2 ]]; then
    echo "error happen"
    echo $sname "inode number" $str1
    echo $tname "inode number" $str2
    exit
  fi
  sleep 0.01
done

Expected results:

Additional info:

- The output of the gluster volume info command:

[root@mn-1:/home/robot]
# ./test.sh
ls: cannot access '/mnt/test/testdir//mn-05717': No such file or directory
ls: cannot access '/mnt/test/testdir//mn-020082': No such file or directory
ls: cannot access '/mnt/test/testdir//mn-08854': No such file or directory
ls: cannot access '/mnt/test/testdir//mn-0': No such file or directory
ls: cannot access '/mnt/test/testdir//ei-02720': No such file or directory
ls: cannot access '/mnt/test/testdir//ei-0': No such file or directory
error happen
/mnt/test/testdir/mn-15091 inode number 11936439675212080882
/mnt/test/testdir/mn-1 inode number 9937470456549283772
[root@mn-1:/home/robot]
# ls -li /mnt/test/testdir/*
12989798665001610988 -rw-r--r-- 1 root root 21 Aug 4 13:46 /mnt/test/testdir/ei-0
10371637994493467539 -rw-r--r-- 1 root root 21 Aug 4 13:46 /mnt/test/testdir/mn-0
9937470456549283772 -rw-r--r-- 1 root root 21 Aug 4 13:46 /mnt/test/testdir/mn-1
[root@mn-1:/home/robot]
# ls -li /mnt/test/testdir/mn-1
11936439675212080882 -rw-r--r-- 1 root root 21 Aug 4 13:46 /mnt/test/testdir/mn-1
[root@mn-1:/home/robot]

- The operating system / glusterfs version:

cynthia277 commented 3 years ago

[root@mn-1:/home/robot]
# gluster v info test

Volume Name: test
Type: Replicate
Volume ID: 6ae6b059-61dc-43f7-975c-1cfe5bc2ba46
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: mn-0.local:/mnt/bricks/test/brick
Brick2: mn-1.local:/mnt/bricks/test/brick
Brick3: dbm-0.local:/mnt/bricks/test/brick
Options Reconfigured:
features.cache-invalidation-timeout: 1
features.cache-invalidation: on
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off
cluster.server-quorum-ratio: 51%
[root@mn-1:/home/robot]

cynthia277 commented 3 years ago

Script used to reproduce this issue:

while (true)
do
  hostname=`hostname`
  basedir="/mnt/test/testdir/"
  sname=$basedir$hostname$RANDOM
  tname=$basedir$hostname
  touch $sname;echo "test message on $hostname">$sname
  mv $sname $tname
  str1=`ls -li $tname|awk '{gsub(/^\s+|\s+$/, "");print}'|cut -d " " -f1`
  #echo $str1
  str2=`ls -li $basedir/*|grep $hostname|awk '{gsub(/^\s+|\s+$/, "");print}'|cut -d " " -f1`
  #echo $str2
  if [[ $str1 != $str2 ]]; then
    echo "error happen"
    echo $sname "inode number" $str1
    echo $tname "inode number" $str2
    exit
  fi
  sleep 0.01
done

run this script on multiple glusterfs clients.

itisravi commented 3 years ago

@cynthia277

Looking at the script, it looks like each client (assuming they are on different machines) creates and renames its own file, so I wonder if multiple clients is a problem here. It looks like the inode number got from stat and readdir on the same client is sometimes different. Would that be correct? Or are all clients mounted on the same machine and creating+renaming the same file name?

itisravi commented 3 years ago

It looks like if the clients are on the same machine, the source is different ($RANDOM) but the destination file remains the same.

cynthia277 commented 3 years ago

This is tested on glusterfs 7; when I do this test, I am using different glusterfs clients on different client nodes.

cynthia277 commented 3 years ago

Does glusterfs support a scenario where each glusterfs client (e.g. client-x, each on a different VM) touches a new file and then mv's the file to client-x (the file name) repeatedly within the same dir? This is our use case, and occasionally this simultaneous operation causes such a stale inode number issue. I do not find any error logs on the server side, and the mv operation completes. In the normal scenario, when the mv finishes, the stale (old) inode entry is not found in the statedump. I am wondering: is it possible that after the mv, some other fop (e.g. lookup) with the old gfid arrived at the server side, and the server added this stale gfid to the inode table again?
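For reference, a minimal sketch of checking the brick-side inode table for such a stale entry via statedump; the volume name "test" follows the volume info above, <old-gfid> is a placeholder for the gfid reported as stale, and the dump directory assumes the default server.statedump-path of /var/run/gluster.

    # take an inode statedump of the brick processes for the volume
    gluster volume statedump test inode

    # statedump files land under /var/run/gluster by default;
    # grep them for the gfid the client still reports (placeholder below)
    grep -A 5 "<old-gfid>" /var/run/gluster/*.dump.*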

itisravi commented 3 years ago

that after the mv , some other fop(e.g lookup) with the old gfid arrived at the server side

Ideally, creating multiple files and renaming them to the same file should not cause issues (unless there is a bug, which you are seeing). Since you say each client is on a different machine, the sname and tname will be separate on each client. E.g.:

Client1 does: while true (sname-c1-$RANDOM renamed to sname-c1)
Client2 does: while true (sname-c2-$RANDOM renamed to sname-c2)

So I think multiple clients is not a problem here. The issue seems to be inconsistency between getting the inode number from stat and readdir when renaming in a while() loop from the same client. I will see if I can recreate the issue.

cynthia277 commented 3 years ago

I could also reproduce this issue with a distributed volume:

[root@mn-0:/root]
# ls
27211  test.sh
[root@mn-0:/root]
# ./test.sh
error happen
/mnt/test/testdir/mn-022715 inode number 12379940563250427282
/mnt/test/testdir/mn-0 inode number 13062365457136091329
[root@mn-0:/root]
^C
[root@mn-0:/root]
# gluster v status test2
Status of volume: test2
Gluster process                            TCP Port  RDMA Port  Online  Pid
Brick mn-0.local:/mnt/bricks/test/brick    53958     0          Y       29307

Task Status of Volume test2
There are no active volume tasks

[root@mn-0:/root]
# gluster v info test2
Volume Name: test2
Type: Distribute
Volume ID: 16821b10-2b6b-47bb-ae05-d6ff4de7f126
Status: Started
Snapshot Count: 0
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: mn-0.local:/mnt/bricks/test/brick
Options Reconfigured:
transport.address-family: inet
nfs.disable: on
cluster.server-quorum-ratio: 51%
[root@mn-0:/root]
# ls -li /mnt/test/testdir/
total 2
13768009798257677930 -rw-r--r-- 1 root root 21 Aug 4 16:49 ei-0
13062365457136091329 -rw-r--r-- 1 root root 21 Aug 4 16:50 mn-0
11542463644429274966 -rw-r--r-- 1 root root 21 Aug 4 16:50 mn-1
11308125171974775005 -rw-r--r-- 1 root root 0 Aug 4 16:50 mn-115777
[root@mn-0:/root]
# ls -li /mnt/test/testdir/mn-0
12379940563250427282 -rw-r--r-- 1 root root 21 Aug 4 16:50 /mnt/test/testdir/mn-0
[root@mn-0:/root]
# df /mnt/test
Filesystem         1K-blocks  Used Available Use% Mounted on
mn-0.local:/test2    1014656 41344    802176   5% /mnt/test
[root@mn-0:/root] #

itisravi commented 3 years ago

@cynthia277 If you can, please also try disabling the performance translators one by one and mounting with --attribute-timeout=0 --entry-timeout=0 to see if you can isolate anything.
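If it helps, a rough sketch of what that could look like; the volume and mount names are taken from the output above, and the particular set of performance translators listed is only illustrative, not an exhaustive or prescribed order.

    # disable performance translators one at a time, retesting after each change
    gluster volume set test performance.write-behind off
    gluster volume set test performance.quick-read off
    gluster volume set test performance.io-cache off
    gluster volume set test performance.stat-prefetch off

    # remount the client with kernel attribute/entry caching disabled
    umount /mnt/test
    mount -t glusterfs -o attribute-timeout=0,entry-timeout=0 mn-0.local:/test /mnt/test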

cynthia277 commented 3 years ago

By the way, on the replicated volume, when I restart the client node this issue still exists, but when I isolate the brick node (the one with the stale gfid in its statedump) with iptables, this issue disappears and all clients go back to normal, so this should be a server-side error.

cynthia277 commented 3 years ago

I've tested mounting with --attribute-timeout=0 --entry-timeout=0; it is not helpful for this issue, the problem still exists.

cynthia277 commented 3 years ago

@itisravi I added some debug tracing to check this issue today. I think this issue is caused by two fops, readdirp and rename, being ongoing at the same time. In the normal case, when the mv finishes, the old-name inode (mv old-name new-name) should be deleted by __inode_retire and then destroyed by purging. In the abnormal case, because the readdirp fop races with the rename fop, it increases the nlookup count of the inode so that the inode does not go through __inode_retire; instead it goes through __inode_passivate and is put onto the lru list. Please refer to the mnt-bricks-test-brick.log (distributed volume test) I attached.

normal.log abnormal case.log
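For anyone trying to capture the same traces, a sketch of the logging setup presumably involved; raising the brick log level to TRACE is an assumption about how the dentry_destroy/inode_lookup messages below were made visible, and the volume name is again "test".

    # raise brick-side logging to TRACE so inode.c dentry/nlookup messages show up
    gluster volume set test diagnostics.brick-log-level TRACE
    # reproduce the rename/readdirp race, then restore the default level
    gluster volume set test diagnostics.brick-log-level INFO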

cynthia277 commented 3 years ago

In the abnormal case log:

RENAME_CBK mn-119463 ==> mn-1, old mn-1 gfid is: c8c976cd-6cfd-449b-8b99-ccab0e026b4e

after

[2020-08-05 09:08:53.625275] T [MSGID: 101183] [inode.c:192:dentry_destroy] 0-test-server: destroying dentry mn-1

and

[2020-08-05 09:08:53.625359] T [inode.c:1122:inode_forget] (-->/usr/lib64/glusterfs/7.0/xlator/protocol/server.so(+0x452a1) [0x7fa2798bd2a1] -->/usr/lib64/glusterfs/7.0/xlator/protocol/server.so(+0x34108) [0x7fa2798ac108] -->/lib64/libglusterfs.so.0(inode_forget+0x65) [0x7fa27f22ff65] ) 0-test-server: forget inode gfid=c8c976cd-6cfd-449b-8b99-ccab0e026b4e,nlookup=0

then

[2020-08-05 09:08:53.625565] T [MSGID: 101183] [inode.c:192:dentry_destroy] 0-test-server: destroying dentry mn-119463

However, before __inode_retire, readdirp_cbk was triggered:

[2020-08-05 09:08:53.625844] T [inode.c:1083:inode_lookup] (-->/lib64/libglusterfs.so.0(gf_link_inodes_from_dirent+0x2e) [0x7fa27f2497be] -->/lib64/libglusterfs.so.0(+0x53755) [0x7fa27f249755] -->/lib64/libglusterfs.so.0(inode_lookup+0x51) [0x7fa27f22fe11] ) 0-test-server: inode_lookup: nlookup=1

which increases the nlookup, so __inode_passivate is executed instead of __inode_retire.

cynthia277 commented 3 years ago

@itisravi I find that the readdirp and lookup fops (and maybe some others) have a chance to execute after server4_post_rename->forget_inode_if_no_dentry and before the final inode_unref (ref = 0). That makes nlookup non-zero and the inode gets linked back to the parent inode. I cannot find any sync mechanism between those fops. I think that after a rename we first need to make sure the old inode is destroyed before any other fop links this old inode back to the parent inode. I'd like to hear your opinion on this issue.

itisravi commented 3 years ago

@cynthia277 I haven't had a chance to look into this in detail yet. Your test case does lookup and readdir after the mv command completes successfully. That means only after server4_rename_cbk() does server4_post_rename (which does the inode related stuff) and server_submit_reply reaches the client, the client can send the lookup/readdir. So I need to dig further(probably next week) to see if there is a race.

cynthia277 commented 3 years ago

OK, looking forward to your reply!

cynthia277 commented 3 years ago

@itisravi Another finding: when I remove gf_link_inodes_from_dirent from server4_readdirp_cbk, this issue disappears!

cynthia277 commented 3 years ago

I find that gf_link_inodes_from_dirent was added in https://review.gluster.org/#/c/glusterfs/+/7700/ but I have no clue why it was added.

cynthia277 commented 3 years ago

@itisravi How about removing gf_link_inodes_from_dirent from server4_readdirp_cbk? Will that cause some other issue?

itisravi commented 3 years ago

@cynthia277 It seems to have been added for the User Serviceable Snapshots patch, like you pointed out.

-        /* TODO: need more clear thoughts before calling this function. */
-        /* gf_link_inodes_from_dirent (this, state->fd->inode, entries); */
+        gf_link_inodes_from_dirent (this, state->fd->inode, entries);

@raghavendrabhat, any pointers on why the inode linking was uncommented in the USS patch? The function comment seems to indicate that it's not error free:

/* TODO: Currently, with this function, we will be breaking the
   policy of 1-1 mapping of kernel nlookup refs with our inode_t's
   nlookup count.
   Need more thoughts before finalizing this function
*/
cynthia277 commented 3 years ago

@itisravi @raghavendrabhat I ran some local tests with "gf_link_inodes_from_dirent (this, state->fd->inode, entries);" removed; this issue is not reproduced, and I have not found any other bad impact.

cynthia277 commented 3 years ago

I think there is still a chance that, for a rename operation, some other ongoing fop links the inode back before it is really deleted, like the lookup fop: in server4_post_lookup the inode will be linked back to the inode table.

raghavendrabhat commented 3 years ago

@cynthia277 It seems to have been added for the User Serviceable Snapshots patch, like you pointed out.

-        /* TODO: need more clear thoughts before calling this function. */
-        /* gf_link_inodes_from_dirent (this, state->fd->inode, entries); */
+        gf_link_inodes_from_dirent (this, state->fd->inode, entries);

@raghavendrabhat, any pointers on why the inode linking was uncommented in the USS patch? The function comment seems to indicate that it's not error free:

/* TODO: Currently, with this function, we will be breaking the
   policy of 1-1 mapping of kernel nlookup refs with our inode_t's
   nlookup count.
   Need more thoughts before finalizing this function
*/

One of the reasons that I can think of (the change was made 6 years ago) is to handle the way the snapview-server daemon manages inodes. Since snapview-server is the one which actually interacts with snapshots, to avoid gfid conflicts of the same file from 2 different snapshots, snapview-server generates a new gfid for each item it fetches from the snapshot. So, if inode linking is not done, then the entity communicating with the snapview-server daemon (fuse client, gNFS server, gfapi from samba etc.) would get errors when it tries to send fops on some of the inodes it linked in its inode table in the future, because snapview-server would fail to recognize that gfid and thus would send an error. IIRC this was mainly done to avoid errors with gNFS usage.

cynthia277 commented 3 years ago

@raghavendrabhat I checked the glusterfs log; it seems that sometimes, when readdirp and rename are ongoing at the same time, readdirp may re-link the inode which is supposed to be put on the purge list and then deleted. When I comment out that line, this issue seldom appears anymore. Any idea how to avoid this issue?

cynthia277 commented 3 years ago

@itisravi @raghavendrabhat I currently only comment out that line, because this issue occasionally appears in our product, especially when there are many glusterfs clients (30-50), but I think this is only a workaround. Is there any better solution?

pranithk commented 3 years ago

@cynthia277 Changed the script so that I can run it on a single machine. So I launch multiple scripts on the same mount/volume with this script. I am seeing the error. Interesting thing is, I am able to see the error even after commenting out the readdirp inode-linking code. Still debugging. Will let you know my findings. Not sure if the issue is with the script.

17:59:04 :) ⚡ cat issue-1411.sh

myuniq=$(uuidgen)
while true
do
    basedir="/mnt/r3/"
    sname="$basedir$myuniq-uniq-$(uuidgen)"
    tname=$basedir$myuniq
    touch $sname;echo "test message on $myuniq">$sname
    mv $sname $tname
    str1=$(ls -li $tname | awk '{gsub(/^\s+|\s+$/, "");print}' | cut -d " " -f1)
    #echo $str1
    str2=$(ls -li $basedir/*|grep $myuniq|awk '{gsub(/^\s+|\s+$/, "");print}'|cut -d " " -f1)
    #echo $str2
    if [[ $str1 != $str2 ]]; then
        echo "error happen" >>"error-$myuniq"
        echo $sname "inode number" $str1 >>"error-$myuniq"
        echo $tname "inode number" $str2 >>"error-$myuniq"
        exit
    fi
    sleep 0.01
done
pranithk commented 3 years ago

@cynthia277 I meant I am not sure if I changed the script you gave in such a way that the error is in my version of the script. Let me know if you see any bug in my version of your script.

pranithk commented 3 years ago

It looks like just after the rename 'ls -li' is giving the old inode number, but by the time the second ls -li $basedir/* is executed, the inode number is fine. I will need to debug it using the fuse-dump tool developed by Csaba (see the sketch after the logs below). Will update with what I find.

root@localhost - /mnt/r3
17:42:18 :) ⚡ cat error-95902966-a411-42b9-b6c6-85e6d9a3fc94
error happen
/mnt/r3/95902966-a411-42b9-b6c6-85e6d9a3fc94-uniq-f13d0dcf-0f23-4bc4-b569-a70e46bdc0a8 inode number 13276187048770708019
/mnt/r3/95902966-a411-42b9-b6c6-85e6d9a3fc94 inode number 13097461331628363336

root@localhost - /var/log/glusterfs
17:42:37 :) ⚡ grep uniq-f13d0dcf mnt-r3.log
[2020-09-09 12:12:18.584890 +0000] I [fuse-bridge.c:2690:fuse_create_cbk] 0-glusterfs-fuse: 4375: CREATE() /95902966-a411-42b9-b6c6-85e6d9a3fc94-uniq-f13d0dcf-0f23-4bc4-b569-a70e46bdc0a8 => 0x7f67f00185c0 (ino=13097461331628363336)
[2020-09-09 12:12:18.604351 +0000] I [fuse-bridge.c:1347:fuse_attr_cbk] 0-glusterfs-fuse: 4397: FSTAT() /95902966-a411-42b9-b6c6-85e6d9a3fc94-uniq-f13d0dcf-0f23-4bc4-b569-a70e46bdc0a8 => 13097461331628363336
[2020-09-09 12:12:18.610287 +0000] I [fuse-bridge.c:1347:fuse_attr_cbk] 0-glusterfs-fuse: 4405: STAT() /95902966-a411-42b9-b6c6-85e6d9a3fc94-uniq-f13d0dcf-0f23-4bc4-b569-a70e46bdc0a8 => 13097461331628363336
[2020-09-09 12:12:18.613416 +0000] I [fuse-bridge.c:2494:fuse_rename_cbk] 0-glusterfs-fuse: 4407: /95902966-a411-42b9-b6c6-85e6d9a3fc94-uniq-f13d0dcf-0f23-4bc4-b569-a70e46bdc0a8 -> /95902966-a411-42b9-b6c6-85e6d9a3fc94 => 0 (buf->ia_ino=13097461331628363336)

root@localhost - /var/log/glusterfs
17:42:54 :) ⚡ egrep "(13276187048770708019|13097461331628363336)" mnt-r3.log
[2020-09-09 12:12:18.529420 +0000] I [fuse-bridge.c:2690:fuse_create_cbk] 0-glusterfs-fuse: 4270: CREATE() /95902966-a411-42b9-b6c6-85e6d9a3fc94-uniq-aceb188c-2676-4f38-b38e-f93347727839 => 0x7f67f001d640 (ino=13276187048770708019)
[2020-09-09 12:12:18.537299 +0000] I [fuse-bridge.c:1347:fuse_attr_cbk] 0-glusterfs-fuse: 4292: FSTAT() /95902966-a411-42b9-b6c6-85e6d9a3fc94-uniq-aceb188c-2676-4f38-b38e-f93347727839 => 13276187048770708019
[2020-09-09 12:12:18.540216 +0000] I [fuse-bridge.c:1347:fuse_attr_cbk] 0-glusterfs-fuse: 4297: STAT() /95902966-a411-42b9-b6c6-85e6d9a3fc94-uniq-aceb188c-2676-4f38-b38e-f93347727839 => 13276187048770708019
[2020-09-09 12:12:18.541540 +0000] I [fuse-bridge.c:2494:fuse_rename_cbk] 0-glusterfs-fuse: 4301: /95902966-a411-42b9-b6c6-85e6d9a3fc94-uniq-aceb188c-2676-4f38-b38e-f93347727839 -> /95902966-a411-42b9-b6c6-85e6d9a3fc94 => 0 (buf->ia_ino=13276187048770708019)
[2020-09-09 12:12:18.545028 +0000] I [fuse-bridge.c:1347:fuse_attr_cbk] 0-glusterfs-fuse: 4305: STAT() /95902966-a411-42b9-b6c6-85e6d9a3fc94 => 13276187048770708019
[2020-09-09 12:12:18.545280 +0000] I [fuse-bridge.c:1347:fuse_attr_cbk] 0-glusterfs-fuse: 4307: STAT() /95902966-a411-42b9-b6c6-85e6d9a3fc94 => 13276187048770708019
[2020-09-09 12:12:18.584890 +0000] I [fuse-bridge.c:2690:fuse_create_cbk] 0-glusterfs-fuse: 4375: CREATE() /95902966-a411-42b9-b6c6-85e6d9a3fc94-uniq-f13d0dcf-0f23-4bc4-b569-a70e46bdc0a8 => 0x7f67f00185c0 (ino=13097461331628363336)
[2020-09-09 12:12:18.604351 +0000] I [fuse-bridge.c:1347:fuse_attr_cbk] 0-glusterfs-fuse: 4397: FSTAT() /95902966-a411-42b9-b6c6-85e6d9a3fc94-uniq-f13d0dcf-0f23-4bc4-b569-a70e46bdc0a8 => 13097461331628363336
[2020-09-09 12:12:18.610287 +0000] I [fuse-bridge.c:1347:fuse_attr_cbk] 0-glusterfs-fuse: 4405: STAT() /95902966-a411-42b9-b6c6-85e6d9a3fc94-uniq-f13d0dcf-0f23-4bc4-b569-a70e46bdc0a8 => 13097461331628363336
[2020-09-09 12:12:18.613416 +0000] I [fuse-bridge.c:2494:fuse_rename_cbk] 0-glusterfs-fuse: 4407: /95902966-a411-42b9-b6c6-85e6d9a3fc94-uniq-f13d0dcf-0f23-4bc4-b569-a70e46bdc0a8 -> /95902966-a411-42b9-b6c6-85e6d9a3fc94 => 0 (buf->ia_ino=13097461331628363336)
[2020-09-09 12:12:18.619824 +0000] I [fuse-bridge.c:1347:fuse_attr_cbk] 0-glusterfs-fuse: 4419: STAT() /95902966-a411-42b9-b6c6-85e6d9a3fc94 => 13097461331628363336
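For context, a sketch of how such a FUSE traffic dump might be captured; the --dump-fuse option of the glusterfs client and the paths used here are assumptions for illustration, not the exact commands used above.

    # mount the volume with FUSE traffic dumping enabled (option and paths assumed)
    glusterfs --volfile-server=localhost --volfile-id=r3 \
        --dump-fuse=/var/log/glusterfs/r3-fuse.dump /mnt/r3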

pranithk commented 3 years ago

Based on the fuse-dump, I found the issue I was facing to be a readdir-ahead issue. @cynthia277 Do you have this feature turned on by chance? If yes, disable it. I will try to redo the tests with readdir-ahead disabled tomorrow IST.
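For concreteness, a sketch of checking and disabling it; the volume name follows the earlier examples and should be replaced with the actual volume.

    # check the current setting, then turn readdir-ahead off for the volume
    gluster volume get test performance.readdir-ahead
    gluster volume set test performance.readdir-ahead off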

pranithk commented 3 years ago

@cynthia277 I tried recreating the issue by turning off readdir-ahead on release-7. Ran the test for about 10 minutes(It used to error out in the first minute with readdir-ahead on). Didn't see the failure. Could you also confirm with your test?

cynthia277 commented 3 years ago

@pranithk Glad to see your response. In my env we cannot turn off "read-ahead" right now; I commented out "gf_link_inodes_from_dirent (this, state->fd->inode, entries);" in server4_readdirp_cbk as a workaround for this issue, and the issue seems to have disappeared with that workaround.

pranithk commented 3 years ago

@cynthia277 Not read-ahead, readdir-ahead. Do you have a test cluster where you can test this out and provide the results? I don't know if commenting out gf_link_inodes_from_dirent() reduced the race window or whether it is indeed another bug that is getting fixed.

cynthia277 commented 3 years ago

@pranithk I have not tried disabling readdir-ahead, but when mounting with the option -o use-readdirp=no, this issue disappears.
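A sketch of such a mount, assuming a FUSE client of the same volume; the server and mount point names are placeholders.

    # remount the FUSE client with readdirp disabled
    umount /mnt/test
    mount -t glusterfs -o use-readdirp=no mn-0.local:/test /mnt/test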

cynthia277 commented 3 years ago

@pranithk I will try disabling readdir-ahead tomorrow, but I would like to mention that when this issue happens, even after restarting the client node the issue still exists, so I think this is a server-side issue.

pranithk commented 3 years ago

@cynthia277 readdir-ahead doesn't implement readdir(). Maybe you don't see the issue because of that when you set use-readdirp=no. In the worst case there are potentially 2 bugs in this usecase, one on the client side and the other on the server side. You should definitely disable readdir-ahead irrespective of whether it fixes just this issue for you or not considering the issues we are finding (#1472 )

Could you let me know if just disabling readdir-ahead fixes the issue for you? i.e. without commenting (gf_link_inodes_from_dirent()). Otherwise there is not enough info for me to debug the issue further.

cynthia277 commented 3 years ago

@pranithk I've tried disabling readdir-ahead ("performance.readdir-ahead: off") without commenting out gf_link_inodes_from_dirent(), but this issue still exists. I think this issue is caused by a race condition between the readdirp and rename fops on the server side:
1> rename -> inode_unref puts the old "a" inode onto the lru list (suppose the script does touch a; then mv a to b each time)
2> readdirp (server4_readdirp_cbk) is also ongoing, and gf_link_inodes_from_dirent puts the inode (with the old "a" gfid) back into the hash table, so the stale inode stays in the server-side inode table.

pranithk commented 3 years ago

@cynthia277 Thanks for the confirmation. If there are a lookup and a rename ongoing in parallel this may also happen, but the race window is comparatively smaller. Let me see what can be done to fix this. Meanwhile you can resort to commenting out the code.

pranithk commented 3 years ago

@cynthia277 Since I am not able to recreate the issue, will it be possible for you to give me logs that confirm your hypothesis?

@pranithk I've tried disabling readdir-ahead ("performance.readdir-ahead: off") without commenting out gf_link_inodes_from_dirent(), but this issue still exists. I think this issue is caused by a race condition between the readdirp and rename fops on the server side: 1> rename -> inode_unref puts the old "a" inode onto the lru list (suppose the script does touch a; then mv a to b each time) 2> readdirp (server4_readdirp_cbk) is also ongoing, and gf_link_inodes_from_dirent puts the inode (with the old "a" gfid) back into the hash table, so the stale inode stays in the server-side inode table.

There is code in posix_readdirp to pick inode->gfid from the inode table, instead of fetching it from the entry, if the entry is already present in the dentry list. Check the inode_grep() call in posix_readdirp_fill(). That is one more way I can see readdirp filling in a different gfid for an entry.

Will it be possible for you to add logs to prove either of these hypotheses?

Another possibility is to apply the following patch and check whether the issue persists even without commenting out gf_link_inodes_from_dirent:

--- a/xlators/storage/posix/src/posix-inode-fd-ops.c
+++ b/xlators/storage/posix/src/posix-inode-fd-ops.c
@@ -5657,11 +5657,12 @@ posix_readdirp_fill(xlator_t *this, fd_t *fd, gf_dirent_t *entries,

     list_for_each_entry(entry, &entries->list, list)
     {
-        inode = inode_grep(fd->inode->table, fd->inode, entry->d_name);
-        if (inode)
-            gf_uuid_copy(gfid, inode->gfid);
-        else
-            bzero(gfid, 16);
+//        inode = inode_grep(fd->inode->table, fd->inode, entry->d_name);
+//        if (inode)
+//            gf_uuid_copy(gfid, inode->gfid);
+//        else
+//            bzero(gfid, 16);
+        bzero(gfid, 16);

         strcpy(&hpath[len + 1], entry->d_name);
pranithk commented 3 years ago

@cynthia277 Any update on this?

stale[bot] commented 3 years ago

Thank you for your contributions. Noticed that this issue is not having any activity in last ~6 months! We are marking this issue as stale because it has not had recent activity. It will be closed in 2 weeks if no one responds with a comment here.

stale[bot] commented 3 years ago

Closing this issue as there was no update since my last update on issue. If this is an issue which is still valid, feel free to open it.