gluster / glusterfs

Gluster Filesystem : Build your distributed storage in minutes
https://www.gluster.org
GNU General Public License v2.0

afr-lock-heal-basic.t & afr-lock-heal-advanced.t test failure on s390x #1468

Closed. cnnaik closed this issue 3 years ago.

cnnaik commented 4 years ago

Description of problem: The tests ./tests/basic/fencing/afr-lock-heal-basic.t and ./tests/basic/fencing/afr-lock-heal-advanced.t fail on the s390x architecture with GlusterFS v8.0 on Ubuntu 20.04 and RHEL 8.x.

The tests pass on other Ubuntu releases; the only difference observed between the passing and failing distributions is the glibc version in use. On the failing distributions the glibc version is >= 2.28.

The exact commands to reproduce the issue:

./run-tests.sh ./tests/basic/fencing/afr-lock-heal-basic.t
./run-tests.sh ./tests/basic/fencing/afr-lock-heal-advanced.t

The full output of the command that failed: attaching the test output files afr-lock-heal-basic.txt and afr-lock-heal-advanced.txt.

Expected results: Test should pass.

Additional info:
- GlusterFS version: v8.0
- Operating system: Ubuntu 20.04, RHEL 8.x
- Architecture: s390x

Team, could you please provide any pointers on this issue?

itisravi commented 4 years ago

@cnnaik Interesting, I wonder how the glibc version makes a difference to the test. You could put an exit in the .t after test #25 (i.e. line 71) and examine the system:

diff --git a/tests/basic/fencing/afr-lock-heal-basic.t b/tests/basic/fencing/afr-lock-heal-basic.t
index c5d7d6fe8..129252d3d 100644
--- a/tests/basic/fencing/afr-lock-heal-basic.t
+++ b/tests/basic/fencing/afr-lock-heal-basic.t
@@ -69,6 +69,7 @@ TEST sleep 10 #Needed for client to re-open fd? Otherwise client_pre_lk_v2() fai
 b3_sdump=$(generate_brick_statedump $V0 $H0 $B0/${V0}2)
 c1_lock_on_b3="$(egrep "$inode" $b3_sdump -A3| egrep 'ACTIVE.*client-2'| uniq| awk '{print $1,$2,$3,$4,$5,$6,$7,$8}'|tr -d '(,), ,')"
 TEST [ "$c1_lock_on_b1" == "$c1_lock_on_b3" ]
+exit

Maybe the 10 second sleep at line 67 is not enough? If you wait some more time and manually take the statedump of the 3rd brick once again, does the lock get healed?
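For reference, a rough sketch of re-taking the third brick's statedump by hand and repeating the test's check (this assumes the test volume is still named patchy, the default statedump directory /var/run/gluster, and that you set inode to the same gfid string the test greps for):

# trigger a fresh statedump of all bricks of the volume
gluster volume statedump patchy

# pick the newest dump file for the 3rd brick (its backend path contains patchy2)
sdump=$(ls -t /var/run/gluster/*patchy2*.dump.* | head -n 1)

# same check the test performs: is client C1's lock ACTIVE on the 3rd brick?
egrep "$inode" "$sdump" -A3 | egrep 'ACTIVE.*client-2'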

By the way, this lock healing is not generic; it was written for the gluster-block fencing use case (https://github.com/gluster/glusterfs/issues/613), so it has no relevance to normal AFR operations like replication, healing, etc.

cnnaik commented 4 years ago

@itisravi I increased the sleep time to 50 seconds, but the test still failed. I also printed the statedump variable in the test, exited after test 25, and ran the same grep on the statedump that the test uses:

egrep "$inode" /var/run/gluster/d-backends-patchy2.13988.dump.1599483526 -A3| egrep 'ACTIVE.*client-2'

This command returns nothing; the second egrep does not find the ACTIVE entry in the statedump.

itisravi commented 4 years ago

@cnnaik 50 seconds is plenty of time for the heal to complete. Any errors in /var/log/glusterfs/glfs-client.C1.log? Client C1 is the one that does the lock heal when the 3rd brick comes back up.

If you are comfortable with gdb, you can check whether the __afr_lock_heal_synctask() function is hit in the C1 client process when the 3rd brick comes back up. ps aux|grep afr-lock-heal-basic|grep C1 should give you the pid of the process to attach gdb to.

itisravi commented 4 years ago

If you are debugging with gdb, you can exit + CTRL-C the test at line 79. Then attach gdb to client C1, set the breakpoint in the above-mentioned function, and run gluster vol start patchy force to bring the 3rd brick online. The breakpoint should be hit.
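Putting those steps together, a minimal sketch of the debugging flow (the volume name patchy and the breakpoint function are the ones mentioned above; everything else is just illustrative):

# find the pid of the C1 client process started by the test
pid=$(ps aux | grep afr-lock-heal-basic | grep C1 | grep -v grep | awk '{print $2}')

# attach gdb; resolving the symbol needs a build done with --enable-debug
gdb -p "$pid"
(gdb) break __afr_lock_heal_synctask
(gdb) continue

# in a second shell, bring the 3rd brick back online
gluster volume start patchy force
# the breakpoint should now be hit in the gdb session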

cnnaik commented 4 years ago

@itisravi I checked the logs after running the test; the only message I could see before the actual failing test is the one below:

E [MSGID: 114044] [client-handshake.c:757:client_setvolume_cbk] 0-patchy-client-2: SETVOLUME on remote-host failed [{remote-error=Volume-ID different, possible case of same brick re-used in another volume}, {errno=22}, {error=Invalid argument}]

I also attached gdb to the client C1 and ran gluster vol start patchy force in another session, but the breakpoint was not hit; gdb also reports: No symbol table is loaded. Use the "file" command. I executed the test up to line 62, as the test failure is observed after this point.

cnnaik commented 4 years ago

@itisravi I tried gdb with debugging enabled during configure to resolve the symbol issue, but I still get the no-symbol-table message, so the breakpoint is not hit. I also tried running make as below: make CFLAGS="-g -O0" && make install

itisravi commented 4 years ago

The SETVOLUME error is because you exited the .t midway (for debugging) and ran it again without cleaning up the old setup. Before re-running, kill all the gluster and afr-lock-heal-basic processes.
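For example, something along these lines before re-running (a rough sketch; it assumes a disposable test machine with no other gluster volumes you care about):

# kill the test's client programs and any gluster daemons left behind
pkill -f afr-lock-heal-basic
pkill -f glusterfsd
pkill -f glusterfs
pkill -f glusterd
# the test framework recreates its volume and brick directories on the next run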

I usually run ./configure --enable-debug and get the symbols. Even afr-lock-heal-basic.c is compiled with debug flags (line 34 in the .t). Not sure what is different in your setup. The test case passes on my x86_64 Fedora 32 system with glibc-2.31-4.
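For reference, the debug build from the glusterfs source tree looks roughly like this (a sketch; adjust the install prefix and parallelism to your setup):

./autogen.sh
./configure --enable-debug
make -j"$(nproc)"
make install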

cnnaik commented 4 years ago

@itisravi I built glusterfs with ./configure --enable-debug, installed it, and executed the test up to line 62. After that I ran ps aux|grep afr-lock-heal-basic|grep C1 to get the pid of the process, attached gdb, and set the breakpoint with break __afr_lock_heal_synctask, but it gives this message:

No symbol table is loaded. Use the "file" command. Make breakpoint pending on future shared library load? (y or [n]) y

Please let me know if any step is missing.

itisravi commented 4 years ago

Yes, the steps are correct; here is my output (screenshot attached).

cnnaik commented 4 years ago

@itisravi Could you please tell whether you are running the test under gdb with the --args option, or running it as just ./run-tests.sh ./tests/basic/fencing/afr-lock-heal-basic.t and then connecting gdb to the client process as mentioned here?

itisravi commented 4 years ago

@cnnaik I ran the test with prove -vf ./tests/basic/fencing/afr-lock-heal-basic.t and then attached gdb to the client as mentioned above.

cnnaik commented 3 years ago

@itisravi Thanks for the information and pointers. We were able to resolve the issue by building gcc v7.5.0 from source, updating the environment variables, and rebuilding glusterfs.
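For anyone hitting the same thing, the workaround amounts to something like the following (a sketch only; the install prefix /opt/gcc-7.5.0 is just an example for a locally built gcc):

# point the environment at the locally built gcc 7.5.0 (example prefix)
export PATH=/opt/gcc-7.5.0/bin:$PATH
export LD_LIBRARY_PATH=/opt/gcc-7.5.0/lib64:$LD_LIBRARY_PATH
export CC=/opt/gcc-7.5.0/bin/gcc

# rebuild glusterfs with the new compiler
./autogen.sh
./configure --enable-debug
make -j"$(nproc)" && make install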

itisravi commented 3 years ago

Thanks for the update. Closing the issue.