Bockeman closed this issue 4 years ago.
Urgent problem no longer an issue
I repeated the volume stop/start again today and, voila, brick06 on server verijolt came back online.
I don't understand this. My history shows that I did the stop, on this server, yesterday:
history | egrep 'stop|start'
866 2020-10-22 14:22:08 gluster volume stop gluvol0
867 2020-10-22 14:22:39 gluster volume stop gluvol1
868 2020-10-22 14:23:24 gluster volume start gluvol0
869 2020-10-22 14:23:41 gluster volume start gluvol1
So what got fixed? Why did the repeat today change things when yesterday it had no effect?
There's still something wrong
Of course, now that it's less urgent: do we have any idea what caused this one brick to go offline? Is there more information I might be able to supply?
As you mentioned, brick06 had crashed; that's why the gluster CLI was showing the status as N/A. After starting the volume, it is expected that the CLI shows the brick status correctly. Can you please attach the brick core with gdb and share the output of "thread apply all bt full"?
Some more info
Just to see if the immediate problem had been fixed by the volume stop/start, I looked at the brick log again.
The tail of /var/log/glusterfs/bricks/srv-brick06.log
is shown in the details below. Notice the timestamp is after the start (as logged in details of prior comment) and that the log appears to have been truncated.
It seems the brick process has crashed again. Can you attach the core with gdb and share the output of "thread apply all bt full"?
@mohit84 thanks for picking this up.
I understand that the CLI status for brick06 on server verijolt is showing "N" under "online" when the brick has crashed.
At the moment, brick06 on verijolt is showing "Y" despite the truncated brick log showing a potential crash (signal received: 11). However, verijolt is now working flat out (100% on 4 CPUs) attempting to recover, I assume.
The tail of /var/log/glusterfs/glfsheal-gluvol1.log
shows
Is the CLI showing a different pid than the brick pid that crashed?
Can you check the pid status with ps -aef | grep <brick_pid>?
@mohit84 Please could you give me a bit more help in running debugger etc., it's been a while since I last did that.
Can you please attach the brick core with gdb and share the output of "thread apply all bt full"
Could you tell me the commands to use?
ps -aef | grep 53399
root 53399 1 99 12:44 ? 04:53:46 /usr/sbin/glusterfsd -s verijolt --volfile-id gluvol1.verijolt.srv-brick06 -p /var/run/gluster/vols/gluvol1/verijolt-srv-brick06.pid -S /var/run/gluster/e81ffbefdc82
It means the brick is running, and the pid (53399) is the same one shown by the CLI, right?
Before attaching the core with gdb, please install the debug package of glusterfs.
1) gdb /usr/sbin/glusterfsd -c <brick_core>
2) Run thread apply all bt full and share the output.
Please share the dump of /var/log/glusterfs from all the nodes.
I'm not sure what you mean by dump and all the nodes. This directory is large (~11GB) as this du
shows:
export start_date=`date +%F\ %T`; \
du -xsmc /var/log/glusterfs/* \
2>&1 | sort -n | awk '{printf(" %8d\t%s\n",$1,substr($0,index($0,$2)))}'; \
echo " ${start_date}" && date +\ \ %F\ %T
1 /var/log/glusterfs/cli.log
1 /var/log/glusterfs/cmd_history.log
1 /var/log/glusterfs/gfproxy
1 /var/log/glusterfs/gluvol0-rebalance.log
1 /var/log/glusterfs/gluvol0-replace-brick-mount.log
1 /var/log/glusterfs/gluvol1-replace-brick-mount.log
1 /var/log/glusterfs/quota_crawl
1 /var/log/glusterfs/snaps
2 /var/log/glusterfs/glfsheal-gluvol0.log
2 /var/log/glusterfs/glfsheal-gluvol1.log
2 /var/log/glusterfs/srv-gluvol0.log
3 /var/log/glusterfs/glusterd.log
4 /var/log/glusterfs/scrub.log
7 /var/log/glusterfs/quotad.log
36 /var/log/glusterfs/gluvol1-rebalance.log_201004_0005.gz
64 /var/log/glusterfs/srv-gluvol1.log
134 /var/log/glusterfs/glustershd.log
303 /var/log/glusterfs/gluvol1-rebalance.log
409 /var/log/glusterfs/bitd.log_201004_0005.gz
473 /var/log/glusterfs/bricks
9762 /var/log/glusterfs/bitd.log
11195 total
2020-10-23 14:09:43
2020-10-23 14:09:43
For now you can share glusterd.log and the brick logs from all the nodes.
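A possible way to gather just those files on each node for attaching here (a sketch; the filenames are taken from the directory listing above and may need adjusting per node):
# bundle the requested logs on one node into a single archive
tar czf /tmp/$(hostname)-gluster-logs.tar.gz \
    /var/log/glusterfs/glusterd.log \
    /var/log/glusterfs/bricks/*.log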
It means the brick is running, and the pid (53399) is the same one shown by the CLI, right?
Agreed, it looks as though the brick process is running, but I don't think this brick is in a working state. It should be being healed, but the pending count is not going down.
# Healing
server brick: 00 01 02 03 04 05 06 07
veriicon pending: 0 0 0 0 0 0 101653 0
verijolt pending: 0 0 0 0 0 0 0 0
veriicon split: 0 0 0 0 0 0 0 0
verijolt split: 0 0 0 0 0 0 0 0
veriicon healing: 0 0 0 0 0 0 0 0
verijolt healing: 0 0 0 0 0 0 0 0
2020-10-23 14:26:07
Script used to collate healing status
Before attaching the core with gdb, please install the debug package of glusterfs.
gdb /usr/sbin/glusterfsd -c <brick_core>
Run thread apply all bt full and share the output.
On Fedora, I installed the -devel versions, assuming that is what you meant by debug package:
dnf install -y glusterfs-api-devel glusterfs-devel
Where is the <brick_core> located?
The core is saved at the location configured in /proc/sys/kernel/core_pattern. You need to install the glusterfs-debuginfo package.
ls -l /proc/sys/kernel/core_pattern
-rw-r--r-- 1 root root 0 2020-10-23 14:50 /proc/sys/kernel/core_pattern
dnf info glusterfs-debuginfo
Last metadata expiration check: 0:37:38 ago on Fri 23 Oct 2020 14:14:36 BST.
Error: No matching Packages to list
cat /proc/sys/kernel/core_pattern
The full logs exceed the 10MB limit, so I pruned them a bit.
srv-brick05_201023.log srv-brick06_201023.log srv-brick07_201023.log
cat /proc/sys/kernel/core_pattern
|/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h
ls -l /usr/lib/systemd/systemd-coredump
-rwxr-xr-x 1 root root 61776 2020-09-21 08:47 /usr/lib/systemd/systemd-coredump
The date on this file does not correspond to yesterday's brick crash.
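Since core_pattern pipes cores to systemd-coredump, the dumps are catalogued by coredumpctl rather than written next to the binary, so the date on the systemd-coredump executable is not significant. A hedged sketch for locating and extracting the brick core (assuming coredumpctl is available, as it normally is on Fedora):
# list recorded crashes for the brick binary
coredumpctl list /usr/sbin/glusterfsd
# extract the most recent matching core to a file that gdb can open
coredumpctl dump /usr/sbin/glusterfsd -o /tmp/glusterfsd.core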
Below are the latest brick logs for brick06; I am not seeing any issue in the brick logs.
[2020-10-23 11:44:20.574310] I [MSGID: 100030] [glusterfsd.c:2865:main] 0-/usr/sbin/glusterfsd: Started running /usr/sbin/glusterfsd version 7.8 (args: /usr/sbin/glusterfsd -s verijolt --volfile-id gluvol1.verijolt.srv-brick06 -p /var/run/gluster/vols/gluvol1/verijolt-srv-brick06.pid -S /var/run/gluster/e81ffbefdc824bb9.socket --brick-name /srv/brick06 -l /var/log/glusterfs/bricks/srv-brick06.log --xlator-option -posix.glusterd-uuid=04eb8fdd-ebb8-44c9-9064-5578f43e55b8 --process-name brick --brick-port 49160 --xlator-option gluvol1-server.listen-port=49160) [2020-10-23 11:44:20.598225] I [glusterfsd.c:2593:daemonize] 0-glusterfs: Pid of current running process is 53399 [2020-10-23 11:44:20.602226] I [socket.c:957:__socket_server_bind] 0-socket.glusterfsd: closing (AF_UNIX) reuse check socket 9 [2020-10-23 11:44:20.604864] I [MSGID: 101190] [event-epoll.c:679:event_dispatch_epoll_worker] 0-epoll: Started thread with index 0 [2020-10-23 11:44:20.604898] I [MSGID: 101190] [event-epoll.c:679:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1 [2020-10-23 11:44:20.721821] I [rpcsvc.c:2689:rpcsvc_set_outstanding_rpc_limit] 0-rpc-service: Configured rpc.outstanding-rpc-limit with value 64 [2020-10-23 11:44:20.722754] W [socket.c:4161:reconfigure] 0-gluvol1-quota: disabling non-blocking IO [2020-10-23 11:44:20.762663] I [socket.c:957:__socket_server_bind] 0-socket.gluvol1-changelog: closing (AF_UNIX) reuse check socket 15 [2020-10-23 11:44:20.763042] I [trash.c:2449:init] 0-gluvol1-trash: no option specified for 'eliminate', using NULL [2020-10-23 11:44:22.347133] I [addr.c:52:compare_addr_and_update] 0-/srv/brick06: allowed = "", received addr = "192.168.0.21" [2020-10-23 11:44:22.347177] I [MSGID: 115029] [server-handshake.c:550:server_setvolume] 0-gluvol1-server: accepted client from CTX_ID:dce4fcdc-10f5-45de-8000-f4030618b984-GRAPH_ID:4-PID:1013-HOST:verilate-PC_NAME:gluvol1-client-5-RECON_NO:-6 (version: 6.9) with subvol /srv/brick06 [2020-10-23 11:44:22.350754] I [rpcsvc.c:864:rpcsvc_handle_rpc_call] 0-rpc-service: spawned a request handler thread for queue 0 [2020-10-23 11:44:23.384978] I [addr.c:52:compare_addr_and_update] 0-/srv/brick06: allowed = "", received addr = "192.168.0.29" [2020-10-23 11:44:23.384997] I [MSGID: 115029] [server-handshake.c:550:server_setvolume] 0-gluvol1-server: accepted client from CTX_ID:cead90c8-cf17-4f64-864e-d5fb99e0d21f-GRAPH_ID:4-PID:1025-HOST:veritosh-PC_NAME:gluvol1-client-5-RECON_NO:-9 (version: 6.9) with subvol /srv/brick06 [2020-10-23 11:44:23.386722] I [rpcsvc.c:864:rpcsvc_handle_rpc_call] 0-rpc-service: spawned a request handler thread for queue 1 [2020-10-23 11:44:23.388107] I [addr.c:52:compare_addr_and_update] 0-/srv/brick06: allowed = "", received addr = "192.168.0.27" [2020-10-23 11:44:23.388124] I [MSGID: 115029] [server-handshake.c:550:server_setvolume] 0-gluvol1-server: accepted client from CTX_ID:c741e6ef-5725-4f23-b4c3-9b7f94be1a01-GRAPH_ID:4-PID:1519-HOST:verirack-PC_NAME:gluvol1-client-5-RECON_NO:-7 (version: 6.9) with subvol /srv/brick06 [2020-10-23 11:44:23.401629] I [addr.c:52:compare_addr_and_update] 0-/srv/brick06: allowed = "", received addr = "192.168.0.19" [2020-10-23 11:44:23.401649] I [login.c:109:gf_auth] 0-auth/login: allowed user names: 6599e942-63ee-451c-8c31-766bb05ac0c2 [2020-10-23 11:44:23.401665] I [MSGID: 115029] [server-handshake.c:550:server_setvolume] 0-gluvol1-server: accepted client from CTX_ID:33c6affe-ce70-4122-9658-289a4ebf0420-GRAPH_ID:0-PID:53485-HOST:verijolt-PC_NAME:gluvol1-client-5-RECON_NO:-0 
(version: 7.8) with subvol /srv/brick06 [2020-10-23 11:44:23.666142] I [addr.c:52:compare_addr_and_update] 0-/srv/brick06: allowed = "", received addr = "192.168.0.18" [2020-10-23 11:44:23.666179] I [login.c:109:gf_auth] 0-auth/login: allowed user names: 6599e942-63ee-451c-8c31-766bb05ac0c2 [2020-10-23 11:44:23.666194] I [MSGID: 115029] [server-handshake.c:550:server_setvolume] 0-gluvol1-server: accepted client from CTX_ID:76a4bc50-328e-45dd-9cdf-99329aec9ad2-GRAPH_ID:0-PID:1277-HOST:veriicon-PC_NAME:gluvol1-client-5-RECON_NO:-12 (version: 7.8) with subvol /srv/brick06 [2020-10-23 11:44:23.669918] I [addr.c:52:compare_addr_and_update] 0-/srv/brick06: allowed = "", received addr = "192.168.0.19" [2020-10-23 11:44:23.669932] I [MSGID: 115029] [server-handshake.c:550:server_setvolume] 0-gluvol1-server: accepted client from CTX_ID:e0928c38-c13e-4ab5-8d43-b26e397967ee-GRAPH_ID:0-PID:1303-HOST:verijolt-PC_NAME:gluvol1-client-5-RECON_NO:-3 (version: 7.8) with subvol /srv/brick06 [2020-10-23 11:44:23.686523] I [addr.c:52:compare_addr_and_update] 0-/srv/brick06: allowed = "", received addr = "192.168.0.10" [2020-10-23 11:44:23.686538] I [MSGID: 115029] [server-handshake.c:550:server_setvolume] 0-gluvol1-server: accepted client from CTX_ID:fd9a184d-61a0-41ae-bfc5-83c67a7a0497-GRAPH_ID:4-PID:1044-HOST:verialto-PC_NAME:gluvol1-client-5-RECON_NO:-12 (version: 7.7) with subvol /srv/brick06 [2020-10-23 11:44:23.704923] I [addr.c:52:compare_addr_and_update] 0-/srv/brick06: allowed = "", received addr = "192.168.0.12" [2020-10-23 11:44:23.704939] I [MSGID: 115029] [server-handshake.c:550:server_setvolume] 0-gluvol1-server: accepted client from CTX_ID:36ddaf3e-f7e2-4224-a9b8-5b1a6de94a8e-GRAPH_ID:0-PID:1282-HOST:vericalm-PC_NAME:gluvol1-client-5-RECON_NO:-10 (version: 7.8) with subvol /srv/brick06 [2020-10-23 11:44:23.705368] I [addr.c:52:compare_addr_and_update] 0-/srv/brick06: allowed = "", received addr = "192.168.0.17" [2020-10-23 11:44:23.705385] I [MSGID: 115029] [server-handshake.c:550:server_setvolume] 0-gluvol1-server: accepted client from CTX_ID:7bee7605-6be2-41b3-8fc4-1fc289bade00-GRAPH_ID:4-PID:1030-HOST:veriheat-PC_NAME:gluvol1-client-5-RECON_NO:-12 (version: 7.7) with subvol /srv/brick06 [2020-10-23 11:44:23.777276] I [addr.c:52:compare_addr_and_update] 0-/srv/brick06: allowed = "", received addr = "192.168.0.11" [2020-10-23 11:44:23.777291] I [MSGID: 115029] [server-handshake.c:550:server_setvolume] 0-gluvol1-server: accepted client from CTX_ID:eeda6a67-5c53-44f7-ba1a-b366adaf97fe-GRAPH_ID:4-PID:1240-HOST:veriblob-PC_NAME:gluvol1-client-5-RECON_NO:-10 (version: 7.7) with subvol /srv/brick06 [2020-10-23 11:44:25.426054] I [addr.c:52:compare_addr_and_update] 0-/srv/brick06: allowed = "", received addr = "192.168.0.19" [2020-10-23 11:44:25.426068] I [login.c:109:gf_auth] 0-auth/login: allowed user names: 6599e942-63ee-451c-8c31-766bb05ac0c2 [2020-10-23 11:44:25.426076] I [MSGID: 115029] [server-handshake.c:550:server_setvolume] 0-gluvol1-server: accepted client from CTX_ID:21d955c2-e8fd-4a41-847a-38729647bd8e-GRAPH_ID:4-PID:1969-HOST:verijolt-PC_NAME:gluvol1-client-5-RECON_NO:-0 (version: 7.8) with subvol /srv/brick06 [2020-10-23 11:44:25.434554] I [addr.c:52:compare_addr_and_update] 0-/srv/brick06: allowed = "", received addr = "192.168.0.19" [2020-10-23 11:44:25.434571] I [login.c:109:gf_auth] 0-auth/login: allowed user names: 6599e942-63ee-451c-8c31-766bb05ac0c2 [2020-10-23 11:44:25.434579] I [MSGID: 115029] [server-handshake.c:550:server_setvolume] 0-gluvol1-server: accepted client from 
CTX_ID:cff7c9fe-fbdc-445d-b1c8-c068493de96b-GRAPH_ID:0-PID:53517-HOST:verijolt-PC_NAME:gluvol1-client-5-RECON_NO:-0 (version: 7.8) with subvol /srv/brick06 [2020-10-23 11:44:27.440119] I [addr.c:52:compare_addr_and_update] 0-/srv/brick06: allowed = "", received addr = "192.168.0.19" [2020-10-23 11:44:27.440211] I [login.c:109:gf_auth] 0-auth/login: allowed user names: 6599e942-63ee-451c-8c31-766bb05ac0c2 [2020-10-23 11:44:27.440221] I [MSGID: 115029] [server-handshake.c:550:server_setvolume] 0-gluvol1-server: accepted client from CTX_ID:35535c81-dd54-497c-bfe1-a5784f1c0e65-GRAPH_ID:0-PID:53546-HOST:verijolt-PC_NAME:gluvol1-client-5-RECON_NO:-0 (version: 7.8) with subvol /srv/brick06
Specifically for heal, we need to check glustershd.log and the heal logs (glfsheal-gluvol1.log).
@mohit84 thanks for continuing to try and help me out here. There are two things:
Is there anything you can suggest that I might try for the latter?
Below are the latest brick logs for brick06; I am not seeing any issue in the brick logs.
My fault. These log files are too big for the 10MB limit for attaching to this ticket. I stripped lines by date, not realising the crash dump sections are not prefixed by a date.
[2020-10-23 11:44:27.440211] I [login.c:109:gf_auth] 0-auth/login: allowed user names: 6599e942-63ee-451c-8c31-766bb05ac0c2
[2020-10-23 11:44:27.440221] I [MSGID: 115029] [server-handshake.c:550:server_setvolume] 0-gluvol1-server: accepted client from CTX_ID:35535c81-dd54-497c-bfe1-a5784f1c0e65-GRAPH_ID:0-PID:53546-HOST:verijolt-PC_NAME:gluvol1-client-5-RECON_NO:-0 (version: 7.8) with subvol /srv/brick06
pending frames:
frame : type(1) op(FSETXATTR)
frame : type(1) op(LOOKUP)
frame : type(1) op(OPEN)
frame : type(1) op(READ)
frame : type(0) op(0)
frame : type(1) op(READ)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
I'll see if I can attach a more complete tail of that log file.
Can you explain why brick06 appears to be online, yet this log shows it has crashed? Any suggestions for moving this to a working state?
Specifically for heal, we need to check glustershd.log and the heal logs (glfsheal-gluvol1.log).
dnf info glusterfs-debuginfo
Last metadata expiration check: 0:37:38 ago on Fri 23 Oct 2020 14:14:36 BST.
Error: No matching Packages to list
You need to run this to install the debug information (assuming you are running latest version):
dnf debuginfo-install glusterfs-server-7.8-1.fc32.x86_64
Before attaching the core with gdb, please install the debug package of glusterfs.
- gdb /usr/sbin/glusterfsd -c <brick_core>
- Run thread apply all bt full and share the output.
I had to do a load of installs:
dnf debuginfo-install glusterfs-server
dnf install -y gdb
dnf --enablerepo='*debug*' install /usr/lib/debug/.build-id/23/63b517080071fe0d9871c7c1534df99fd7f970.debug
dnf --enablerepo='*debug*' install /usr/lib/debug/.build-id/39/2de5e09ed27bf2fe1722c0198295777db75ef5.debug
dnf --enablerepo='*debug*' install /usr/lib/debug/.build-id/63/debfea3b4768cdcfb082e38cd754688642b1ec.debug
dnf --enablerepo='*debug*' install /usr/lib/debug/.build-id/99/1df1f4a01379a1fd494b9a8fc104c0f02e2a5e.debug
dnf debuginfo-install glibc-2.31-4.fc32.x86_64 keyutils-libs-1.6-4.fc32.x86_64 krb5-libs-1.18.2-22.fc32.x86_64 libacl-2.2.53-5.fc32.x86_64 libaio-0.3.111-7.fc32.x86_64 libattr-2.4.48-8.fc32.x86_64 libcom_err-1.45.5-3.fc32.x86_64 libgcc-10.2.1-5.fc32.x86_64 libselinux-3.0-5.fc32.x86_64 libtirpc-1.2.6-1.rc4.fc32.x86_64 libuuid-2.35.2-1.fc32.x86_64 openssl-libs-1.1.1g-1.fc32.x86_64 pcre2-10.35-7.fc32.x86_64 sssd-client-2.4.0-1.fc32.x86_64 zlib-1.2.11-21.fc32.x86_64
I think I found the core dump.
I then ran
gdb /usr/sbin/glusterfsd -c /var/spool/abrt/ccpp-2020-10-22-14:23:50.21866-346125/coredump \
-ex "set logging file gdb_glusterfsd_gluvol1_verijolt_srv-brick06.log" -ex "set logging on"
giving the following on screen output
where I entered
thread apply all bt full
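For reference, the same capture can also be done non-interactively; this is just a sketch reusing the paths above, driven by gdb's -batch and -ex options:
# non-interactive variant of the same gdb capture
gdb -batch /usr/sbin/glusterfsd \
    -c /var/spool/abrt/ccpp-2020-10-22-14:23:50.21866-346125/coredump \
    -ex "thread apply all bt full" \
    > gdb_glusterfsd_gluvol1_verijolt_srv-brick06.log 2>&1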
Thanks @mohit84 and @xhernandez I hope someone can make sense of the coredump info from gdb.
Meanwhile, does anyone have any suggestions for nudging my brick06 back into life? (The HDD itself looks fine, no SMART errors, and disk usage looks comparable with the working brick06 on veriicon:
)
Hi, thanks for sharing the coredump. It seems the file has huge xattrs on the backend; that's why the size shows as 263507 (257k), while the function is called in an iot_worker thread whose stack size is only 256k. That is why the brick process crashed.
Any idea why so many xattrs were created on the backend?
Thread 1 (Thread 0x7f880c684700 (LWP 346228)):
#0 0x00007f8814458729 in posix_get_ancestry_non_directory (this=this@entry=0x7f8804008930, leaf_inode=<optimized out>, head=head@entry=0x7f880c682b50, path=path@entry=0x0, type=type@entry=2, op_errno=op_errno@entry=0x7f880c682b4c, xdata=<optimized out>) at posix-inode-fd-ops.c:3218
remaining_size = 0
op_ret = <optimized out>
pathlen = -1
handle_size = 0
pgfid = '\000' <repeats 15 times>
nlink_samepgfid = 0
stbuf = {st_dev = 0, st_ino = 0, st_nlink = 0, st_mode = 0, st_uid = 0, st_gid = 0, __pad0 = 0, st_rdev = 0, st_size = 0, st_blksize = 0, st_blocks = 0, st_atim = {tv_sec = 0, tv_nsec = 0}, st_mtim = {tv_sec = 0, tv_nsec = 0}, st_ctim = {tv_sec = 0, tv_nsec = 0}, __glibc_reserved = {0, 0, 0}}
list = 0x0
list_offset = 0
priv = <optimized out>
size = 263507
parent = 0x0
loc = 0x7f87f407ed20
leaf_path = <optimized out>
key = '\000' <repeats 4095 times>
dirpath = '\000' <repeats 4095 times>
pgfidstr = '\000' <repeats 36 times>
len = <optimized out>
__FUNCTION__ = "posix_get_ancestry_non_directory"
#1 0x00007f8814458c9f in posix_get_ancestry (this=this@entry=0x7f8804008930, leaf_inode=<optimized out>, head=head@entry=0x7f880c682b50, path=path@entry=0x0, type=type@entry=2, op_errno=op_errno@entry=0x7f880c682b4c, xdata=0x7f87f4082778) at posix-inode-fd-ops.c:3316
ret = -1
priv = <optimized out>
#2 0x00007f8814461c20 in posix_readdirp (frame=0x7f87f407d0f8, this=0x7f8804008930, fd=0x7f87f4081ca8, size=140222300433232, off=0, dict=0x7f87f4082778) at posix-inode-fd-ops.c:5630
entries = {{list = {next = 0x7f880c682b50, prev = 0x7f880c682b50}, {next = 0x7f880c682b50, prev = 0x7f880c682b50}}, d_ino = 2, d_off = 0, d_len = 3623899760, d_type = 32647, d_stat = {ia_flags = 140222300433296, ia_ino = 140222159465296, ia_dev = 140222518552358, ia_rdev = 206158430232, ia_size = 10846431729114805760, ia_nlink = 208153600, ia_uid = 32648, ia_gid = 3597268480, ia_blksize = 2525381680, ia_blocks = 140221891468232, ia_atime = 0, ia_mtime = 140221891408544, ia_ctime = 140222518550629, ia_btime = 25, ia_atime_nsec = 426807549, ia_mtime_nsec = 32648, ia_ctime_nsec = 4093651256, ia_btime_nsec = 32647, ia_attributes = 10846431729114805760, ia_attributes_mask = 140221890963032, ia_gfid = "8)\000\364\207\177\000\000\001\000\000\000\000\000\000", ia_type = IA_INVAL, ia_prot = {suid = 0 '\000', sgid = 0 '\000', sticky = 0 '\000', owner = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'}, group = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'}, other = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'}}}, dict = 0x7f8804008930, inode = 0x7f87f407d0f8, d_name = 0x7f880c682c20 "\250\034\b\364\207\177"}
op_ret = -1
op_errno = 0
entry = 0x0
__FUNCTION__ = "posix_readdirp"
#3 0x00007f88196ec90b in default_readdirp (frame=0x7f87f407d0f8, this=<optimized out>, fd=0x7f87f4081ca8, size=0, off=0, xdata=0x7f87f4082778) at defaults.c:2966
old_THIS = 0x7f880400c350
next_xl = 0x7f8804008930
next_xl_fn = <optimized out>
__FUNCTION__ = "default_readdirp"
#4 0x00007f88196ec90b in default_readdirp (frame=frame@entry=0x7f87f407d0f8, this=<optimized out>, fd=fd@entry=0x7f87f4081ca8, size=size@entry=0, off=off@entry=0, xdata=xdata@entry=0x7f87f4082778) at defaults.c:2966
old_THIS = 0x7f880400e3f0
next_xl = 0x7f880400c350
next_xl_fn = <optimized out>
__FUNCTION__ = "default_readdirp"
#5 0x00007f881436e929 in br_stub_readdirp (frame=frame@entry=0x7f87f407dbc8, this=0x7f8804010c30, fd=fd@entry=0x7f87f4081ca8, size=size@entry=0, offset=offset@entry=0, dict=dict@entry=0x7f87f4082778) at bit-rot-stub.c:2898
_new = 0x7f87f407d0f8
old_THIS = 0x7f8804010c30
next_xl_fn = 0x7f88196ec830 <default_readdirp>
tmp_cbk = 0x7f88143730d0 <br_stub_readdirp_cbk>
ret = <optimized out>
op_errno = <optimized out>
xref = <optimized out>
priv = <optimized out>
__FUNCTION__ = "br_stub_readdirp"
#6 0x00007f88143577d2 in posix_acl_readdirp (frame=frame@entry=0x7f87f407e698, this=0x7f8804012b50, fd=fd@entry=0x7f87f4081ca8, size=size@entry=0, offset=offset@entry=0, dict=dict@entry=0x7f87f4082778) at posix-acl.c:1648
_new = 0x7f87f407dbc8
old_THIS = 0x7f8804012b50
next_xl_fn = 0x7f881436e530 <br_stub_readdirp>
tmp_cbk = 0x7f881435af30 <posix_acl_readdirp_cbk>
ret = <optimized out>
alloc_dict = <optimized out>
__FUNCTION__ = "posix_acl_readdirp"
#7 0x00007f88143166d0 in pl_readdirp (frame=0x7f87f407f168, this=0x7f8804014790, fd=0x7f87f4081ca8, size=0, offset=0, xdata=0x7f87f4082778) at posix.c:3046
_new = 0x7f87f407e698
old_THIS = <optimized out>
next_xl_fn = 0x7f8814357560 <posix_acl_readdirp>
tmp_cbk = 0x7f881431ef70 <pl_readdirp_cbk>
__FUNCTION__ = "pl_readdirp"
#8 0x00007f88196ec90b in default_readdirp (frame=0x7f87f407f168, this=<optimized out>, fd=0x7f87f4081ca8, size=0, off=0, xdata=0x7f87f4082778) at defaults.c:2966
old_THIS = 0x7f88040163b0
next_xl = 0x7f8804014790
next_xl_fn = <optimized out>
__FUNCTION__ = "default_readdirp"
#9 0x00007f88196ec90b in default_readdirp (frame=0x7f87f407f168, this=<optimized out>, fd=0x7f87f4081ca8, size=0, off=0, xdata=0x7f87f4082778) at defaults.c:2966
old_THIS = 0x7f8804018490
next_xl = 0x7f88040163b0
next_xl_fn = <optimized out>
__FUNCTION__ = "default_readdirp"
#10 0x00007f88196ec90b in default_readdirp (frame=frame@entry=0x7f87f407f168, this=<optimized out>, fd=fd@entry=0x7f87f4081ca8, size=size@entry=0, off=off@entry=0, xdata=xdata@entry=0x7f87f4082778) at defaults.c:2966
old_THIS = 0x7f880401a130
next_xl = 0x7f8804018490
next_xl_fn = <optimized out>
__FUNCTION__ = "default_readdirp"
#11 0x00007f88142c51f1 in up_readdirp (frame=frame@entry=0x7f87f4089e68, this=0x7f880401be00, fd=fd@entry=0x7f87f4081ca8, size=size@entry=0, off=off@entry=0, dict=dict@entry=0x7f87f4082778) at upcall.c:1324
_new = 0x7f87f407f168
old_THIS = 0x7f880401be00
next_xl_fn = 0x7f88196ec830 <default_readdirp>
tmp_cbk = 0x7f88142bc820 <up_readdirp_cbk>
local = <optimized out>
__FUNCTION__ = "up_readdirp"
#12 0x00007f88197051bd in default_readdirp_resume (frame=0x7f87f407fc38, this=0x7f880401da80, fd=0x7f87f4081ca8, size=0, off=0, xdata=0x7f87f4082778) at defaults.c:2169
_new = 0x7f87f4089e68
old_THIS = <optimized out>
next_xl_fn = 0x7f88142c4fc0 <up_readdirp>
tmp_cbk = 0x7f88196e6760 <default_readdirp_cbk>
__FUNCTION__ = "default_readdirp_resume"
#13 0x00007f8819683035 in call_resume (stub=0x7f87f409cea8) at call-stub.c:2555
old_THIS = 0x7f880401da80
__FUNCTION__ = "call_resume"
#14 0x00007f88142ad128 in iot_worker (data=0x7f8804056060) at io-threads.c:232
conf = 0x7f8804056060
this = 0x7f880401da80
stub = 0x7f87f409cea8
sleep_till = {tv_sec = 1603373148, tv_nsec = 770933280}
ret = <optimized out>
pri = 0
bye = <optimized out>
__FUNCTION__ = "iot_worker"
#15 0x00007f8819582432 in start_thread (arg=<optimized out>) at pthread_create.c:477
ret = <optimized out>
pd = <optimized out>
unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140222300440320, 3244609149766524837, 140222428397518, 140222428397519, 140222159741384, 140222300440320, -3308120404764074075, -3308148579319488603}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
not_first_call = 0
#16 0x00007f88194ae913 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
No locals.
Any idea why so many xattrs were created on the backend?
A 12TB RAID5 box died (either/both controller and power supply), but the 5 HDD's are ok. I am painstakingly restoring the data from the HDDs onto a gluster volume. I am confident that I am getting this right because of good parity across the HDDs and consistent checksums on a file by file basis. The data on this box was an rsnapshot backup so it contains a lot of hard links. I have got to the stage where the data files, including all the hard links (millions of them) are restored ok, but the permissions, ownership and timestamps of the dirs and files restored onto gluster are incorrect. Hence I have scripted changing these.
It is conceivable that these scripts, full of chmod, chown and touch for each file in turn, place a burden on gluster. I stated in the first (submission) comment at the top that this was a possible cause.
If running such a script does "create many xattrs on the backend" then this is a likely cause. _(I thought xattrs were to do with the gluster management, not the normal file attributes for permissions, ownership and timestamps. And by backend do you mean the gluster server?)_
Why has only one brick crashed? Why was it fine for 5 hours or so?
If this is the cause, then once my gluster volume is back to normal (brick06 on verijolt properly online), then I can break up my restore into more manageable chunks. This is a one-off exercise, I will not and do not want to be doing this again!
Given you have a clue as to the cause, how would you suggest I bring brick06 on verijolt back to life?
The fastest way I see to fix this is to identify the file that has so many extended attributes and remove/clear them.
To do that, inside gdb, can you execute this:
x/16bx loc->gfid
This will return 16 hexadecimal numbers, like this:
0x7f24df07b034: 0xXX 0xYY 0x-- 0x-- 0x-- 0x-- 0x-- 0x--
0x7f24df07b03c: 0x-- 0x-- 0x-- 0x-- 0x-- 0x-- 0x-- 0x--
You need to take the first two values and go to this directory in server verijolt:
/srv/brick06/.glusterfs/XX/YY
You should find a file there that has all the 16 numbers returned by gdb (with some '-' in the middle).
Once you identify the file, you need to execute this command:
# getfattr -m. -e hex -d <file>
Depending on what this returns, we'll decide how to proceed.
@xhernandez, very clear instructions, thank you.
x/16bx loc->gfid
0x7f87f407ed40: 0xeb 0xc3 0xf6 0x06 0xbc 0x70 0x4e 0x67
0x7f87f407ed48: 0xb3 0xc0 0xcd 0x5e 0x95 0x01 0xae 0x2c
ls -l /srv/brick06/.glusterfs/eb/c3/ebc3f606*
-rwxr-xr-x 70 bobw root 43571 2019-08-02 09:15 /srv/brick06/.glusterfs/eb/c3/ebc3f606-bc70-4e67-b3c0-cd5e9501ae2c
getfattr -m . -e hex -d /srv/brick06/.glusterfs/eb/c3/ebc3f606-bc70-4e67-b3c0-cd5e9501ae2c
/srv/brick06/.glusterfs/eb/c3/ebc3f606-bc70-4e67-b3c0-cd5e9501ae2c: Argument list too long
ls -l /srv/brick06/.glusterfs/eb/c3/ebc3f606-bc70-4e67-b3c0-cd5e9501ae2c
-rwxr-xr-x 70 bobw root 43571 2019-08-02 09:15 /srv/brick06/.glusterfs/eb/c3/ebc3f606-bc70-4e67-b3c0-cd5e9501ae2c
getfattr '/srv/brick06/.glusterfs/eb/c3/ebc3f606-bc70-4e67-b3c0-cd5e9501ae2c'
/srv/brick06/.glusterfs/eb/c3/ebc3f606-bc70-4e67-b3c0-cd5e9501ae2c: Argument list too long
file /srv/brick06/.glusterfs/eb/c3/ebc3f606-bc70-4e67-b3c0-cd5e9501ae2c
/srv/brick06/.glusterfs/eb/c3/ebc3f606-bc70-4e67-b3c0-cd5e9501ae2c: Lua bytecode, version 5.1
My interpretation of this response "Argument list too long", is not that the argument list passed to getfattr is too long, but rather that the number of attributes, or something similar, is too long for getfattr to understand.
The response from a random other file in this directory appears fine
getfattr -m . -e hex -d /srv/brick06/.glusterfs/eb/c3/ebc31332-4198-488e-baca-b59b291fbdd7
getfattr: Removing leading '/' from absolute path names
# file: srv/brick06/.glusterfs/eb/c3/ebc31332-4198-488e-baca-b59b291fbdd7
trusted.afr.dirty=0x000000000000000000000000
trusted.bit-rot.version=0x12000000000000005f8c1ab4000ebf2a
trusted.gfid=0xebc313324198488ebacab59b291fbdd7
trusted.gfid2path.caf84fba69486ea6=0x38626639666662342d613637312d343039612d623238652d6439386261343962656338352f77646d6175642e737973
trusted.glusterfs.mdata=0x010000000000000000000000005e6e6ef3000000000fe52c49000000003ef209600000000000000000000000005e6e6ef3000000000fa47fa7
trusted.glusterfs.quota.8bf9ffb4-a671-409a-b28e-d98ba49bec85.contri.2=0x00000000000122000000000000000001
trusted.pgfid.8bf9ffb4-a671-409a-b28e-d98ba49bec85=0x00000001
I see that "Argument list too long" is sometimes associated with "The maximum size extended attribute the Linux kernel can support is 64kb."
So clearly there is something wrong with this file. What could gluster (or something else) have done that could have corrupted this file?
What is the next step you might propose?
I am also interested in the possibility that the crash was caused by one file, as opposed to a backlog accumulation of changes required to many files. @mohit84 wrote
Any idea why so many xattrs were created on the backend?
and I think I may have misunderstood or been distracted by "the backend". What does "backend" refer to? Is it just one file in the .glusterfs/XX/YY directory, or is it a collective term for something else?
I see that the file has 70 hardlinks. Gluster keeps special xattrs per hardlink, but they don't use so much space.
If you are using XFS for the bricks, can you execute this command ?
xfs_db -r $(df /srv/brick06/ | awk '{ print $1; }' | tail -1) -c "inode $(ls -i /srv/brick06/.glusterfs/eb/c3/ebc3f606-bc70-4e67-b3c0-cd5e9501ae2c | awk '{ print $1; }')" -c "p" -c "ablock 0" -c "p" -c "btdump"
This should show all extended attributes.
You can also check the same file in the other server (veriicon) and see if it has good data or not.
For now the only way to prevent a crash is to remove that file (assuming there aren't other files in the same condition), but to do so you will also need to identify and delete all hardlinks.
You can do that using this command:
find /srv/brick06/ -inum $(ls -i /srv/brick06/.glusterfs/eb/c3/ebc3f606-bc70-4e67-b3c0-cd5e9501ae2c | awk '{ print $1; }')
Once listed, verify that the file is healthy in the other server (otherwise you will lose it). If it's ok in the other server, you can delete all entries returned by the find command (or move them elsewhere).
Once the entries are not present in the brick, you can try restarting it and see if it doesn't crash this time. If it crashes, you will need to identify which file is failing now by repeating the steps from previous comments.
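As an illustration only (the inode lookup repeats the command above; the quarantine directory is a hypothetical choice), moving every hardlink of the bad inode off the brick instead of deleting it outright might look like:
# quarantine all hardlinks of the offending inode so they can be copied back later if needed
mkdir -p /root/brick06-quarantine
find /srv/brick06/ -inum $(ls -i /srv/brick06/.glusterfs/eb/c3/ebc3f606-bc70-4e67-b3c0-cd5e9501ae2c | awk '{ print $1; }') \
    -exec mv -v {} /root/brick06-quarantine/ \;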
@xhernandez thanks for your continued support.
I use btrfs rather than xfs.
I have a more pressing problem right now: files are going missing from the gluster view. This is the HA disaster that I thought gluster would avoid. Panic!
I need to take some drastic steps, like shutting down one of the servers, but I don't know how to handle this situation. Grrr.
My gluster system has not been at all happy. It looks like files are going missing (as viewed from a client). Consequently clients crashed, locked up, or exhibited undesirable behaviour. All my notes, journalling, scripts etc., are held on gluster, so when gluster goes down things get very difficult.
Fortunately, gluster volume start gluvol1 force
has been enough to get some sort of normality, but I am extremely nervous and I would really appreciate your help in bringing my gluster system back to a stable and safe state.
It seems that the brick log file records when something catastrophic has gone wrong, so I searched all the log files
on verijolt
awk '/signal received/{sig=$0;l=1;next}(l){l++}(l==3)&&/^20/{printf(" %s %s %s\n",$0,sig,FILENAME)} \
' /var/log/glusterfs/bricks/srv-brick??.log
2020-10-22 12:16:28 signal received: 11 /var/log/glusterfs/bricks/srv-brick06.log
2020-10-22 14:09:37 signal received: 11 /var/log/glusterfs/bricks/srv-brick06.log
2020-10-22 14:21:23 signal received: 11 /var/log/glusterfs/bricks/srv-brick06.log
2020-10-22 14:24:19 signal received: 11 /var/log/glusterfs/bricks/srv-brick06.log
2020-10-22 14:27:16 signal received: 11 /var/log/glusterfs/bricks/srv-brick06.log
2020-10-22 14:39:00 signal received: 11 /var/log/glusterfs/bricks/srv-brick06.log
2020-10-22 16:25:11 signal received: 11 /var/log/glusterfs/bricks/srv-brick06.log
2020-10-22 17:04:34 signal received: 11 /var/log/glusterfs/bricks/srv-brick06.log
2020-10-23 00:25:06 signal received: 11 /var/log/glusterfs/bricks/srv-brick06.log
2020-10-23 00:29:13 signal received: 11 /var/log/glusterfs/bricks/srv-brick06.log
2020-10-23 22:55:11 signal received: 11 /var/log/glusterfs/bricks/srv-brick06.log
2020-10-23 23:11:06 signal received: 11 /var/log/glusterfs/bricks/srv-brick06.log
2020-10-24 11:47:37 signal received: 11 /var/log/glusterfs/bricks/srv-brick06.log
2020-10-23 22:00:52 signal received: 11 /var/log/glusterfs/bricks/srv-brick07.log
on veriicon
2020-10-23 22:00:50 signal received: 11 /var/log/glusterfs/bricks/srv-brick06.log
2020-10-23 23:06:03 signal received: 11 /var/log/glusterfs/bricks/srv-brick06.log
2020-10-23 22:00:53 signal received: 11 /var/log/glusterfs/bricks/srv-brick07.log
As you can see, brick06 and brick07 on both veriicon and verijolt (the two replica 2 servers) have been affected.
The failure mechanism appears to be different from the huge xattr problem analysed above, so I am adding gdb output from each of the servers.
Failure on veriicon brick06
Failure on verijolt brick06
For now the only way to prevent a crash is to remove that file (assuming there aren't other files in the same condition), but to do so you will also need to identify and delete all hardlinks.
There are 70 hardlinks on the offending file:
getfattr /srv/brick06/.glusterfs/eb/c3/ebc3f606-bc70-4e67-b3c0-cd5e9501ae2c \
2>&1 | awk '{print " " $0}'; date +\ \ %F\ %T%n
/srv/brick06/.glusterfs/eb/c3/ebc3f606-bc70-4e67-b3c0-cd5e9501ae2c: Argument list too long
2020-10-24 09:52:56
ls -li /srv/brick06/.glusterfs/eb/c3/ebc3f606-bc70-4e67-b3c0-cd5e9501ae2c \
2>&1 | awk '{print " " $0}'; date +\ \ %F\ %T%n
12698422 -rwxr-xr-x 70 bobw root 43571 2019-08-02 09:15 /srv/brick06/.glusterfs/eb/c3/ebc3f606-bc70-4e67-b3c0-cd5e9501ae2c
2020-10-24 09:55:09
export start_date=`date +%F\ %T`; \
find /srv/brick06 -inum 12698422 \
2>&1 | awk '{print " " $0}'; echo " ${start_date}" && date +\ \ %F\ %T%n
/srv/brick06/.glusterfs/eb/c3/ebc3f606-bc70-4e67-b3c0-cd5e9501ae2c
...
2020-10-24 09:58:25
2020-10-24 10:16:56
Notice it took nearly 20 minutes to find them!
This "Argument list too long" response to getfattr is also present on the equivalent file on veriicon.
I elected to delete these 70 files on both servers. I then went through a process of gluster volume start gluvol1 force
and rebooting each server in turn. I now have a working gluster system with all bricks online. And I have copied back the 70 files that I had deleted.
Unsurprisingly, there is a massive healing backlog:
# Healing
server brick: 00 01 02 03 04 05 06 07
veriicon pending: 0 0 0 0 0 0 80283 0
verijolt pending: 0 0 0 0 0 0 223 0
veriicon split: 0 0 0 0 0 0 66 0
verijolt split: 0 0 0 0 0 0 64 0
veriicon healing: 0 0 0 0 0 0 0 0
verijolt healing: 0 0 0 0 0 0 0 0
2020-10-24 18:21:19
I manually cleared the split-brain issues, and several hours later, most of the self-healing had completed, I am now left with just one:
# Healing
server brick: 00 01 02 03 04 05 06 07
veriicon pending: 0 0 0 0 0 0 1 0
verijolt pending: 0 0 0 0 0 0 0 0
veriicon split: 0 0 0 0 0 0 0 0
verijolt split: 0 0 0 0 0 0 0 0
veriicon healing: 0 0 0 0 0 0 0 0
verijolt healing: 0 0 0 0 0 0 0 0
2020-10-25 18:06:46
Gluster has been attempting to heal this file for over 24 hours:
[2020-10-24 17:03:51.490279] W [MSGID: 114031] [client-rpc-fops_v2.c:915:client4_0_getxattr_cbk] 2-gluvol1-client-4: remote operation failed. Path: <gfid:3088bdfc-7680-4b7c-8ff3-968c0ae6a572> (3088bdfc-7680-4b7c-8ff3-968c0ae6a572). Key: (null) [Argument list too long]
[2020-10-24 18:46:28.885450] W [MSGID: 114031] [client-rpc-fops_v2.c:915:client4_0_getxattr_cbk] 2-gluvol1-client-4: remote operation failed. Path: <gfid:3088bdfc-7680-4b7c-8ff3-968c0ae6a572> (3088bdfc-7680-4b7c-8ff3-968c0ae6a572). Key: (null) [Argument list too long]
[2020-10-24 18:46:37.022954] W [MSGID: 114031] [client-rpc-fops_v2.c:915:client4_0_getxattr_cbk] 2-gluvol1-client-4: remote operation failed. Path: <gfid:3088bdfc-7680-4b7c-8ff3-968c0ae6a572> (3088bdfc-7680-4b7c-8ff3-968c0ae6a572). Key: (null) [Argument list too long]
[2020-10-24 18:56:43.233457] W [MSGID: 114031] [client-rpc-fops_v2.c:915:client4_0_getxattr_cbk] 2-gluvol1-client-4: remote operation failed. Path: <gfid:3088bdfc-7680-4b7c-8ff3-968c0ae6a572> (3088bdfc-7680-4b7c-8ff3-968c0ae6a572). Key: (null) [Argument list too long]
...
[2020-10-25 17:58:36.353516] I [MSGID: 108026] [afr-self-heal-metadata.c:51:__afr_selfheal_metadata_do] 2-gluvol1-replicate-0: performing metadata selfheal on 3088bdfc-7680-4b7c-8ff3-968c0ae6a572
[2020-10-25 17:58:36.354143] W [MSGID: 114031] [client-rpc-fops_v2.c:915:client4_0_getxattr_cbk] 2-gluvol1-client-4: remote operation failed. Path: <gfid:3088bdfc-7680-4b7c-8ff3-968c0ae6a572> (3088bdfc-7680-4b7c-8ff3-968c0ae6a572). Key: (null) [Argument list too long]
[2020-10-25 18:08:37.100701] I [MSGID: 108026] [afr-self-heal-metadata.c:51:__afr_selfheal_metadata_do] 2-gluvol1-replicate-0: performing metadata selfheal on 3088bdfc-7680-4b7c-8ff3-968c0ae6a572
[2020-10-25 18:08:37.101187] W [MSGID: 114031] [client-rpc-fops_v2.c:915:client4_0_getxattr_cbk] 2-gluvol1-client-4: remote operation failed. Path: <gfid:3088bdfc-7680-4b7c-8ff3-968c0ae6a572> (3088bdfc-7680-4b7c-8ff3-968c0ae6a572). Key: (null) [Argument list too long]
I think this is a separate and unrelated bug which I believe is caused by the fact that the file(s) being healed has over 800 hardlinks.
export start_date=`date +%F\ %T`; \
gluster volume heal gluvol1 info \
2>&1 | awk '{print " " $0}'; echo " ${start_date}" && date +\ \ %F\ %T
Brick veriicon:/srv/brick06
<gfid:3088bdfc-7680-4b7c-8ff3-968c0ae6a572>
Status: Connected
Number of entries: 1
...
ls -l /srv/brick06/.glusterfs/30/88/3088bdfc-7680-4b7c-8ff3-968c0ae6a572
-rwxr-xr-x 809 root root 38099 2019-02-21 08:14 /srv/brick06/.glusterfs/30/88/3088bdfc-7680-4b7c-8ff3-968c0ae6a572
Of course, the reboots (or is it the gluster volume start gluvol1 force
) have caused a massive re-signing process, and this is still running:
# Scanning for unsigned objects
2020-10-24 16:39:36 00 Crawling brick [/srv/brick00], scanning for unsigned objects
2020-10-24 16:39:37 00 Completed crawling brick [/srv/brick00] for 180 objects
2020-10-24 16:39:38 01 Crawling brick [/srv/brick01], scanning for unsigned objects
2020-10-24 16:39:44 01 Completed crawling brick [/srv/brick01] for 607 objects
2020-10-24 16:39:34 04 Crawling brick [/srv/brick04], scanning for unsigned objects
2020-10-24 16:40:01 04 Completed crawling brick [/srv/brick04] for 1364 objects
2020-10-24 16:39:36 05 Crawling brick [/srv/brick05], scanning for unsigned objects
2020-10-25 03:25:37 05 Completed crawling brick [/srv/brick05] for 863130 objects
2020-10-25 01:00:52 06 Crawling brick [/srv/brick06], scanning for unsigned objects
... 06 Triggering signing for 2414900 objects
2020-10-24 16:39:38 07 Crawling brick [/srv/brick07], scanning for unsigned objects
... 07 Triggering signing for 1729981 objects
2020-10-25 17:59:36
Although I have enough of my gluster system to continue normal usage, I feel it is somewhat compromised.
Would it be possible summarise what went wrong that essentially made gluster unstable?
I believe in, support and use open source software. Because of peer review of source code, user feedback and so on, such software should become better than commercial equivalents. Hence I report bugs in the interest of making open source products the best and to prevent others being affected by such bugs. I am always grateful to coders/contributers who analyse and fix bugs.
To prevent a recurrence of the issue, you can configure the option "storage.max-hardlinks" to a lower value so that clients won't be able to create a hardlink once the limit has been crossed.
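For reference, changing the option on a volume would look something like this (volume name taken from this thread; the value shown is just the current default, used here for illustration):
# lower the hardlink limit on the affected volume
gluster volume set gluvol1 storage.max-hardlinks 100
# verify the value actually in effect
gluster volume get gluvol1 storage.max-hardlinks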
To prevent a recurrence of the issue, you can configure the option "storage.max-hardlinks" to a lower value so that clients won't be able to create a hardlink once the limit has been crossed.
Should we have a lower default value for this option? Say 42 (i.e., a sane random value). That way, we can prevent the bad experience Bockeman ran into by surfacing an error to the application much earlier. After that, they can decide whether the value needs to be increased or not depending on their use case.
My suggestion is that, however restrictive, we should keep default options that prevent any borderline issues like this and make sure glusterfs provides good performance and stability. Users can alter the options only when they know what their use case is, and that should be allowed, as they will be responsible for that particular use case.
To prevent a recurrence of the issue, you can configure the option "storage.max-hardlinks" to a lower value so that clients won't be able to create a hardlink once the limit has been crossed.
Won't that make the application unusable on glusterfs?
I don't see the value of this option. If we set it to a lower value, the application creating hardlinks will fail. If the user increases the value (because they are actually creating more hardlinks) we'll have a crash. And even worse, self-heal is unable to heal those files because the size of the xattrs is too big.
What we need to do in this case is to disable gfid2path and any other feature that requires per-hardlink data (besides fixing the crash, of course). Even if we fix the crash and make it possible to handle hundreds of xattrs, it will be very bad from a performance point of view.
The current default value is 100. On XFS I tried to create 100 hardlinks but could not create more than 47; after the hardlink count reaches 47, setxattr throws the error "No space left on device".
{.key = {"max-hardlinks"},
 .type = GF_OPTION_TYPE_INT,
 .min = 0,
 .default_value = "100",
 .op_version = {GD_OP_VERSION_4_0_0},
 .flags = OPT_FLAG_SETTABLE | OPT_FLAG_DOC,
 .tags = {"posix"},
 .validate = GF_OPT_VALIDATE_MIN,
 .description = "max number of hardlinks allowed on any one inode.\n"
                "0 is unlimited, 1 prevents any hardlinking at all."},
I think we need to restrict the maximum value of max-hardlinks; I don't think that after restricting/configuring max-hardlinks an application will be unable to use glusterfs.
Would it be possible summarise what went wrong that essentially made gluster unstable?
The segmentation fault happens because we use a stack-allocated buffer to store the contents of the xattrs. This is done in two steps: first we get the needed size, and then we allocate a buffer of that size on the stack to store the data. The problem happens because of the combination of 2 things:
- the total size of the xattrs on this file is about 257 KiB (263507 bytes in the coredump), and
- the iot_worker thread that services the request has a stack of only 256 KiB.
This causes a segmentation fault when trying to allocate more space than is available on the stack.
- Is there a set of preventative measures that could be adopted to prevent recurrence?
In your particular case I would recommend disabling the gfid2path feature. You also seem to be using quota. Quota works on a per-directory basis, but given you have multiple hardlinks, I'm not sure it makes sense (to which directory should the quota be accounted?). If not strictly necessary, I would also disable quota.
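A sketch of how those two changes might be applied (option and command names as I understand them; please double-check them against your gluster version before running):
# stop recording a per-hardlink gfid2path xattr on new files
gluster volume set gluvol1 storage.gfid2path off
# disable quota if it is not strictly needed
gluster volume quota gluvol1 disable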
This doesn't fix the existing issues. Disabling gfid2path will prevent creating additional xattrs for new files or new hardlinks, but it won't delete existing ones. We would need to sweep all the files of each brick and remove them. However, standard tools (getfattr) don't seem to support big xattrs either, so I'm not sure how to do that unless btrfs has some specific tool (I don't know btrfs).
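One crude way to sweep a brick for other affected files is to flag every file whose xattrs getfattr cannot even list (the same "Argument list too long" failure seen above). This is only a detection sketch and removes nothing:
# run on each server, once per brick; prints files whose xattr listing fails
find /srv/brick06/.glusterfs -type f -print0 |
while IFS= read -r -d '' f; do
    getfattr -m . -e hex -d "$f" > /dev/null 2>&1 || echo "SUSPECT: $f"
done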
- Are there more checks and safeguards that could/should be implemented in the code?
Sure. We'll need to improve that in the current code, at least to avoid a crash and return an error if the size is too big.
- Please could you suggest some steps that could be taken to sanitise and discover things likely to cause failure, including:
As I've just commented, we should disable some features and clean existing xattrs, but I don't know how to do that on btrfs if getfattr doesn't work.
- files that exist on more than one brick
You are using a replica. It's expected to have the same file in more than one brick.
- files with properties that break current limits (too many hardlinks, too many attributes, ...)
Probably any file that returns an error for getfattr will also have issues in Gluster.
- dangling gfid files, where the named file has been deleted directly from a brick, but not the corresponding gfid file
You should never do this. It can cause more troubles.
To find them, this command should work:
find <brick root>/.glusterfs/ -type f -links 1
Any file returned inside /.glusterfs/<xx>/<yy>/ with a single link could be removed (be careful to not do this when the volume has load. Otherwise find
could incorrectly detect files that are still being created but have not fully completed).
This won't find symbolic links that have been deleted.
The current default value is 100. On XFS I tried to create 100 hardlinks but could not create more than 47; after the hardlink count reaches 47, setxattr throws the error "No space left on device".
This is caused because you are using XFS and it limits the xattr size to 64KiB. It's not a limitation on the number of hardlinks. XFS and Gluster can create many more. But when we also add an xattr for each hardlink, the limit becomes the xattr size.
Apparently btrfs doesn't have a 64 KiB limit. That's why the issue has happened without detecting any error.
I think we need to restrict the maximum value of max-hardlinks; I don't think that after restricting/configuring max-hardlinks an application will be unable to use glusterfs.
If the application needs to create 100 hardlinks, it won't work on Gluster if we don't allow more than 47. So the application won't be usable on Gluster.
Thanks for clarifying it more.
I much appreciate all the comments above from @mohit84, @xhernandez, @pranithk and @amarts. I have been beset with issues, most of which are off topic, but including local disks filling up (e.g. with a 10G bitd.log). I intend to digest and respond to all comments.
@mohit84 wrote
To prevent a recurrence of the issue, you can configure the option "storage.max-hardlinks" to a lower value so that clients won't be able to create a hardlink once the limit has been crossed.
More background info: In addition to using gluster to provide HA SAN storage for a variety of purposes including cloud storage using Nextcloud, there was a requirement to move the contents from several years of rsnapshot onto one of the volumes.
Prior to attempting to restore the rsnapshot
gluster volume get gluvol1 storage.max-hardlinks \
2>&1 | awk '{print " " $0}'; date +\ \ %F\ %T
Option Value
------ -----
storage.max-hardlinks 10000000
I had set max-hardlinks to an aggressively high number, based on the number of files to be transferred (i.e. 10,000,000 being more than 5,313,170), because, at the time, I did not know the maximum number of hard links actually present (9,774).
Before the crash (brick06 offline, the topic of this issue) all the contents had been transferred onto gluster, apparently successfully because checksums matched the source. However, transferring this data took place over several weeks, not all in one go, and consequently not all of the hardlinks were preserved (not a requirement).
I am still checking that the data on the gluster target volume matches the source. So far I have not found any discrepancy (apart from differences in permissions, ownership and timestamps). So I am assuming that, in general, gluster is handling inodes with over 1,000 hard links. However, some operations, like healing one inode/file with 908 hardlinks, are stuck.
Am I right to assume that "storage.max-hardlinks" being too low is not the cause of the problem and that having a higher value does nothing to prevent recurrence of the issue?
@amarts wrote
To prevent a recurrence of the issue, you can configure the option "storage.max-hardlinks" to a lower value so that clients won't be able to create a hardlink once the limit has been crossed.
Should we have a lower default value for this option? Say 42 (i.e., a sane random value). That way, we can prevent the bad experience Bockeman ran into by surfacing an error to the application much earlier. After that, they can decide whether the value needs to be increased or not depending on their use case.
My suggestion is that, however restrictive, we should keep default options that prevent any borderline issues like this and make sure glusterfs provides good performance and stability. Users can alter the options only when they know what their use case is, and that should be allowed, as they will be responsible for that particular use case.
I agree, my data set is unusual, and I do not mind having to set an option to override any default. However, I would like to be assured that, whatever limits there are, they are handled gracefully, such that the user/admin can make adjustments before there is any risk of corruption.
@pranithk wrote
Won't that make the application unusable on glusterfs?
I'm not sure what "application" you are imagining. Gluster is providing "storage", and if some of that storage contains backup or snapshot data, any user can read that, restore it to their own user area, and run whatever application is desired.
My 11TB of data, some of which has more than 100 hardlinks per file/inode, appears to be usable.
Description of problem: One brick on one server is offline and all attempts to bring it back online have failed. The corresponding brick on the other (of a replica 2) server is ok. Other bricks are ok.
The following do not clear the problem:
The problem appears to be similar to https://github.com/gluster/glusterfs/issues/1531 but the cause is different, and the number of volumes and bricks is different. (I note the observation comment regarding "replica 2" and split-brain, but the cost (time/effort) to recover from split-brain is manageable and usually due to external causes, such as a power cut.)
My urgent need is to find a way out of the current situation and bring back online brick06 on the second server. Not so urgent is the need for gluster to handle this condition in a graceful way and report to the user/admin what is the real cause of the problem and how to fix it (if it cannot be fixed automatically).
The exact command to reproduce the issue: Not sure what actually caused this situation to arise, but activity at the time was:
- Multiple clients, all active, but with minimal activity.
- Intense activity from one client (actually one of the two gluster servers): a scripted "chown" on over a million files, which had been running for over 5 hours and was 83% complete.
- An edit or "sed -i" on a 500MB script file (but this should not have tipped over the 22GB Mem + 8GB Swap).
The full output of the command that failed:
Expected results: Some way to bring that brick back online.
- The output of the gluster volume info command:
- The operating system / glusterfs version: Fedora F32
  Linux veriicon 5.8.15-201.fc32.x86_64 #1 SMP Thu Oct 15 15:56:44 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
  Linux verijolt 5.8.15-201.fc32.x86_64 #1 SMP Thu Oct 15 15:56:44 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
  glusterfs 7.8
Additional info:
snippet from /var/log/messages
snippet from /var/log/glusterfs/bricks/srv-brick06.log