gluster / glusterfs

Gluster Filesystem : Build your distributed storage in minutes
https://www.gluster.org
GNU General Public License v2.0

Massive Latency spikes (resulting in trashing and requiring a force-remount to resolve) #3356

Closed jkroonza closed 2 years ago

jkroonza commented 2 years ago

Description of problem:

We run a maildir structure off a glusterfs filesystem (to allow multiple servers for redundancy, 2 in this case). This filesystem is laid out as a 2x2 replica-distribute. The underlying bricks reside on dedicated mounts, running off a PERC H700 RAID controller with 2 x 4TB NL-SAS drives. The two primary servers handle ~150 courier-imapd processes at any point in time, the two backups ~15 each (normally none).

We are also wondering if global threading pools will help in any way. There is also the consideration that too many threads can cause more lock contention than necessary (which global threading claims to reduce). Is there thread profiling available to determine this and provide feedback on the options here?

From time to time IO operations via FUSE start taking excessively long (>50s), frequently resulting in cases where we need to kill all accessing (often blocking) processes (they die with umount -fl, assuming a kill has been sent prior). There seems to be a timing correlation between different clients, but not always. The code used to measure this latency (max 50s before we currently force-remount) roughly corresponds with the following (logging portion only):

mp=/var/spool/mail
timeout=50

stime=$(date +%s.%3N)
touch "${mp}/.glustertest.$(hostname)" &
tpid=$!

sleep $timeout &
spid=$!

expid=
wait -fn -p expid
excode=$?

etime=$(date +%s.%3N)
dtime=$( bc <<<"$etime - $stime" | sed 's/^\./0./') # sed inserts leading 0 if needed.
echo "$stime $dtime" >> /var/log/gluster_timings${mp//\//_}.log

The reason for the wait complexity, rather than using timeout, is that timeout often blocks for far longer than 50s because the touch goes into uninterruptible wait. We've tried not doing the remount at all, and things just get worse from there (load averages start going through the roof and clients start complaining verbosely, so it really is the lesser of two extreme evils).
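
For completeness, a minimal sketch of how the outcome of that race can be handled (illustrative only; our actual script does more, and the kill may simply be ignored while the touch sits in uninterruptible wait):

if [ "$expid" = "$spid" ]; then
    # the sleep finished first: the touch exceeded $timeout, so force a remount
    kill "$tpid" 2>/dev/null
    umount -fl "$mp" && mount "$mp"
else
    # the touch won the race: clean up the leftover sleep
    kill "$spid" 2>/dev/null
fi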

This allows us to calculate an average and standard deviation for these touch times:

numbers=($(cat gluster_timings_var_spool_mail.log | awk '{ print $2 }'))
total=$(bc < <( echo -n 0; for n in "${numbers[@]}"; do echo -n " + $n"; done; echo ))
avg=$(bc <<<"scale=3; $total / ${#numbers[@]}")
below=$(bc <<<"b=0; $(for n in "${numbers[@]}"; do echo "b+=($n<$avg)"; done); b")
stddev=$(bc <<<"scale=3; s=0; $(for n in "${numbers[@]}"; do echo "s+=($n - $avg)^2"; done); sqrt(s/${#numbers[@]})")
echo -e "avg: $avg\nbelow avg: ${below} / ${#numbers[@]} ($(bc<<<"scale=2; ${below}*100/${#numbers[@]}") %)\nstddev: $stddev"

This gives on the four servers running clients:

avg: 3.238
below avg: 1416 / 1578 (89.73 %)
stddev: 10.805

avg: 2.048
below avg: 1486 / 1592 (93.34 %)
stddev: 8.774

avg: .028
below avg: 1500 / 1608 (93.28 %)
stddev: .184

avg: .026
below avg: 1499 / 1607 (93.27 %)
stddev: .094

The latter two servers are primarily used for inbound filtering and are set to handle 1/11th of the overall load between them (haproxy, with the primary two servers set to weight 10 and the two normally backup-only servers set to weight 1 - as a debug step to try and determine whether it's FUSE or the bricks). We're currently copying in ~1.3TB worth of data via FUSE on the last server above (using rsync).

As can be plainly seen, the averages on the latter two servers are MUCH lower than on the primary two, somewhat expected since they don't access the filesystem as much. The problem has been with us for a while, but since we replaced the SATA drives with NL-SAS it has definitely gotten better. It's not solved, however, and it's currently worse than ever with the data being copied in.

Counting only entries taking over 5 seconds (arbitrary value), we get 99 and 157 entries on the two primary servers, and only one on the backup/standby servers. Please refer to long_times.csv - fields are seconds from epoch, date/time, time taken for touch, server - in my opinion it's clear that there are clusters here in terms of time, frequently involving multiple servers:

1648031761.732,Wed 23 Mar 2022 12:36:01 SAST,49.964,uriel
1648031881.413,Wed 23 Mar 2022 12:38:01 SAST,33.037,uriel
1648031941.973,Wed 23 Mar 2022 12:39:01 SAST,13.688,bagheera
1648032001.416,Wed 23 Mar 2022 12:40:01 SAST,49.969,uriel
1648032002.036,Wed 23 Mar 2022 12:40:02 SAST,49.956,bagheera
1648032121.358,Wed 23 Mar 2022 12:42:01 SAST,8.051,uriel
1648032181.474,Wed 23 Mar 2022 12:43:01 SAST,34.679,uriel
1648032241.977,Wed 23 Mar 2022 12:44:01 SAST,9.939,bagheera
1648032301.278,Wed 23 Mar 2022 12:45:01 SAST,49.968,uriel
1648032302.029,Wed 23 Mar 2022 12:45:02 SAST,21.952,bagheera
1648032361.148,Wed 23 Mar 2022 12:46:01 SAST,25.050,bagheera
1648032421.181,Wed 23 Mar 2022 12:47:01 SAST,21.824,bagheera
1648032481.297,Wed 23 Mar 2022 12:48:01 SAST,49.973,bagheera
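
For reference, counts like these can be pulled straight from the timing log with something along the following lines (field 2 holds the measured touch duration, and 5s is the arbitrary cut-off mentioned above):

awk '$2 > 5 { n++ } END { print n+0 }' /var/log/gluster_timings_var_spool_mail.log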

These servers have ~150 courier-imapd processes each at any given point in time. Numerous pop3d processes get started and stopped (the majority of users use POP3, with a fair number using IMAP; it's currently a holiday period in SA, so this isn't even normal load).

Filesystem usage stats for the two filesystems on the bricks:

uriel [11:13:25] ~ # df -m /mnt/gluster/; df -i /mnt/gluster/
Filesystem              1M-blocks    Used Available Use% Mounted on
/dev/mapper/lvm-gluster   3096240 2181126    915098  71% /mnt/gluster
Filesystem                 Inodes   IUsed     IFree IUse% Mounted on
/dev/mapper/lvm-gluster 201326592 9328669 191997923    5% /mnt/gluster

bagheera [11:15:04] ~ # df -m /mnt/gluster/; df -i /mnt/gluster/
Filesystem              1M-blocks    Used Available Use% Mounted on
/dev/mapper/lvm-gluster   3096240 2181332    914892  71% /mnt/gluster
Filesystem                 Inodes   IUsed     IFree IUse% Mounted on
/dev/mapper/lvm-gluster 201326592 9357588 191969004    5% /mnt/gluster

The inode discrepancy is the subject of another bug (#3196).

As you can plainly see, the average file size is ~240KB. There are certain folders with more files than others.

The filesystems are ext4, with the following features set:

Filesystem features:      has_journal ext_attr resize_inode filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize

NOTE: No dir_index, as this has been found to actually worsen readdir() performance instead of improving it. We're happy to switch this back on if there is some confidence that it will improve things overall, but we'd preferably like some way of confirming this before making the change (it's significant filesystem downtime to enable/disable this feature, and we're going to have a fair number of gfid split-brains as a result due to the way maildir functions).
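
For reference, toggling that feature boils down to roughly the following against an unmounted brick device (device name taken from the df output above; the e2fsck -fD pass that rebuilds the directory indexes is what makes this significant downtime):

# enable directory indexing, then rebuild/optimise the directory indexes
tune2fs -O dir_index /dev/mapper/lvm-gluster
e2fsck -fD /dev/mapper/lvm-gluster

# disabling it again is the inverse
tune2fs -O ^dir_index /dev/mapper/lvm-gluster
e2fsck -fD /dev/mapper/lvm-gluster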

Worst files/folder counts:

bagheera /mnt/gluster # cat ~/gluster_brick_files_per_folder.txt | sort -gr | head | while read C P; do echo -e "$C\t${P%%/*}/$(md5sum <<<"${P#*/}" | sed -re 's/ .*//')"; done
302614  mail-a/355fe515176ef47e26cf2c949c33f3c6
239402  mail-b/355fe515176ef47e26cf2c949c33f3c6
69720   mail-a/3b4f9277b7fd9a326e974c3d4a8ea1ef
68972   mail-b/3b4f9277b7fd9a326e974c3d4a8ea1ef
67548   mail-b/c3a5f807c49d829d0131741a006e6a89
67392   mail-a/c3a5f807c49d829d0131741a006e6a89
36881   mail-b/d9e95ea15b8957423aaee669afa40cf7
36788   mail-a/d9e95ea15b8957423aaee669afa40cf7
36575   mail-a/84a56cd8b4a1a94c24f6465ae6e504e1
35763   mail-a/a967864278ec9772c277d74c0e360a5d

For uriel:

302614  mail-b/355fe515176ef47e26cf2c949c33f3c6
239402  mail-a/355fe515176ef47e26cf2c949c33f3c6
69720   mail-b/3b4f9277b7fd9a326e974c3d4a8ea1ef
68972   mail-a/3b4f9277b7fd9a326e974c3d4a8ea1ef
67548   mail-a/c3a5f807c49d829d0131741a006e6a89
67392   mail-b/c3a5f807c49d829d0131741a006e6a89
36881   mail-a/d9e95ea15b8957423aaee669afa40cf7
36788   mail-b/d9e95ea15b8957423aaee669afa40cf7
36575   mail-b/84a56cd8b4a1a94c24f6465ae6e504e1
35764   mail-b/a967864278ec9772c277d74c0e360a5d

Plainly there are a handful of terrible folders; the worst seems to be a deleted-messages folder which I'll also clean out momentarily (>540k files if summing the mail-a and mail-b sides together). Running time ls on these four folders on the bricks consistently gives 0.5s times on recurring ls runs, and between 10 and 30 seconds on the initial "cold" run (which also indicates that disk caches etc are still playing a role here).

FUSE mount point (I can't think of any FLOCKs on this filesystem, so perhaps NFS is a potential alternative? Even though my benchmarks a few years back indicated that NFS performed worse than FUSE, especially in relation to IO latency), mounted from fstab, resulting in this command:

/usr/sbin/glusterfs --reader-thread-count=32 --fuse-mountopts=noatime,nosuid,noexec,nodev --process-name fuse --volfile-server=localhost --volfile-id=gv_mail --fuse-mountopts=noatime,nosuid,noexec,nodev /var/spool/mail

Increasing the various thread counts helped, but it didn't solve the problem. The same goes for performance.cache-size.

Unfortunately not all the documentation is clear as to where (client/server) options take effect, and whether or not remounts/brick restarts are required. We've restarted all bricks since changing the last settings, and we've also remounted all clients.

We ran similar touch-based checks directly on the underlying brick filesystems and are not seeing the same effects there.

The exact command to reproduce the issue:

There is no specific, simple way we could isolate this, and we've been at it since last week already.

The full output of the command that failed:

There is no command that's failing; things are just unacceptably slow. gluster volume performance info peek output attached: performance.txt

Expected results:

We don't expect to see excessive latency spikes (we're not so sensitive that we need microsecond-range latency consistently, but we cannot afford for one blocking transaction to completely back-log the system).

Mandatory info: - The output of the gluster volume info command:

Volume Name: gv_mail
Type: Distributed-Replicate
Volume ID: fdb043f4-f290-4ca3-b747-8c02d917fa75
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: bagheera:/mnt/gluster/mail-a
Brick2: uriel:/mnt/gluster/mail-b
Brick3: uriel:/mnt/gluster/mail-a
Brick4: bagheera:/mnt/gluster/mail-b
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
performance.least-prio-threads: 4
performance.low-prio-threads: 32
performance.normal-prio-threads: 32
performance.high-prio-threads: 32
performance.io-thread-count: 32
diagnostics.client-log-level: WARNING
diagnostics.brick-log-level: WARNING
cluster.granular-entry-heal: enable
cluster.data-self-heal-algorithm: full
performance.open-behind: off
client.event-threads: 32
server.event-threads: 64
performance.cache-size: 1GB
performance.readdir-ahead: on
cluster.data-self-heal: off
cluster.entry-self-heal: off
cluster.metadata-self-heal: off
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: on
performance.write-behind: off
cluster.self-heal-daemon: enable
cluster.locking-scheme: granular

- The output of the gluster volume status command:

Status of volume: gv_mail
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick bagheera:/mnt/gluster/mail-a          53156     0          Y       13771
Brick uriel:/mnt/gluster/mail-b             60847     0          Y       15455
Brick uriel:/mnt/gluster/mail-a             52199     0          Y       15571
Brick bagheera:/mnt/gluster/mail-b          58242     0          Y       13869
Self-heal Daemon on localhost               N/A       N/A        Y       21206
Self-heal Daemon on bagheera.interexcel.co.za  N/A    N/A        Y       6012

Task Status of Volume gv_mail
------------------------------------------------------------------------------
There are no active volume tasks

- The output of the gluster volume heal command:

- Provide logs present on following locations of client and server nodes - /var/log/glusterfs/

- Is there any crash? Provide the backtrace and coredump

The FUSE mount gets to a point where we have to force a remount.

Additional info:

- The operating system / glusterfs version:

Gentoo Linux, glusterfs 10.1

mohit84 commented 2 years ago

Can you share the below info from the system (server and client both)?

1) brick logs and client logs
2) cat /proc/cpuinfo
3) free -m
4) ps -aef | grep gluster
5) Run the below script for around 3 to 5 min while the performance is slow:

   while [ 1 ]
   do
     high_cpu=100.0
     for pid in `pgrep gluster`
     do
        cpu=`top -b -n 1 -p $pid | tail -1 | awk '{print $9}'`
        if (( $(echo "$cpu > $high_cpu" | bc -l) )); then
             echo "Capture pstack for the next 30 attempts for pid $pid, cpu usage is $cpu"
             for i in {1..30}
             do
                 echo "Capture pstack attempt $i"
                 pstack $pid
                 sleep 3
             done
        fi
     done
     sleep 20
   done

6) Generate statedumps for the brick and client processes. To generate a statedump, run
   kill -SIGUSR1 <gluster_process>; the statedump will be saved in /var/run/gluster
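
For example, a minimal sketch that signals every gluster brick (glusterfsd) and client/daemon (glusterfs) process on one node (process names as they appear in ps), so the dumps land under /var/run/gluster:

for pid in $(pgrep -x glusterfsd) $(pgrep -x glusterfs); do
    kill -SIGUSR1 "$pid"      # each gluster process writes a statedump on SIGUSR1
done
ls -lt /var/run/gluster/ | head
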
jkroonza commented 2 years ago

Herewith, in the meantime, points 2, 3 and 4:

bagheera-free.txt bagheera-ps_gluster.txt bagheera-cpuinfo.txt uriel-ps_gluster.txt uriel-free.txt uriel-cpuinfo.txt garmr-ps_gluster.txt garmr-free.txt garmr-cpuinfo.txt othala-cpuinfo.txt othala-ps_gluster.txt othala-free.txt

Monitor script updated to not reload but still keep notifying us.

Regarding 1 - looking to wait until the next cycle for 5 and 6 as well. Currently the log folders across the system total around 5GB, so I'll just filter for today's logs and then provide a download link, if that's OK?

mohit84 commented 2 years ago

Monitor script updated to not reload but still keep notifying us.

Thanks, I will wait for the other info.

jkroonza commented 2 years ago

We don't have pstack, and I will need to re-install with debug info for gdb - are gdb backtraces OK instead of pstack? Do you need them for several minutes, or can I just grab point-in-time backtraces when I detect the state and then proceed to remount? Happy to push on for a little bit before re-mounting, but not overly long. I assume you need both bricks and clients when this happens? So I really need to coordinate this overall?

jkroonza commented 2 years ago

Ok, so using gdb -batch -p $pid -ex "thread apply all bt" is not viable ... it takes several minutes just to attach to the offending process, and during this time things get considerably worse. Does pstack perform better in this regard? In which case I'll make a plan to get it deployed. The only process that reached 100% CPU was one of the FUSE processes.

We've noticed another interesting one now ...

During a massive, massive LA spike, where all four servers went >100 LA, various stuff broke, in some cases to the point where even bash wouldn't return a shell (even though re-ssh'ing and then executing whatever commands was fine). Normally we could just remount on the affected server; this time around we had to kill all bricks and restart (which we did in a rolling manner: all bricks on bagheera, start ... force, then the same on uriel after attempting to wait for heal, after which a remount of all clients resolved the situation for the time being).

gluster volume heal gv_mail info blocked (whilst waiting for the inter-brick restart heal to complete) ... and upon remount of one of the mountpoints this too recovered. We have no idea what this means, but we do believe that somehow there is interaction between the different clients, and that it's the bricks at the end of the day that are causing the problem somehow.

Since we restarted all bricks, and during recovery, we noticed several recurring files of the form /pm/pmbadmin/tmp/1648212670.M457337P496_courierlock.uriel (which uses the dotlockfile scheme based on the source code from courier-libs).

During this time, the only log entries (unless I missed a few) were all similar to this:

[2022-03-20 04:57:08.192186 +0000] W [MSGID: 114031] [client-rpc-fops_v2.c:2618:client4_0_lookup_cbk] 0-gv_mail-client-2: remote operation failed. [{path=/mm/mmm2/cur/1607408671.H658815P30781.garmr.iewc.co.za,S=99016:2,}, {gfid=8e86fd06-5bca-4533-a81f-ea56ae11fc70}, {errno=2}, {error=No such file or directory}] 
[2022-03-20 05:27:07.078316 +0000] W [MSGID: 114031] [client-rpc-fops_v2.c:2618:client4_0_lookup_cbk] 0-gv_mail-client-3: remote operation failed. [{path=/mm/mmm2/cur/1607392399.H133704P27717.garmr.iewc.co.za,S=85161:2,}, {gfid=d110a3e9-a10a-4dd9-a673-af40c0da1977}, {errno=2}, {error=No such file or directory}] 
[2022-03-20 05:27:07.078403 +0000] W [MSGID: 114031] [client-rpc-fops_v2.c:2618:client4_0_lookup_cbk] 0-gv_mail-client-2: remote operation failed. [{path=/mm/mmm2/cur/1607392399.H133704P27717.garmr.iewc.co.za,S=85161:2,}, {gfid=d110a3e9-a10a-4dd9-a673-af40c0da1977}, {errno=2}, {error=No such file or directory}] 
[2022-03-20 05:27:07.081434 +0000] W [MSGID: 114031] [client-rpc-fops_v2.c:2618:client4_0_lookup_cbk] 0-gv_mail-client-3: remote operation failed. [{path=/mm/mmm2/cur/1607408671.H658815P30781.garmr.iewc.co.za,S=99016:2,}, {gfid=8e86fd06-5bca-4533-a81f-ea56ae11fc70}, {errno=2}, {error=No such file or directory}]

So I'm afraid they're not much use. On the bricks themselves all entries were similar to this:

[2022-03-25 12:53:12.178655 +0000] E [MSGID: 113037] [posix-entry-ops.c:2206:posix_link] 0-gv_mail-posix: link /mnt/gluster/mail-a/.glusterfs/33/0f/330f8443-c92b-46b1-ab2d-aebdff6b703d to /mnt/gluster/mail-a/im/imediajason/.Sent Messages/tmp/courier.lock failed [File exists]
[2022-03-25 12:53:12.194358 +0000] E [MSGID: 113037] [posix-entry-ops.c:2206:posix_link] 0-gv_mail-posix: link /mnt/gluster/mail-a/.glusterfs/bf/19/bf196965-f709-40e5-9c03-d8a93e95096e to /mnt/gluster/mail-a/im/imediajason/.Sent Messages/tmp/courier.lock failed [File exists]
[2022-03-25 12:53:12.210769 +0000] E [MSGID: 113037] [posix-entry-ops.c:2206:posix_link] 0-gv_mail-posix: link /mnt/gluster/mail-a/.glusterfs/88/0e/880ee633-ceaf-45eb-b4ee-99e80c87756f to /mnt/gluster/mail-a/im/imediajason/.Sent Messages/tmp/courier.lock failed [File exists]
[2022-03-25 12:53:12.234188 +0000] E [MSGID: 113037] [posix-entry-ops.c:2206:posix_link] 0-gv_mail-posix: link /mnt/gluster/mail-a/.glusterfs/6b/17/6b17688b-486e-4a5a-b5e4-830d44cc675a to /mnt/gluster/mail-a/im/imediajason/.Sent Messages/tmp/courier.lock failed [File exists]
[2022-03-25 12:53:12.256100 +0000] E [MSGID: 113037] [posix-entry-ops.c:2206:posix_link] 0-gv_mail-posix: link /mnt/gluster/mail-a/.glusterfs/91/a7/91a7eff9-2c9d-4582-959c-bc1e82af3029 to /mnt/gluster/mail-a/im/imediajason/.Sent Messages/tmp/courier.lock failed [File exists]
[2022-03-25 12:53:12.287119 +0000] E [MSGID: 113037] [posix-entry-ops.c:2206:posix_link] 0-gv_mail-posix: link /mnt/gluster/mail-a/.glusterfs/d9/7a/d97a5f28-9428-44fd-9e55-7572ab02e0b6 to /mnt/gluster/mail-a/im/imediajason/.Sent Messages/tmp/courier.lock failed [File exists]

And (presumably when we remount):

[2022-03-25 12:53:47.800559 +0000] W [socket.c:750:__socket_rwv] 0-tcp.gv_mail-server: readv on 154.73.34.73:49140 failed (No data available)
[2022-03-25 12:53:51.656247 +0000] W [socket.c:750:__socket_rwv] 0-tcp.gv_mail-server: readv on 154.73.34.73:49121 failed (No data available)
[2022-03-25 12:53:53.491409 +0000] W [socket.c:750:__socket_rwv] 0-tcp.gv_mail-server: readv on 154.73.34.73:49110 failed (No data available)
[2022-03-25 12:53:54.880257 +0000] W [socket.c:750:__socket_rwv] 0-tcp.gv_mail-server: readv on 154.73.34.73:49103 failed (No data available)
[2022-03-25 12:55:01.058983 +0000] E [MSGID: 113001] [posix-inode-fd-ops.c:3888:posix_getxattr] 0-gv_mail-posix: getxattr failed on /mnt/gluster/mail-a/.glusterfs/5c/9a/5c9aba67-4400-42d6-a27f-117761648e9b (path: <gfid:5c9aba67-4400-42d6-a27f-117761648e9b>): trusted.glusterfs.dht.linkto  [No such file or directory]
[2022-03-25 12:55:03.459032 +0000] E [MSGID: 113052] [posix-inode-fd-ops.c:420:posix_setattr] 0-gv_mail-posix: setattr (utimes) on gfid-handle /mnt/gluster/mail-a/ci/cis10/tmp (path: /ci/cis10/tmp) failed [Invalid argument]
[2022-03-25 12:55:16.710372 +0000] W [inodelk.c:617:pl_inodelk_log_cleanup] 0-gv_mail-server: releasing lock on c5dc93f9-d494-4cf4-a1f2-ef1ec6ae26f2 held by {client=0x7f42f4087058, pid=3974 lk-owner=c8a40528047f0000}
[2022-03-25 12:55:16.710426 +0000] W [inodelk.c:617:pl_inodelk_log_cleanup] 0-gv_mail-server: releasing lock on 9a44d137-f07d-4942-91f2-d4b3165912a6 held by {client=0x7f42f4087058, pid=4260 lk-owner=58de0470047f0000}
[2022-03-25 12:55:16.710451 +0000] W [inodelk.c:617:pl_inodelk_log_cleanup] 0-gv_mail-server: releasing lock on 77a081bc-1b79-42f7-973d-bc7abb861886 held by {client=0x7f42f4087058, pid=11874 lk-owner=482a0580047f0000}
[2022-03-25 12:55:16.710473 +0000] W [inodelk.c:617:pl_inodelk_log_cleanup] 0-gv_mail-server: releasing lock on def10065-6d3d-4a09-a121-f15b9a5fc23c held by {client=0x7f42f4087058, pid=4776 lk-owner=e86e0268047f0000}
[2022-03-25 12:55:16.710496 +0000] W [inodelk.c:617:pl_inodelk_log_cleanup] 0-gv_mail-server: releasing lock on d2ddc4b7-b612-4bbb-9fc1-68d436f42207 held by {client=0x7f42f4087058, pid=-4 lk-owner=f8650514047f0000}
[2022-03-25 12:55:16.710525 +0000] W [inodelk.c:617:pl_inodelk_log_cleanup] 0-gv_mail-server: releasing lock on d0c905f9-7b55-4522-9b92-68a1799c68dd held by {client=0x7f42f4087058, pid=11742 lk-owner=5887002c047f0000}

A very small number of entries like:

[2022-03-25 12:55:34.610020 +0000] E [MSGID: 113052] [posix-inode-fd-ops.c:420:posix_setattr] 0-gv_mail-posix: setattr (utimes) on gfid-handle /mnt/gluster/mail-a/.glusterfs/f4/b3/f4b3ac37-0ef1-4d40-8104-ad93a589186f/tmp (path: <gfid:092e428a-d14b-41e8-9de4-450e09c1537e>) failed [Invalid argument]
[2022-03-25 12:55:34.628118 +0000] E [MSGID: 113052] [posix-inode-fd-ops.c:420:posix_setattr] 0-gv_mail-posix: setattr (utimes) on gfid-handle /mnt/gluster/mail-a/.glusterfs/0d/33/0d33171d-a90b-475b-9f5a-453a3431b259/tmp (path: <gfid:f3c64521-cc4c-4d3d-9f87-0f8858a06b1e>) failed [Invalid argument]
[2022-03-25 12:55:34.638190 +0000] E [MSGID: 113052] [posix-inode-fd-ops.c:420:posix_setattr] 0-gv_mail-posix: setattr (utimes) on gfid-handle /mnt/gluster/mail-a/.glusterfs/cb/c5/cbc5077f-203b-4a1c-a91f-266762682cdd/tmp (path: <gfid:5e828830-8ec5-4d54-afba-dbb67d7e8ebb>) failed [Invalid argument]
[2022-03-25 12:55:34.767110 +0000] E [MSGID: 113052] [posix-inode-fd-ops.c:420:posix_setattr] 0-gv_mail-posix: setattr (utimes) on gfid-handle /mnt/gluster/mail-a/.glusterfs/64/61/6461160c-a157-42ca-a123-e54bfe0fb3ac/tmp (path: <gfid:013301bf-a722-4456-9592-0f916fb312a7>) failed [Invalid argument]

Will gather all logs in full and attach here later tonight.

mohit84 commented 2 years ago

I think you should first reduce the number of threads for the client and then observe the performance. 1) Mount the volume without declaring reader-thread-count; in the default case reader-thread-count is 1, and a high reader-thread-count would not be useful. 2) Disable performance.client-io-threads. 3) Configure (client|server).event-threads to 4; the current configuration is not useful and you would not get an improvement from it.
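
For the volume in question, those suggestions translate to something like the following (reader-thread-count is a fuse mount argument rather than a volume option, so that part is handled by remounting without it):

gluster volume set gv_mail performance.client-io-threads off
gluster volume set gv_mail client.event-threads 4
gluster volume set gv_mail server.event-threads 4
# then remount the fuse clients without --reader-thread-count (the default is 1)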

jkroonza commented 2 years ago

Done.

Settings changed (bricks killed; shds killed; glusterd restarted) per server node; all filesystems remounted.

Given that we've currently got <5% of clients connected, load averages are feeling excessive, and latency immediately started reporting 20s+, with the highest so far (in less than 15 minutes from the above) being 23.8s. This could still have been caused by SHDs repairing restart damage, so let's discount that for the moment.

Load averages have dropped slightly, but even getting a shell back feels sluggish at the moment.

Updated gluster volume info:

Type: Distributed-Replicate
Volume ID: fdb043f4-f290-4ca3-b747-8c02d917fa75
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: bagheera:/mnt/gluster/mail-a
Brick2: uriel:/mnt/gluster/mail-b
Brick3: uriel:/mnt/gluster/mail-a
Brick4: bagheera:/mnt/gluster/mail-b
Options Reconfigured:
diagnostics.client-log-level: WARNING
diagnostics.brick-log-level: WARNING
performance.open-behind: off
client.event-threads: 4
server.event-threads: 4
performance.cache-size: 1GB
performance.readdir-ahead: on
cluster.data-self-heal: off
cluster.entry-self-heal: off
cluster.metadata-self-heal: off
transport.address-family: inet
nfs.disable: on
performance.write-behind: off
cluster.self-heal-daemon: enable
cluster.locking-scheme: granular
cluster.data-self-heal-algorithm: full
cluster.granular-entry-heal: enable
performance.io-thread-count: 32
performance.high-prio-threads: 32
performance.normal-prio-threads: 32
performance.low-prio-threads: 32
performance.least-prio-threads: 4
diagnostics.latency-measurement: on
diagnostics.count-fop-hits: on
storage.fips-mode-rchecksum: on

A setting I'm not familiar with is fips-mode-rchecksum; not sure where that is coming from. Effect?

You made me lower other thread counts (which is something we discussed internally as well, but decided against on the basis of "well, they'll just sit there and idle, possibly consuming a bit of extra RAM").

... and the first case of forced remount just came in. I'm guessing I'm going to have to revert these changes tomorrow during peak load ... but let's see.

I have also discovered something else by accident:

bagheera [21:27:21] /mnt/gluster/mail-b/nuke_me

So ls -lad /var/spool/mail/nuke-me should reflect data, right? Well, no, apparently this folder doesn't exist at all. In there is a tmp folder of size 41M - even a direct ls on the server (assuming glusterfs is running) is

And a few more cases ... reverting ... this is FAR FAR worse than it was. We can't go into production like this for the coming week. With forced remounts switched off ... basically even a cd into a glusterfs filesystem hangs for several minutes. Need alternative ideas here. The only things I can think of:

  1. dir_index might actually be beneficial.
  2. Underlying filesystem damage
  3. glusterfs brick layout is broken somehow and a full heal is required somehow (we've got other bugs related to the flock stuff that on other systems require us to periodically rebuild full bricks ... just afraid that in this case the process will cause too much load for clients to operate during that time, and it's HIGH risk; on the other set we've got replica 3 so it's much lower risk to do that, plus it's not IO heavy)
jkroonza commented 2 years ago

Just FYI: when the local system gets sluggish, even local filesystem operations that should NOT BE IMPACTED by glusterfs FUSE performance are affected. The moment I take the load off of the server, e.g., an ls on the 41MB tmp/ folder suddenly takes sub 1s, where even just a cd into it previously took minutes to complete. I have no explanation for that.

Is it possible to downgrade glusterfs? It seems that the 8 branch is still the most stable and reliable, so one option is to head back there. I note io_uring has been added, but I require a stop of the volume to enable that - any chance that would help here?

jkroonza commented 2 years ago

So some new information: I took down the bricks on bagheera to enable dir_index and run fsck -fD (a process which will run through the night at the current rate). At this point bagheera feels much more responsive than uriel, with both sides mounted and taking customer load concurrently.

Different theory, not knowing how the kernel IO system works: what's the chance that the kernel and userspace are effectively deadlocking in a way? We've seen something similar with apache and php with what I call "loop back" requests (i.e. [external customer] =HTTP=> [apache] =FCGI=> [php-fpm] =HTTP=> [apache]). Now, if apache has limited "dispatch" (e.g. mod_prefork), there comes a time, when enough external customers are incoming, where the php-fpm processes are effectively not getting their HTTP requests processed by apache. Obviously the fix there is that you shouldn't do http-based loop backs (but this is VERY common practice), or alternatively you cluster stuff such that there are separate nodes dealing with external and internal http requests (so that one can manage resource allocations).

Currently this isn't viable for us, but if we can prove that it solves the issue we can certainly try to make a plan to enable that. This would explain a lot:

processes are blocking on IO, not just those on glusterfs filesystems (i.e. the kernel reaches a maximum number of IO requests and cannot queue any more, so stuff blocks until things can clear).

What it most certainly does not explain is why, when shell processes block, I can ssh in directly next to that and execute things like umount -fl /var/spool/mail, which then immediately releases all blocking processes ... not the release portion, mind you, but why it is even possible to ssh in at all in this case.

The theory is thus that IO requests are issued to the kernel via syscall (read/write/whatever), these calls then block pending space in some IO queue somewhere, and thus they end up not getting issued (either to FUSE or the disk). When it gets to the point where there are enough requests issued to FUSE, and sent on to the bricks, but the bricks aren't getting stuff into the queue due to the queues being full ... the bricks don't get serviced, so glusterfs doesn't get responses from the bricks - or gets them slowly - so it can't respond to the kernel on IO issued to FUSE, so more requests backlog ... eventually everything is just clogged up and nothing ends up working.

mohit84 commented 2 years ago

The fips-mode-rchecksum option is used by posix_rchecksum, and shd is the primary consumer of this fop. If enabled it uses the FIPS-compliant SHA256 checksum, otherwise it uses an MD5 checksum.

As you mentioned you are not able to generate a backtrace of a process through pstack, can you please apply the below steps to generate kernel stack traces when the file system hangs.

1) Enable backtrace generation for blocked processes using the below commands on the nodes where the bricks are hosted.

echo 1 > /proc/sys/kernel/sysrq
echo 'w' > /proc/sysrq-trigger

2) Ensure by checking the file /var/log/messages that backtraces are captured in it.

3) Disable backtrace generation using the below command.

echo 0 > /proc/sys/kernel/sysrq

Share the /var/log/messages from all the nodes.

mohit84 commented 2 years ago

Just FYI: when the local system gets sluggish, even local filesystem operations that should NOT BE IMPACTED by glusterfs FUSE performance are affected. The moment I take the load off of the server, e.g., an ls on the 41MB tmp/ folder suddenly takes sub 1s, where even just a cd into it previously took minutes to complete. I have no explanation for that.

Is it possible to downgrade glusterfs? It seems that the 8 branch is still the most stable and reliable, so one option is to head back there. I note io_uring has been added, but I require a stop of the volume to enable that - any chance that would help here?

Without finding an RCA it would be difficult to suggest a downgrade to release-8; if the issue is on the system side then even a downgrade would not be useful.

jkroonza commented 2 years ago

I've not picked up any D-state glusterfs(d) processes, but I am seeing lots and lots of these (or similar; the calls at the top are always __schedule, schedule, and from there it varies a bit, but it's always some filesystem call I expect to end up on fuse/glusterfs):

[Mon Mar 28 07:46:40 2022] Call Trace:
[Mon Mar 28 07:46:40 2022]  __schedule+0x281/0x5f0
[Mon Mar 28 07:46:40 2022]  schedule+0x45/0xb0
[Mon Mar 28 07:46:40 2022]  schedule_preempt_disabled+0x5/0x10
[Mon Mar 28 07:46:40 2022]  __mutex_lock.isra.0+0x247/0x4a0
[Mon Mar 28 07:46:40 2022]  ? __alloc_pages_nodemask+0xe2/0x180
[Mon Mar 28 07:46:40 2022]  fuse_lock_inode+0x23/0x30
[Mon Mar 28 07:46:40 2022]  fuse_readdir_uncached+0x150/0x6e0
[Mon Mar 28 07:46:40 2022]  ? prep_new_page+0x7d/0xc0
[Mon Mar 28 07:46:40 2022]  fuse_readdir+0x12a/0x5a0
[Mon Mar 28 07:46:40 2022]  ? page_add_new_anon_rmap+0x56/0x100
[Mon Mar 28 07:46:40 2022]  ? handle_mm_fault+0x10cc/0x14b0
[Mon Mar 28 07:46:40 2022]  iterate_dir+0x14e/0x160
[Mon Mar 28 07:46:40 2022]  ksys_getdents64+0x64/0xf0
[Mon Mar 28 07:46:40 2022]  ? do_vfs_ioctl.cold+0x42/0x42
[Mon Mar 28 07:46:40 2022]  __x64_sys_getdents64+0x11/0x20
[Mon Mar 28 07:46:40 2022]  do_syscall_64+0x3e/0x70
[Mon Mar 28 07:46:40 2022]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[Mon Mar 28 07:46:40 2022] RIP: 0033:0x7f074faccc17
[Mon Mar 28 07:46:40 2022] Code: Bad RIP value.
[Mon Mar 28 07:46:40 2022] RSP: 002b:00007ffea11a38c8 EFLAGS: 00000293 ORIG_RAX: 00000000000000d9
[Mon Mar 28 07:46:40 2022] RAX: ffffffffffffffda RBX: 000055c31d984f20 RCX: 00007f074faccc17
[Mon Mar 28 07:46:40 2022] RDX: 0000000000020000 RSI: 000055c31d984f50 RDI: 0000000000000004
[Mon Mar 28 07:46:40 2022] RBP: 000055c31d984f50 R08: 00000000000004c8 R09: 00007f074fbc4c00
[Mon Mar 28 07:46:40 2022] R10: 000000000000008e R11: 0000000000000293 R12: ffffffffffffff88
[Mon Mar 28 07:46:40 2022] R13: 000055c31d984f24 R14: 0000000000000002 R15: 0000000000000000

(directly from dmesg.) I feel this reinforces my theory that the kernel isn't getting the IO requests out to the fuse user-space process.

I've since executed "sysctl -w fs.pipe-max-size=16777216" (based in part on information from https://fuse-devel.narkive.com/FcKo4K4o/rfc-patch-0-5-fuse-make-maximum-read-write-request-size-tunable). I can't find my other sources any more, but effectively, I believe that if the pipe is too small, processes will block waiting for space.
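
For what it's worth, a sketch of making that sysctl survive reboots (the drop-in file name is our own choice):

echo 'fs.pipe-max-size = 16777216' > /etc/sysctl.d/90-fuse-pipe.conf
sysctl --system    # reload all sysctl configuration files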

Now, I do agree (based on everything I've seen and learnt in the last 72 hours) that having 4 fuse reader threads should make no difference; however, practical experience differs from that. I'm not sure how the pipe is handled, or even if setting the above sysctl (the default was 1MB) will (should) make a difference. We've swapped some of the functions on the servers around now to try and avoid the theoretical deadlock scenario described earlier, and so far it seems a lot better (subjective, but my phone is definitely not as busy alerting).

gluster itself has been restored to full redundancy (the downed bricks are back and heal completed). We're looking to compare the load averages of the two servers containing bricks during the day to try and gauge whether dir_index is useful or not; however, with it now handling incoming SMTP as well, that won't be a clear indicator. So far it does seem that LA on the server where dir_index is now active is lower - so it would seem that this helps for this specific use-case (and is detrimental for every other use-case we've ever looked at). This is VERY unscientific though.

And as I'm about to send, the first D fuse process comes through:

[Mon Mar 28 08:58:41 2022] sysrq: Show Blocked State
[Mon Mar 28 08:58:41 2022]   task                        PC stack   pid father
[Mon Mar 28 08:58:41 2022] glfs_fusenoti   D    0 20113      1 0x00000000
[Mon Mar 28 08:58:41 2022] Call Trace:
[Mon Mar 28 08:58:41 2022]  __schedule+0x281/0x5f0
[Mon Mar 28 08:58:41 2022]  ? rwsem_optimistic_spin+0xaf/0x1e0
[Mon Mar 28 08:58:41 2022]  schedule+0x45/0xb0
[Mon Mar 28 08:58:41 2022]  rwsem_down_write_slowpath+0x1dd/0x420
[Mon Mar 28 08:58:41 2022]  fuse_reverse_inval_entry+0x46/0x1e0
[Mon Mar 28 08:58:41 2022]  fuse_dev_do_write+0x42a/0xd10
[Mon Mar 28 08:58:41 2022]  ? futex_wait+0xd3/0x1e0
[Mon Mar 28 08:58:41 2022]  fuse_dev_write+0x41/0x60
[Mon Mar 28 08:58:41 2022]  do_iter_readv_writev+0x12a/0x190
[Mon Mar 28 08:58:41 2022]  do_iter_write+0x78/0x190
[Mon Mar 28 08:58:41 2022]  vfs_writev+0x7b/0xd0
[Mon Mar 28 08:58:41 2022]  ? irq_exit_rcu+0x33/0xa0
[Mon Mar 28 08:58:41 2022]  ? sysvec_apic_timer_interrupt+0x34/0x80
[Mon Mar 28 08:58:41 2022]  ? asm_sysvec_apic_timer_interrupt+0x12/0x20
[Mon Mar 28 08:58:41 2022]  do_writev+0x52/0xe0
[Mon Mar 28 08:58:41 2022]  do_syscall_64+0x3e/0x70
[Mon Mar 28 08:58:41 2022]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[Mon Mar 28 08:58:41 2022] RIP: 0033:0x7f5bcb60713d
[Mon Mar 28 08:58:41 2022] Code: Bad RIP value.
[Mon Mar 28 08:58:41 2022] RSP: 002b:00007f5b91ffa020 EFLAGS: 00000293 ORIG_RAX: 0000000000000014
[Mon Mar 28 08:58:41 2022] RAX: ffffffffffffffda RBX: 00005594220a96c8 RCX: 00007f5bcb60713d
[Mon Mar 28 08:58:41 2022] RDX: 0000000000000001 RSI: 00007f5b91ffa080 RDI: 0000000000000008
[Mon Mar 28 08:58:41 2022] RBP: 00005594220b5840 R08: 0000000000000000 R09: 0000000000000073
[Mon Mar 28 08:58:41 2022] R10: 0000000000000000 R11: 0000000000000293 R12: 00005594220b5880
[Mon Mar 28 08:58:41 2022] R13: 00007f5b38785238 R14: 00007f5b91ffa080 R15: 00007f5b38785218

Pointing back to fuse processing IMHO.

Kernel versions are all 5.8.14 ... for what that information is worth.

jkroonza commented 2 years ago

Backtraces attached. I'll also go read the kernel sources to try and figure out what's going on here, but that's not usual territory for me. uriel+bagheera have been well behaved, and with them not taking any load, garmr+othala have certainly been better behaved.

Full stack traces from the two servers that have run into "remount required":

othala.txt garmr.txt

And summarised:

summarised.txt

By far the two most seen are:

Stack trace seen 285 times:
__schedule+0x281/0x5f0
schedule+0x45/0xb0
schedule_preempt_disabled+0x5/0x10
__mutex_lock.isra.0+0x247/0x4a0
? filemap_map_pages+0x1a8/0x340
lock_rename+0x23/0xb0
do_renameat2+0x22f/0x4a0
__x64_sys_rename+0x1a/0x20
do_syscall_64+0x3e/0x70
entry_SYSCALL_64_after_hwframe+0x44/0xa9

Stack trace seen 253 times:
__schedule+0x281/0x5f0
schedule+0x45/0xb0
schedule_preempt_disabled+0x5/0x10
__mutex_lock.isra.0+0x247/0x4a0
lock_rename+0x23/0xb0
do_renameat2+0x22f/0x4a0
__x64_sys_rename+0x1a/0x20
do_syscall_64+0x3e/0x70
entry_SYSCALL_64_after_hwframe+0x44/0xa9
mohit84 commented 2 years ago

I don't think it is a gluster issue; it seems the issue is either with system resources or the fuse kernel module. @xhernandez Can you please share your view on the same?

mohit84 commented 2 years ago

Can you share the /var/log/messages from one of the nodes where you are seeing these backtraces?

jkroonza commented 2 years ago

Can you share the /var/log/messages from one of the nodes where you are seeing these backtraces?

I'd have to filter quite heavily, is there something specific you're looking for? Or just the kernel log portion?

mohit84 commented 2 years ago

I just want to see the kernel log portion.

mohit84 commented 2 years ago

Can you try to mount the volume with the default options? I have observed that after passing noatime, ls -l performance reduced significantly.

I have tried to mount the volume with the default options, like mount -t glusterfs :/

After creating files, when I tried to run ls -l it was fast, but when I mounted the volume like below (as you are doing)

mount -t glusterfs :/ -onoatime,nosuid,noexec,nodev

and ran ls -l in multiple threads, it was much too slow.

jkroonza commented 2 years ago

Processes are now:

28236 ?        SLsl   0:06 /usr/sbin/glusterfs --process-name fuse --volfile-server=mail.iewc.co.za --volfile-id=gv_mail /var/spool/mail

We've also upgraded at least the above host to kernel 5.17.1 prior to updating the options; with both non-brick hosts upgraded it showed no improvement. However, with the options now removed it does seem better - I am baffled as to WHY though. Consistent 5-7s ls times on the root folder, which contains a mere 400 sub-folders ... at least it's consistent.

It does not seem like the problem is completely resolved, however. We do still periodically see the time to touch a file climb; we've let it loose for the moment and will allow it up to four minutes (at which point most email clients would have given up). Under those circumstances I had an ls take 34m.

We still play "high load ping pong" between the two remote clients ... at any point in time one of the two always wants to go high. We suspect this may be legitimately high load though, with one of the filesystem's clients (courier-imapd processes) doing something strange, resulting in other processes being adversely affected.

For now it seems the IO just slows to a crawl, though, and doesn't completely die. We're continuing to monitor. It seems to be getting better and better (probably as the backlog clears up capacity).

Logs (as per previous, from a clean boot; kernel logs are from the last rotation).

garmr.txt summarised.txt othala.txt garmr-kernel.txt othala-kernel.txt

mohit84 commented 2 years ago

That's good news. Is it possible to capture a new stack trace while you are observing the issue, and also share the volume profile?

jkroonza commented 2 years ago

Please refer to the .txt files in the previous post.

mohit84 commented 2 years ago

There is one other suggestion, specific to a configuration change: you can disable readdir-ahead. In your environment the dht_subvol count is 2, so readdir-ahead would not be very useful. There are also some consistency issues with readdir-ahead, so it would be good to disable it.
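
For the volume in question that would be:

gluster volume set gv_mail performance.readdir-ahead off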

xhernandez commented 2 years ago

@jkroonza have you checked the iostat output when you see those high latencies?

# iostat -xt 1

What are the latencies of the brick disks and system disks?

jkroonza commented 2 years ago

Brick and system disk response times during these periods were in the microsecond (sometimes millisecond) range. We checked this prior to filing this bug report.

We've taken the plunge and switched on io_uring (it was getting to the point where the systems were pretty much unusable anyway).

Loads on the two client-only nodes have now dropped to sub-4 with nearly 100 IMAP connections on each.

The brick nodes are at a variable 20-30 and 10 respectively, with dir_index the only (known) difference between the two. iowait on the node where dir_index is enabled is the higher one. We are scheduling to disable that again and are currently just discussing whether to do that during or after hours.

We don't know what caused the underlying problems; however, io_uring does seem to resolve this. We are now stable for just over an hour during "peak" times over all four hosts, whereas even overnight we barely ever got to as much as 30 minutes on any individual server. We've not had a single touch command exceed 2s (current trip) in the last hour (where the norm was anything from 15 to 30s prior to that).

Our mail queues are slowly working through their backlog (a few thousand to go).

From our side I'd say that anybody who experiences something similar really should try enabling io_uring as follows:

  1. stop services relying on the mounts and umount all clients
  2. gluster volume stop VOLNAME
  3. gluster volume set VOLNAME storage.linux-io_uring on
  4. gluster volume start VOLNAME
  5. mount clients and start services relying on the mounts.
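
For example, a quick way to confirm the option took effect (step 3 above) before remounting the clients:

gluster volume get gv_mail storage.linux-io_uring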

Trust us, take the plunge on the downtime and just do it - you're probably dealing with sporadic downtime due to client failures anyway. Typical ls times had gone from a minimum of 5 to 7 seconds on fresh remounts, growing to minutes; now they're consistently sub 1 second (400 sub-folders in the top-level folder), with sporadic cases going just over that and the highest we've seen since going to io_uring at just under 5s.

We do believe there is still an issue to sort out here, but we will be moving to io_uring going forward and staying there.

jkroonza commented 2 years ago

Then again: we are now getting some EAGAIN errors for imapd processes. We don't know how the changes made triggered this, but it is rather frequent. This started when we took the one set of bricks offline to disable dir_index, so it probably correlates to load issues, and we are contemplating whether some other maximum capacity may have been reached and is now correctly back-propagating into an error that, prior to io_uring, was just absorbed into some queue, resulting in a backlog that just never cleared.

jkroonza commented 2 years ago

There is one other suggestion, specific to a configuration change: you can disable readdir-ahead. In your environment the dht_subvol count is 2, so readdir-ahead would not be very useful. There are also some consistency issues with readdir-ahead, so it would be good to disable it.

This has been updated; do we need to remount/restart the bricks?

There seems to be lots of outdated and conflicting information/documentation around on the internet for glusterfs.

mohit84 commented 2 years ago

There is one other suggestion, specific to a configuration change: you can disable readdir-ahead. In your environment the dht_subvol count is 2, so readdir-ahead would not be very useful. There are also some consistency issues with readdir-ahead, so it would be good to disable it.

This has been updated; do we need to remount/restart the bricks?

There seems to be lots of outdated and conflicting information/documentation around on the internet for glusterfs.

No; in the latest updated configuration (https://github.com/gluster/glusterfs/issues/3356#issuecomment-1080005262) it is still showing.

jkroonza commented 2 years ago

There is one other suggestion, specific to a configuration change: you can disable readdir-ahead. In your environment the dht_subvol count is 2, so readdir-ahead would not be very useful. There are also some consistency issues with readdir-ahead, so it would be good to disable it.

This has been updated; do we need to remount/restart the bricks? There seems to be lots of outdated and conflicting information/documentation around on the internet for glusterfs.

No; in the latest updated configuration (#3356 (comment)) it is still showing.

I had just set this prior to replying that it had been updated:

Option                                   Value                                  
------                                   -----                                  
performance.readdir-ahead                off                                    
jkroonza commented 2 years ago

This is now mostly under control, but it does still happen (we had two cases this morning, in addition to two fuse crashes on one of the nodes).

Current settings are as follows:

Volume Name: gv_mail
Type: Distributed-Replicate
Volume ID: fdb043f4-f290-4ca3-b747-8c02d917fa75
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: bagheera.interexcel.co.za:/mnt/gluster/mail-a
Brick2: uriel.interexcel.co.za:/mnt/gluster/mail-b
Brick3: uriel.interexcel.co.za:/mnt/gluster/mail-a
Brick4: bagheera.interexcel.co.za:/mnt/gluster/mail-b
Options Reconfigured:
nfs.event-threads: 8
storage.linux-io_uring: off
storage.fips-mode-rchecksum: on
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
performance.least-prio-threads: 4
performance.low-prio-threads: 32
performance.normal-prio-threads: 32
performance.high-prio-threads: 32
performance.io-thread-count: 32
diagnostics.client-log-level: CRITICAL
diagnostics.brick-log-level: WARNING
cluster.granular-entry-heal: enable
cluster.data-self-heal-algorithm: full
performance.open-behind: off
client.event-threads: 2
server.event-threads: 4
performance.cache-size: 1GB
performance.readdir-ahead: off
cluster.data-self-heal: off
cluster.entry-self-heal: off
cluster.metadata-self-heal: off
transport.address-family: inet
nfs.disable: on
performance.write-behind: off
cluster.self-heal-daemon: enable
cluster.locking-scheme: granular
performance.client-io-threads: off
config.client-threads: 4
server.outstanding-rpc-limit: 32
cluster.favorite-child-policy: ctime
config.brick-threads: 64
config.global-threading: on
performance.parallel-readdir: on
cluster.readdir-optimize: on

A few notes:

  1. io_uring performed EXCEPTIONALLY well in terms of performance, but resulted in brick crashes, and subsequent data corruption.

  2. dir_index on ext4 is a performance killer under the best of circumstances; switch it off. It looks very nice on paper, but I've yet to find a use-case where it's better than without.

  3. Client threading is now completely off; I forgot to check in top whether any of its threads was maxing a core at the time of the slow-down.

  4. When it gets over a certain point, there is no turning back. We caught one case at a latency of ~20s, removed it from the load balancer, and issued a normal kill (SIGINT) on all processes accessing the mount; the load average kept climbing, to the point where, when a touch took 1000s, we could no longer interact with the system and had to cut power and power back up.

  5. config.global-threading: on is awesome. We only use it on the bricks currently, with config.brick-threads: 64, which spawns double that number of threads, and they are actually mostly busy; we can probably get away with 32 at this point, but it definitely spikes up at times (our load is very, very variable).

  6. cluster.readdir-optimize: on and performance.parallel-readdir: on work very well. We can now get directory listings over FUSE in seconds (with 30k dentries); over NFS this took 10 minutes (and we needed 8 event threads on NFS just to cope - this was gnfs though, not nfs-ganesha).

  7. NFS cache coherency (part of the client as far as we can tell) eliminated the use of NFS for our use case since it relies on rename() operations happening, frequently not from the same host ...

jkroonza commented 2 years ago

And just now ... another spike in latency, and:

25934 root      20   0 1201888 378768   7628 R  99.9   0.6   1:50.38 glfs_fuseproc

So at this point we're going to set the readers to 2 (we previously used 4; however, with the improved readdir() performance that may no longer be required).

Now, it seems fuse uses a pipe()/socket() of sorts - is there a way to set the buffer size of that pipe/socket? This pipe filling up may be WHY we're getting the kernel trying to schedule things (as per the earlier kernel traces), which then fails somehow ... so this may well be a kernel issue as well, but we may be able to alleviate the issue from the gluster fuse proc by making the buffer that's in use larger. I know by default a pipe(2) is limited to 1MB, but that can easily be changed, and I'm quite happy pushing a 16 to 64MB buffer here if viable, unless there is some specific reason NOT to do that?

jkroonza commented 2 years ago

So ... not a kernel FUSE issue, I don't think. That's a result of "back pressure". I've been monitoring /sys/fs/fuse/connections/*/waiting, and whenever things start getting slower, this value is inevitably high. Looking at the congestion values etc., this is set to 48; I'm not sure whether that stops the kernel from sending requests to fuse at all.

A high value here, I reckon, implies that glusterfs isn't responding fast enough.
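
For reference, one way to poll that counter for a given mount (assuming, as on our kernels, that the connection directory is named after the FUSE mount's device minor number, which mountpoint -d reports):

dev=$(mountpoint -d /var/spool/mail)              # prints MAJOR:MINOR, e.g. 0:21
watch -n 5 cat "/sys/fs/fuse/connections/${dev#*:}/waiting"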

Single reader thread again, since I'm now convinced reader-thread-count>1 causes the crashes described in #3379

I've managed to trigger this now by doing readdir() loops over large folders (140k+ files).

It certainly doesn't trigger every time, but I suspect at some point something else happens in parallel that then causes gluster to freak out and stop responding (or at least, lose capacity in some way).

The kernel then starts queueing requests; very likely it also starts sending FUSE_INTERRUPT requests (since when a client disconnects, the process will attempt to terminate), which, if they aren't processed, will result in processes going into uninterruptible wait.

Either way - we need a way to track down the root cause - any pointers would be appreciated. At this point we've actioned a fair amount of risk mitigation, so whilst things slow down we rebalance load and remount filesystems somewhat automatically, so mostly clients aren't affected (except for "your connection has been interrupted" messages which some people are starting to ask about).

jkroonza commented 2 years ago

Currently we've received the following alerts:

uriel/gluster_check_mounts: touch on /var/spool/mail passed but took 33.895 seconds.
/sys/fs/fuse/connections/21/waiting : 25

Interexcel World Connection, [3 Apr 2022 22:31:25]
uriel/gluster_check_mounts: touch on /var/spool/mail passed but took 22.599 seconds.
/sys/fs/fuse/connections/21/waiting : 32

Interexcel World Connection, [3 Apr 2022 22:32:29]
uriel/gluster_check_mounts: touch on /var/spool/mail passed but took 26.350 seconds.
/sys/fs/fuse/connections/21/waiting : 32

Interexcel World Connection, [3 Apr 2022 22:33:27]
uriel/gluster_check_mounts: touch on /var/spool/mail passed but took 25.132 seconds.
/sys/fs/fuse/connections/21/waiting : 26

Interexcel World Connection, [3 Apr 2022 22:34:32]
uriel/gluster_check_mounts: touch on /var/spool/mail passed but took 30.412 seconds.
/sys/fs/fuse/connections/21/waiting : 22

Interexcel World Connection, [3 Apr 2022 22:35:33]
uriel/gluster_check_mounts: touch on /var/spool/mail passed but took 31.056 seconds.
/sys/fs/fuse/connections/21/waiting : 27

Interexcel World Connection, [3 Apr 2022 22:36:42]
uriel/gluster_check_mounts: touch on /var/spool/mail passed but took 39.885 seconds.
/sys/fs/fuse/connections/21/waiting : 34

Interexcel World Connection, [3 Apr 2022 22:37:33]
uriel/gluster_check_mounts: touch on /var/spool/mail passed but took 30.004 seconds.
/sys/fs/fuse/connections/21/waiting : 28

At this point the only two threads with high CPU consumption (>50% each) are glfs_epoll000 and glfs_epoll001.

This specific client is NOT set to use --global-threading, but it does use whichever of these are applicable:

bagheera [22:41:26] ~ # gluster volume get gv_mail all | grep thread
performance.io-thread-count              32                                     
performance.high-prio-threads            32                                     
performance.normal-prio-threads          32                                     
performance.low-prio-threads             32                                     
performance.least-prio-threads           4                                      
client.event-threads                     2                                      
server.own-thread                        (null) (DEFAULT)                       
server.event-threads                     4                                      
performance.client-io-threads            on                                     
performance.nfs.io-threads               off                                    
nfs.event-threads                        8                                      
features.signer-threads                  4                                      
cluster.shd-max-threads                  1 (DEFAULT)                            
disperse.shd-max-threads                 1 (DEFAULT)                            
glusterd.vol_count_per_thread            100                                    
config.global-threading                  on                                     
config.client-threads                    8                                      
config.brick-threads                     64

The two fuse mounts using global-threading seem mostly OK performance-wise, but their FUSE processes crash periodically.

Unfortunately I have no way (that I know of) of seeing what the ~30 outstanding requests to the FUSE process are.

I strongly *suspect* though that requests are being lost and as the waiting value increases, so does latency (without additional CPU being used).  Motivation:

1.  We issue a stop for all the mail processes
2.  Notice that a number of courier-imapd (and less frequently courier-pop3d) processes now enter D state.

The reason for these D-state processes is (as far as we can determine from reading the FUSE documentation) that reads and writes are generally treated as interruptible waits (S state); however, when a cancellation request is issued, that request is uninterruptible (if you interrupt a cancellation, what should happen?), and it then blocks until both the cancellation request is acknowledged and the original IO request is answered (probably with EINTR).

The moment the mountpoint is force-umounted, all requests are automatically interrupted with EINTR or ENOTCONN (not sure which, but new requests are definitely responded to with ENOTCONN) and released, and subsequently the kill/exit operation can complete.

Or at the very least, readdir() requests are somehow congesting the works:

getdents64(7, 0x558a9cab76a0 /* 15 entries */, 131072) = 1200 <1.477656>
getdents64(7, 0x558a9cab76a0 /* 15 entries */, 131072) = 1200 <3.065324>
getdents64(7, 0x558a9cab76a0 /* 15 entries */, 131072) = 1200 <2.325886>
getdents64(7, 0x558a9cab76a0 /* 15 entries */, 131072) = 1200 <2.170317>
getdents64(7, 0x558a9cab76a0 /* 15 entries */, 131072) = 1200 <1.876824>
getdents64(7, 0x558a9cab76a0 /* 15 entries */, 131072) = 1200 <1.912398>
getdents64(7, 0x558a9cab76a0 /* 15 entries */, 131072) = 1200 <2.208562>
getdents64(7, 0x558a9cab76a0 /* 15 entries */, 131072) = 1200 <6.190641>
getdents64(7, 0x558a9cab76a0 /* 15 entries */, 131072) = 1200 <2.920089>
getdents64(7, 0x558a9cab76a0 /* 15 entries */, 131072) = 1200 <2.274547>
getdents64(7, 0x558a9cab76a0 /* 15 entries */, 131072) = 1200 <2.366527>
getdents64(7, 0x558a9cab76a0 /* 15 entries */, 131072) = 1200 <3.454485>
getdents64(7, 0x558a9cab76a0 /* 15 entries */, 131072) = 1200 <3.044608>
getdents64(7, 0x558a9cab76a0 /* 15 entries */, 131072) = 1200 <2.543807>
getdents64(7, 0x558a9cab76a0 /* 15 entries */, 131072) = 1200 <2.931346>
getdents64(7, 0x558a9cab76a0 /* 15 entries */, 131072) = 1200 <3.179648>

That's from strace of a process during the above. Perhaps there is something that can be done to improve that situation:

cluster.readdir-optimize                 on                                     
cluster.self-heal-readdir-size           1KB (DEFAULT)                          
performance.readdir-ahead-pass-through   false (DEFAULT)                        
performance.readdir-ahead                off                                    
performance.force-readdirp               true (DEFAULT)                         
nfs.readdir-size                         (1 * 1048576ULL) (DEFAULT)             
dht.force-readdirp                       on (DEFAULT)                           
performance.parallel-readdir             on                                     
jkroonza commented 2 years ago

getdents64(10, 0x55fb2595dec0 /* 15 entries */, 131072) = 1200 <14.943427>
getdents64(10, 0x55fb2595dec0 /* 15 entries */, 131072) = 1192 <19.791461>
getdents64(10, 0x55fb2595dec0 /* 15 entries */, 131072) = 1200 <9.145474>
getdents64(10, 0x55fb2595dec0 /* 15 entries */, 131072) = 1200 <12.365239>
getdents64(10, 0x55fb2595dec0 /* 15 entries */, 131072) = 1200 <10.552720>
getdents64(10, 0x55fb2595dec0 /* 15 entries */, 131072) = 1200 <8.444900>
getdents64(10, 0x55fb2595dec0 /* 15 entries */, 131072) = 1200 <12.399588>
getdents64(10, 0x55fb2595dec0 /* 15 entries */, 131072) = 1200 <13.661078>
getdents64(10, 0x55fb2595dec0 /* 15 entries */, 131072) = 1200 <12.682923>
getdents64(10, 0x55fb2595dec0 /* 15 entries */, 131072) = 1200 <14.554080>
getdents64(10, 0x55fb2595dec0 /* 15 entries */, 131072) = 1200 <14.281686>
getdents64(10, 0x55fb2595dec0 /* 15 entries */, 131072) = 1200 <13.547868>

Mounts are running with:

/usr/sbin/glusterfs --global-threading --process-name fuse --volfile-server=localhost --volfile-id=gv_mail /var/spool/mail

And options are:

performance.iot-watchdog-secs: 2
performance.iot-pass-through: off
cluster.readdir-optimize: on
performance.parallel-readdir: on
config.global-threading: on
config.brick-threads: 64
cluster.favorite-child-policy: ctime
server.outstanding-rpc-limit: 32
config.client-threads: 16
performance.client-io-threads: off
cluster.locking-scheme: granular
cluster.self-heal-daemon: enable
performance.write-behind: off
nfs.disable: on
transport.address-family: inet
cluster.metadata-self-heal: off
cluster.entry-self-heal: off
cluster.data-self-heal: off
performance.readdir-ahead: off
performance.cache-size: 1GB
server.event-threads: 4
client.event-threads: 2
performance.open-behind: off
cluster.data-self-heal-algorithm: full
cluster.granular-entry-heal: enable
diagnostics.brick-log-level: WARNING
diagnostics.client-log-level: CRITICAL
performance.io-thread-count: 32
performance.high-prio-threads: 16
performance.normal-prio-threads: 16
performance.low-prio-threads: 16
performance.least-prio-threads: 4
diagnostics.latency-measurement: on
diagnostics.count-fop-hits: on
storage.fips-mode-rchecksum: on
storage.linux-io_uring: off
nfs.event-threads: 8

We've now set performance.iot-pass-through but we don't expect this to make a difference.

For the iot (io-threads) module, please indicate how it is decided which IO operations fall into which thread group w.r.t. priority.

mohit84 commented 2 years ago

> For the iot (io-threads) module, please indicate how it is decided which IO operations fall into which thread group w.r.t. priority.

After enabling global-threading it does not matter whether iot is enabled or disabled, because the xlator is not used by the brick process. With global-threading there is no priority concept; requests are processed in FIFO order. For io-threads specifically, the priority is decided based on the fop: there are two functions (iot_get_pri_meaning and iot_schedule) that decide the priority; all internal fops have low priority, while the lookup/open fops have high priority. For more detail you can refer to these functions.

xhernandez commented 2 years ago

> After enabling global-threading it does not matter whether iot is enabled or disabled, because the xlator is not used by the brick process.

If I'm not wrong, global-threading option by itself doesn't disable io-threads on brick side. It needs to be explicitly disabled to avoid using the io-threads pool, which could interfere with the global pool.
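If so, the explicit disable would presumably just be the pass-through option already present on this volume (hedged - going by the option name rather than verified brick-graph behaviour):

```
# Take the io-threads xlator out of the request path so it cannot compete
# with the global thread pool (assumption: pass-through is honoured on the bricks).
gluster volume set gv_mail performance.iot-pass-through on
```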

jkroonza commented 2 years ago

@xhernandez I believe you're right, based on what little of the code around this I've looked into. What is still unclear to me is which translators are running where (client vs server vs shd vs glusterd).

Here are the threads currently active (on one of the nodes that has both bricks and a fuse mount) - the names have been manipulated to strip the trailing /[0-9]+ and then counted:

  - glfs_posixhc (2): S x 2.
  - glusterfsd (1): S x 1.
  - glfs_timer (1): S x 1.
  - glfs_sigwait (1): S x 1.
  - glfs_worker (17): S x 17.
  - glfs_memsweep (1): S x 1.
  - glfs_tpw (65): S x 63, D x 2.
  - glfs_sproc (2): S x 2.
  - glfs_epoll (4): S x 4.
  - glfs_idxwrker (1): S x 1.
  - glfs_posixctxre (1): S x 1.
  - glfs_posixctxja (1): S x 1.
  - glfs_posixfsy (1): S x 1.
brick:/mnt/gluster/mail-a (pid=30784): 97 threads
  - glusterfsd (1): S x 1.
  - glfs_timer (1): S x 1.
  - glfs_sigwait (1): S x 1.
  - glfs_worker (17): S x 17.
  - glfs_memsweep (1): S x 1.
  - glfs_tpw (65): S x 62, D x 3.
  - glfs_sproc (2): S x 2.
  - glfs_epoll (4): S x 4.
  - glfs_idxwrker (1): S x 1.
  - glfs_posixctxre (1): S x 1.
  - glfs_posixhc (1): S x 1.
  - glfs_posixctxja (1): S x 1.
  - glfs_posixfsy (1): S x 1.
fuse:localhost:gv_mail (pid=8881): 44 threads
  - glusterfs (1): S x 1.
  - glfs_timer (1): S x 1.
  - glfs_sigwait (1): S x 1.
  - glfs_worker (17): S x 17.
  - glfs_memsweep (1): S x 1.
  - glfs_tpw (16): S x 16.
  - glfs_sproc (2): S x 2.
  - glfs_epoll (2): S x 2.
  - glfs_fuseproc (1): S x 1.
  - glfs_fusedlyd (1): S x 1.
  - glfs_fusenoti (1): S x 1.
glustershd (pid=7452): 27 threads
  - glfs_sproc (2): S x 2.
  - glusterfs (1): S x 1.
  - glfs_timer (1): S x 1.
  - glfs_sigwait (1): S x 1.
  - glfs_worker (17): S x 17.
  - glfs_memsweep (1): S x 1.
  - glfs_epoll (2): S x 2.
  - glfs_shdheal (2): S x 2.
glusterd (pid=4944): 25 threads
  - glfs_sproc (2): S x 2.
  - glusterd (1): S x 1.
  - glfs_timer (1): S x 1.
  - glfs_sigwait (1): S x 1.
  - glfs_worker (17): S x 17.
  - glfs_memsweep (1): S x 1.
  - glfs_gdhooks (1): S x 1.
  - glfs_epoll (1): S x 1.

I'm not sure what's what, but I find it odd that all processes, including glusterd, have 17 workers (17 doesn't even match any configuration value anywhere that I can find):

bagheera [13:31:07] ~ # gluster volume get gv_mail all | grep 17
bagheera [13:32:26] ~ # performance.high-prio-threads            16                                     
performance.normal-prio-threads          16                                     
performance.low-prio-threads             16                                     
network.inode-lru-limit                  16384 (DEFAULT)                        
nfs.outstanding-rpc-limit                16 (DEFAULT)                           
features.shard-lru-limit                 16384 (DEFAULT)                        
config.client-threads                    16                                     

So there are a few with a setting of 16 - it might be that I'm accidentally including a "master" thread in the counts, thus getting 17 ... options reconfigured:

performance.iot-watchdog-secs: 2
performance.iot-pass-through: on
cluster.readdir-optimize: on
performance.parallel-readdir: on
config.global-threading: on
config.brick-threads: 64
cluster.favorite-child-policy: ctime
storage.linux-io_uring: off
storage.fips-mode-rchecksum: on
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
performance.least-prio-threads: 4
performance.low-prio-threads: 16
performance.normal-prio-threads: 16
performance.high-prio-threads: 16
performance.io-thread-count: 32
cluster.granular-entry-heal: enable
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.self-heal-daemon: enable
performance.write-behind: off
nfs.disable: on
transport.address-family: inet
cluster.metadata-self-heal: off
cluster.entry-self-heal: off
cluster.data-self-heal: off
performance.readdir-ahead: off
performance.cache-size: 1GB
server.event-threads: 4
client.event-threads: 2
performance.open-behind: off
diagnostics.brick-log-level: WARNING
diagnostics.client-log-level: CRITICAL
performance.client-io-threads: off
config.client-threads: 16
server.outstanding-rpc-limit: 32
nfs.event-threads: 8

No longer using nfs ...

From what we can tell so far (I'm busy deploying a test client on another server), whenever readdir() is happening, if there are multiple readdir() (getdents64) sequences ongoing against the mountpoint, things get confused; specifically we suspect:

  1. for some reason we cannot fathom, the readdir() process slows down.
  2. We suspect FUSE requests get lost and the kernel basically thinks there are still outstanding requests to the FUSE process (/sys/fs/fuse/connections/*/waiting goes up and up).
  3. the kernel stops queueing requests when the max_background=64 value (in the same sysfs folder) is exceeded, and read-ahead type operations stop when congestion_threshold=48 is reached; as I understand it, once the waiting value exceeds these limits the kernel stops queueing to the process except for uninterruptible requests (from what I've deduced, most requests are background requests, including read/write).

The outstanding requests climb QUICKLY in our case - literally in a matter of seconds it'll go from 2-5 to >48 ... and then it starts a slow climb to 100+. Once it goes over 64 no further IO happens at all; over 48 things are SLOW, as in REALLY SLOW. We suspect this means stuff is not overly happy.
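In case it's useful, this is roughly how we watch the backlog against those limits (the sysfs file names are the ones mentioned above; connection ids differ per mount):

```
# Sample the FUSE backlog and its limits once a second for every connection.
while :; do
    for c in /sys/fs/fuse/connections/*; do
        printf '%s %s waiting=%s max_background=%s congestion_threshold=%s\n' \
            "$(date +%s)" "$c" \
            "$(cat "$c/waiting")" "$(cat "$c/max_background")" "$(cat "$c/congestion_threshold")"
    done
    sleep 1
done
```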

Once the filesystem is re-mounted all returns to normal. Sometimes we have to kill (-9) a fuse process that doesn't die with umount -fl (which is also the only way to umount the filesystem at this point).

We're 99.999% certain this relates to readdir() somehow, as keeping file/folder counts down definitely reduces the frequency of the occurrence.

Is it possible to enable full debug logging (not even sure how verbose that would be) on a single mount only?
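One thing we may try is remounting just this one client with its own log level, assuming the mount.glusterfs helper on this platform passes log-level/log-file through to the client process (untested sketch):

```
# Hedged example: raise verbosity for this single mount only, leaving the
# volume-wide diagnostics.client-log-level untouched.
mount -t glusterfs \
    -o log-level=DEBUG,log-file=/var/log/glusterfs/mail-debug.log \
    localhost:/gv_mail /var/spool/mail
```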

jkroonza commented 2 years ago

I cooked a reproduction tool:

rdtorture.txt - rename to rdtorture.c and compile with "gcc -W -Wall -Werror -O2 -lpthread -o rdtorture rdtorture.c".

This didn't help at all to get a crash ... or anywhere near the slowdown we see in production. So the theory is off. I have however learned a few things:

  1. For getdents64 calls to the same folder/directory: the kernel serializes towards the glusterfs fuse process (/sys/fs/fuse/connections/??/waiting remains 1, no matter the concurrency used to readdir the folder). Not sure if this is libfuse negotiating this with the kernel, or outright a kernel restriction - and if kernel, at what layer (i.e. is this FUSE specific, or a general FS thing). Looking at fs/fuse/readdir.c in the kernel sources, function fuse_readdir(), it looks like it may well be fuse specific (the entire process runs with file->private_data->readdir.lock taken, so if the kernel arranges for multiple file descriptors to reference the same struct file, then it's a fuse restriction in the kernel, on a per-folder basis). I think we can live with this since courier-imap doesn't normally perform operations without holding tmp/courier.lock, which itself serializes things.

  2. glusterfs invariably results in getdents64 returning 15 entries (1200 bytes) to userspace at a time, even when the calling process provides larger, say 4 or 8KB, buffers. It's unclear whether this comes from the kernel or from the glusterfs process (what stops the glusterfs process from handing back, say, 8KB of data at a time?) - there is possible room for improvement here if this is on the glusterfs/libfuse side.

  3. I can't find a way of setting client options on a per client basis. It makes sense in the sense that not all clients are equal, and for some uses of the same data other settings may be relevant, eg, we have separate hosts for incoming email that for POP3/IMAP. Their access patterns into the mail store is very, very different. For me right now, I just want a different log level.

  4. --dump-fuse= could be beneficial. I was thinking that a fusedump (similar to tcpdump) could be useful here, and found this now. Will enable once I can reproduce, and after testing performance impact (if any) I'm quite happy to consider enabling on production if this could help. https://github.com/gluster/parsefuse refers.

  5. glusterfs FUSE tpw threads works much, much harder than the rdtorture application :). This is to be expected since rdtorture is a relatively simple program all things considered.

rdtorture usage:

rdtorture [--nthreads X] [--nsecs Y] path [...]

--nthreads defaults to 1. --nsecs defaults to infinity. You can also stop the process with ^C: the first one will exit gracefully (i.e. wait for all current threads to finish and output the histogram), the second one will hard-exit().

Below, when I reference one folder, it's a folder with 39 sub-folders at most two layers deep, 138k files in one of those folders, and just over 140k files total. No other processes were touching the mount during these tests.

Some basic stats for getdents64() for FUSE client for devel branch:

./rdtorture --nthreads 1 --nsecs 600 /single/folder
getdents64:   610318.162ms /   308383 calls (avg=    1.979ms, worst=82.410ms):
<     128ms:       3 (  0.00 %) 
<      64ms:      74 (  0.02 %) 
<      32ms:     176 (  0.06 %) 
<      16ms:     185 (  0.06 %) 
<       8ms:     307 (  0.10 %) 
<       4ms:  128511 ( 41.67 %) ********************
<       2ms:  167470 ( 54.31 %) ***************************
<       1ms:   11657 (  3.78 %) *

./rdtorture --nthreads 2 --nsecs 600 /single/folder
getdents64:  1258049.899ms /   346932 calls (avg=    3.626ms, worst=102.994ms):
<     128ms:       2 (  0.00 %) 
<      64ms:      87 (  0.03 %) 
<      32ms:     219 (  0.06 %) 
<      16ms:     485 (  0.14 %) 
<       8ms:   98950 ( 28.52 %) **************
<       4ms:  176164 ( 50.78 %) *************************
<       2ms:   58192 ( 16.77 %) ********
<       1ms:   12833 (  3.70 %) *

./rdtorture --nthreads 10 --nsecs 600 /single/folder
getdents64: 104219871.948ms /   287553 calls (avg=  362.437ms, worst=466226.550ms):
<  524288ms:       5 (  0.00 %) 
<  262144ms:       0 (  0.00 %) 
<  131072ms:       0 (  0.00 %) 
<   65536ms:       6 (  0.00 %) 
<   32768ms:      21 (  0.01 %) 
<   16384ms:      65 (  0.02 %) 
<    8192ms:     478 (  0.17 %) 
<    4096ms:   13484 (  4.69 %) **
<    2048ms:   32021 ( 11.14 %) *****
<    1024ms:   12037 (  4.19 %) **
<     512ms:    4206 (  1.46 %) 
<     256ms:    2000 (  0.70 %) 
<     128ms:     881 (  0.31 %) 
<      64ms:    1519 (  0.53 %) 
<      32ms:  117609 ( 40.90 %) ********************
<      16ms:   79779 ( 27.74 %) *************
<       8ms:    9285 (  3.23 %) *
<       4ms:     873 (  0.30 %) 
<       2ms:    6298 (  2.19 %) *
<       1ms:    6986 (  2.43 %) *

You can see how the 2-thread case on average took roughly twice as long per call as the single-thread case, and the 10-thread case on the order of 100x as long as the 2-thread case. That is scary, but this is definitely a kernel issue (the per-directory serialization described above). Fortunately I don't think this is too relevant for our use-case.

The 10-thread case failed at some point due to ENOTCONN; it seems the brick processes first became unreachable, and then dmesg reported the following D-state processes:

Apr  6 19:25:13 arthur kernel: INFO: task glfs_fusenoti:4442 blocked for more than 122 seconds.
Apr  6 19:25:13 arthur kernel:       Tainted: G        W  O      5.8.14-uls #1
Apr  6 19:25:13 arthur kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr  6 19:25:13 arthur kernel: glfs_fusenoti   D    0  4442      1 0x00000000
Apr  6 19:25:13 arthur kernel: Call Trace:
Apr  6 19:25:13 arthur kernel:  __schedule+0x281/0x5f0
Apr  6 19:25:13 arthur kernel:  ? rwsem_optimistic_spin+0xaf/0x1e0
Apr  6 19:25:13 arthur kernel:  schedule+0x45/0xb0
Apr  6 19:25:13 arthur kernel:  rwsem_down_write_slowpath+0x1dd/0x420
Apr  6 19:25:13 arthur kernel:  fuse_reverse_inval_entry+0x46/0x1e0
Apr  6 19:25:13 arthur kernel:  fuse_dev_do_write+0x42a/0xd10
Apr  6 19:25:13 arthur kernel:  ? asm_sysvec_apic_timer_interrupt+0x12/0x20
Apr  6 19:25:13 arthur kernel:  fuse_dev_write+0x41/0x60
Apr  6 19:25:13 arthur kernel:  do_iter_readv_writev+0x12a/0x190
Apr  6 19:25:13 arthur kernel:  do_iter_write+0x78/0x190
Apr  6 19:25:13 arthur kernel:  vfs_writev+0x7b/0xd0
Apr  6 19:25:13 arthur kernel:  ? __hrtimer_run_queues+0xec/0x170
Apr  6 19:25:13 arthur kernel:  ? recalibrate_cpu_khz+0x10/0x10
Apr  6 19:25:13 arthur kernel:  ? ktime_get+0x33/0x90
Apr  6 19:25:13 arthur kernel:  ? lapic_next_event+0x20/0x20
Apr  6 19:25:13 arthur kernel:  ? clockevents_program_event+0x8f/0xf0
Apr  6 19:25:13 arthur kernel:  ? hrtimer_interrupt+0x133/0x220
Apr  6 19:25:13 arthur kernel:  do_writev+0x52/0xe0
Apr  6 19:25:13 arthur kernel:  do_syscall_64+0x3e/0x70
Apr  6 19:25:13 arthur kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Apr  6 19:25:13 arthur kernel: RIP: 0033:0x7f8c107ae50d
Apr  6 19:25:13 arthur kernel: Code: Bad RIP value.
Apr  6 19:25:13 arthur kernel: RSP: 002b:00007f8bab7fcfa0 EFLAGS: 00000293 ORIG_RAX: 0000000000000014
Apr  6 19:25:13 arthur kernel: RAX: ffffffffffffffda RBX: 0000562ad856f318 RCX: 00007f8c107ae50d
Apr  6 19:25:13 arthur kernel: RDX: 0000000000000001 RSI: 00007f8bab7fd000 RDI: 0000000000000008
Apr  6 19:25:13 arthur kernel: RBP: 0000562ad857b490 R08: 0000000000000000 R09: 0000000000000058
Apr  6 19:25:13 arthur kernel: R10: 0000000000000070 R11: 0000000000000293 R12: 0000562ad857b4d0
Apr  6 19:25:13 arthur kernel: R13: 00007f8bd4ebcb08 R14: 00007f8bab7fd000 R15: 00007f8bd4ebcae8
Apr  6 19:25:13 arthur kernel: INFO: task rdtorture:21000 blocked for more than 122 seconds.
Apr  6 19:25:13 arthur kernel:       Tainted: G        W  O      5.8.14-uls #1
Apr  6 19:25:13 arthur kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr  6 19:25:13 arthur kernel: rdtorture       D    0 21000  15238 0x00000000
Apr  6 19:25:13 arthur kernel: Call Trace:
Apr  6 19:25:13 arthur kernel:  __schedule+0x281/0x5f0
Apr  6 19:25:13 arthur kernel:  schedule+0x45/0xb0
Apr  6 19:25:13 arthur kernel:  schedule_preempt_disabled+0x5/0x10
Apr  6 19:25:13 arthur kernel:  __mutex_lock.isra.0+0x247/0x4a0
Apr  6 19:25:13 arthur kernel:  ? __alloc_pages_nodemask+0xe2/0x180
Apr  6 19:25:13 arthur kernel:  fuse_lock_inode+0x23/0x30
Apr  6 19:25:13 arthur kernel:  fuse_readdir_uncached+0x150/0x6e0
Apr  6 19:25:13 arthur kernel:  fuse_readdir+0x12a/0x5a0
Apr  6 19:25:13 arthur kernel:  iterate_dir+0x14e/0x160
Apr  6 19:25:13 arthur kernel:  ksys_getdents64+0x64/0xf0
Apr  6 19:25:13 arthur kernel:  ? do_vfs_ioctl.cold+0x42/0x42
Apr  6 19:25:13 arthur kernel:  __x64_sys_getdents64+0x11/0x20
Apr  6 19:25:13 arthur kernel:  do_syscall_64+0x3e/0x70
Apr  6 19:25:13 arthur kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Apr  6 19:25:13 arthur kernel: RIP: 0033:0x7f74616e4f99
Apr  6 19:25:13 arthur kernel: Code: Bad RIP value.
Apr  6 19:25:13 arthur kernel: RSP: 002b:00007f74615eaa38 EFLAGS: 00000246 ORIG_RAX: 00000000000000d9
Apr  6 19:25:13 arthur kernel: RAX: ffffffffffffffda RBX: 00000000000005b0 RCX: 00007f74616e4f99
Apr  6 19:25:13 arthur kernel: RDX: 0000000000001000 RSI: 00007f74615eab20 RDI: 0000000000000017
Apr  6 19:25:13 arthur kernel: RBP: 00000000000005b0 R08: 00007f74615ea8c5 R09: 00000000000005b0
Apr  6 19:25:13 arthur kernel: R10: 00000000882bb276 R11: 0000000000000246 R12: 00007f74615eab20
Apr  6 19:25:13 arthur kernel: R13: 00007f74615eddd8 R14: 00007f74615eb068 R15: 00007f74615eaa70
Apr  6 19:25:13 arthur kernel: INFO: task rdtorture:21001 blocked for more than 122 seconds.
Apr  6 19:25:13 arthur kernel:       Tainted: G        W  O      5.8.14-uls #1
Apr  6 19:25:13 arthur kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr  6 19:25:13 arthur kernel: rdtorture       D    0 21001  15238 0x00000000
Apr  6 19:25:13 arthur kernel: Call Trace:
Apr  6 19:25:13 arthur kernel:  __schedule+0x281/0x5f0
Apr  6 19:25:13 arthur kernel:  schedule+0x45/0xb0
Apr  6 19:25:13 arthur kernel:  schedule_preempt_disabled+0x5/0x10
Apr  6 19:25:13 arthur kernel:  __mutex_lock.isra.0+0x247/0x4a0
Apr  6 19:25:13 arthur kernel:  ? __alloc_pages_nodemask+0xe2/0x180
Apr  6 19:25:13 arthur kernel:  fuse_lock_inode+0x23/0x30
Apr  6 19:25:13 arthur kernel:  fuse_readdir_uncached+0x150/0x6e0
Apr  6 19:25:13 arthur kernel:  fuse_readdir+0x12a/0x5a0
Apr  6 19:25:13 arthur kernel:  iterate_dir+0x14e/0x160
Apr  6 19:25:13 arthur kernel:  ksys_getdents64+0x64/0xf0
Apr  6 19:25:13 arthur kernel:  ? do_vfs_ioctl.cold+0x42/0x42
Apr  6 19:25:13 arthur kernel:  __x64_sys_getdents64+0x11/0x20
Apr  6 19:25:13 arthur kernel:  do_syscall_64+0x3e/0x70
Apr  6 19:25:13 arthur kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Apr  6 19:25:13 arthur kernel: RIP: 0033:0x7f74616e4f99
Apr  6 19:25:13 arthur kernel: Code: Bad RIP value.
Apr  6 19:25:13 arthur kernel: RSP: 002b:00007f745e5e4a38 EFLAGS: 00000246 ORIG_RAX: 00000000000000d9
Apr  6 19:25:13 arthur kernel: RAX: ffffffffffffffda RBX: 00000000000005b0 RCX: 00007f74616e4f99
Apr  6 19:25:13 arthur kernel: RDX: 0000000000001000 RSI: 00007f745e5e4b20 RDI: 0000000000000018
Apr  6 19:25:13 arthur kernel: RBP: 00000000000005b0 R08: 00007f745e5e48c5 R09: 00000000000005b0
Apr  6 19:25:13 arthur kernel: R10: 00000000882bb736 R11: 0000000000000246 R12: 00007f745e5e4b20
Apr  6 19:25:13 arthur kernel: R13: 00007f745e5e7dd8 R14: 00007f745e5e5068 R15: 00007f745e5e4a70
Apr  6 19:25:13 arthur kernel: INFO: task rdtorture:21003 blocked for more than 122 seconds.
Apr  6 19:25:13 arthur kernel:       Tainted: G        W  O      5.8.14-uls #1
Apr  6 19:25:13 arthur kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr  6 19:25:13 arthur kernel: rdtorture       D    0 21003  15238 0x00000000
Apr  6 19:25:13 arthur kernel: Call Trace:
Apr  6 19:25:13 arthur kernel:  __schedule+0x281/0x5f0
Apr  6 19:25:13 arthur kernel:  schedule+0x45/0xb0
Apr  6 19:25:13 arthur kernel:  schedule_preempt_disabled+0x5/0x10
Apr  6 19:25:13 arthur kernel:  __mutex_lock.isra.0+0x247/0x4a0
Apr  6 19:25:13 arthur kernel:  ? __alloc_pages_nodemask+0xe2/0x180
Apr  6 19:25:13 arthur kernel:  fuse_lock_inode+0x23/0x30
Apr  6 19:25:13 arthur kernel:  fuse_readdir_uncached+0x150/0x6e0
Apr  6 19:25:13 arthur kernel:  fuse_readdir+0x12a/0x5a0
Apr  6 19:25:13 arthur kernel:  iterate_dir+0x14e/0x160
Apr  6 19:25:13 arthur kernel:  ksys_getdents64+0x64/0xf0
Apr  6 19:25:13 arthur kernel:  ? do_vfs_ioctl.cold+0x42/0x42
Apr  6 19:25:13 arthur kernel:  __x64_sys_getdents64+0x11/0x20
Apr  6 19:25:13 arthur kernel:  do_syscall_64+0x3e/0x70
Apr  6 19:25:13 arthur kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Apr  6 19:25:13 arthur kernel: RIP: 0033:0x7f74616e4f99
Apr  6 19:25:13 arthur kernel: Code: Bad RIP value.
Apr  6 19:25:13 arthur kernel: RSP: 002b:00007f745ede5a38 EFLAGS: 00000246 ORIG_RAX: 00000000000000d9
Apr  6 19:25:13 arthur kernel: RAX: ffffffffffffffda RBX: 00000000000005b0 RCX: 00007f74616e4f99
Apr  6 19:25:13 arthur kernel: RDX: 0000000000001000 RSI: 00007f745ede5b20 RDI: 0000000000000019
Apr  6 19:25:13 arthur kernel: RBP: 00000000000005b0 R08: 00007f745ede58c5 R09: 00000000000005b0
Apr  6 19:25:13 arthur kernel: R10: 00000000882bb7c0 R11: 0000000000000246 R12: 00007f745ede5b20
Apr  6 19:25:13 arthur kernel: R13: 00007f745ede8dd8 R14: 00007f745ede6068 R15: 00007f745ede5a70
Apr  6 19:25:13 arthur kernel: INFO: task rdtorture:21004 blocked for more than 122 seconds.
Apr  6 19:25:13 arthur kernel:       Tainted: G        W  O      5.8.14-uls #1
Apr  6 19:25:13 arthur kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr  6 19:25:13 arthur kernel: rdtorture       D    0 21004  15238 0x00000000
Apr  6 19:25:13 arthur kernel: Call Trace:
Apr  6 19:25:13 arthur kernel:  __schedule+0x281/0x5f0
Apr  6 19:25:13 arthur kernel:  schedule+0x45/0xb0
Apr  6 19:25:13 arthur kernel:  schedule_preempt_disabled+0x5/0x10
Apr  6 19:25:13 arthur kernel:  __mutex_lock.isra.0+0x247/0x4a0
Apr  6 19:25:13 arthur kernel:  ? __alloc_pages_nodemask+0xe2/0x180
Apr  6 19:25:13 arthur kernel:  fuse_lock_inode+0x23/0x30
Apr  6 19:25:13 arthur kernel:  fuse_readdir_uncached+0x150/0x6e0
Apr  6 19:25:13 arthur kernel:  fuse_readdir+0x12a/0x5a0
Apr  6 19:25:13 arthur kernel:  iterate_dir+0x14e/0x160
Apr  6 19:25:13 arthur kernel:  ksys_getdents64+0x64/0xf0
Apr  6 19:25:13 arthur kernel:  ? do_vfs_ioctl.cold+0x42/0x42
Apr  6 19:25:13 arthur kernel:  __x64_sys_getdents64+0x11/0x20
Apr  6 19:25:13 arthur kernel:  do_syscall_64+0x3e/0x70
Apr  6 19:25:13 arthur kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Apr  6 19:25:13 arthur kernel: RIP: 0033:0x7f74616e4f99
Apr  6 19:25:13 arthur kernel: Code: Bad RIP value.
Apr  6 19:25:13 arthur kernel: RSP: 002b:00007f74605e8a38 EFLAGS: 00000246 ORIG_RAX: 00000000000000d9
Apr  6 19:25:13 arthur kernel: RAX: ffffffffffffffda RBX: 00000000000005b0 RCX: 00007f74616e4f99
Apr  6 19:25:13 arthur kernel: RDX: 0000000000001000 RSI: 00007f74605e8b20 RDI: 000000000000001b
Apr  6 19:25:13 arthur kernel: RBP: 00000000000005b0 R08: 00007f74605e88c5 R09: 00000000000005b0
Apr  6 19:25:13 arthur kernel: R10: 00000000882bb4c2 R11: 0000000000000246 R12: 00007f74605e8b20
Apr  6 19:25:13 arthur kernel: R13: 00007f74605ebdd8 R14: 00007f74605e9068 R15: 00007f74605e8a70

Unfortunately, when I reinstalled to v10.1 just now I seem to have lost the FUSE client logs, but essentially they just reported 4 x lost connections due to 42s of inactivity from the bricks. No other systems reported outages during that time, so it was something specific to arthur.

For the 1, 2, 10 and 40-thread cases against almost 3500 independent base folders (10m files overall spread between them), things look fairly normal for all concurrencies (I suspect the 10-thread case most closely represents our actual use-case for the most part):

./rdtorture --nthreads 1 --nsecs 600 /multiple /folder
getdents64:   561940.566ms /    39240 calls (avg=   14.320ms, worst=6961.682ms):
<    8192ms:       3 (  0.01 %) 
<    4096ms:       6 (  0.02 %) 
<    2048ms:      91 (  0.23 %) 
<    1024ms:     144 (  0.37 %) 
<     512ms:     115 (  0.29 %) 
<     256ms:     284 (  0.72 %) 
<     128ms:     766 (  1.95 %) 
<      64ms:    1265 (  3.22 %) *
<      32ms:    1554 (  3.96 %) *
<      16ms:    1421 (  3.62 %) *
<       8ms:    1263 (  3.22 %) *
<       4ms:    4227 ( 10.77 %) *****
<       2ms:    7120 ( 18.14 %) *********
<       1ms:   20981 ( 53.47 %) **************************

./rdtorture --nthreads 2 --nsecs 600 /multiple /folder
getdents64:  1117400.397ms /    79748 calls (avg=   14.011ms, worst=11786.990ms):
<   16384ms:       2 (  0.00 %) 
<    8192ms:      25 (  0.03 %) 
<    4096ms:      76 (  0.10 %) 
<    2048ms:      86 (  0.11 %) 
<    1024ms:      96 (  0.12 %) 
<     512ms:     195 (  0.24 %) 
<     256ms:     782 (  0.98 %) 
<     128ms:    1441 (  1.81 %) 
<      64ms:    1964 (  2.46 %) *
<      32ms:    2045 (  2.56 %) *
<      16ms:    1504 (  1.89 %) 
<       8ms:    1085 (  1.36 %) 
<       4ms:    7510 (  9.42 %) ****
<       2ms:   20748 ( 26.02 %) *************
<       1ms:   42189 ( 52.90 %) **************************

./rdtorture --nthreads 10 --nsecs 600 /multiple /folder
getdents64:  5874896.779ms /   154619 calls (avg=   37.995ms, worst=29609.416ms):
<   32768ms:      27 (  0.02 %) 
<   16384ms:     138 (  0.09 %) 
<    8192ms:      96 (  0.06 %) 
<    4096ms:     120 (  0.08 %) 
<    2048ms:     293 (  0.19 %) 
<    1024ms:    1018 (  0.66 %) 
<     512ms:    2184 (  1.41 %) 
<     256ms:    2333 (  1.51 %) 
<     128ms:    2037 (  1.32 %) 
<      64ms:    2422 (  1.57 %) 
<      32ms:    2090 (  1.35 %) 
<      16ms:    1645 (  1.06 %) 
<       8ms:    2222 (  1.44 %) 
<       4ms:   25680 ( 16.61 %) ********
<       2ms:   37361 ( 24.16 %) ************
<       1ms:   74953 ( 48.48 %) ************************

./rdtorture --nthreads 40 --nsecs 600 /multiple /folder
getdents64: 19816864.079ms /   180476 calls (avg=  109.803ms, worst=183587.129ms):
<  262144ms:       3 (  0.00 %) 
<  131072ms:      28 (  0.02 %) 
<   65536ms:      21 (  0.01 %) 
<   32768ms:     146 (  0.08 %) 
<   16384ms:     144 (  0.08 %) 
<    8192ms:     187 (  0.10 %) 
<    4096ms:     957 (  0.53 %) 
<    2048ms:    2037 (  1.13 %) 
<    1024ms:    2355 (  1.30 %) 
<     512ms:    2580 (  1.43 %) 
<     256ms:    2183 (  1.21 %) 
<     128ms:    2109 (  1.17 %) 
<      64ms:    2282 (  1.26 %) 
<      32ms:    2880 (  1.60 %) 
<      16ms:    8476 (  4.70 %) **
<       8ms:   52906 ( 29.31 %) **************
<       4ms:   42927 ( 23.79 %) ***********
<       2ms:   36691 ( 20.33 %) **********
<       1ms:   21564 ( 11.95 %) *****

Except for serious outliers everything here seems fair, and if this is the norm, I'm willing to deal with the sporadic timeouts the above slow cases will cause (most clients disconnect and retry after 60s, which can be reconfigured, but users complain long before that). The worst case above is about 4m22s, which will (given that it represents only 15 entries) result in problems; however, a concurrency of 40 is, from what I've seen, not our norm under normal operation (perhaps that's exactly the issue, and we should just open the taps by doubling the congestion_threshold and max_background values under /sys/fs/fuse/connections/??/ ?).
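For the record, that "open the taps" experiment would presumably amount to something like this (connection id and values are illustrative; these files appear writable by root on the kernels we run):

```
# Double the FUSE background/congestion limits for one connection; match the
# connection id against the mount first via /sys/fs/fuse/connections/*.
conn=/sys/fs/fuse/connections/23
echo 128 > "$conn/max_background"
echo  96 > "$conn/congestion_threshold"
```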

Using the v10.1 we get the following:

./rdtorture --nthreads 1 --nsecs 600 /single
getdents64:   611757.321ms /   308456 calls (avg=    1.983ms, worst=68.513ms):
<     128ms:       1 (  0.00 %) 
<      64ms:      31 (  0.01 %) 
<      32ms:     178 (  0.06 %) 
<      16ms:     135 (  0.04 %) 
<       8ms:     411 (  0.13 %) 
<       4ms:  165901 ( 53.78 %) **************************
<       2ms:  129966 ( 42.13 %) *********************
<       1ms:   11833 (  3.84 %) *

./rdtorture --nthreads 2 --nsecs 600 /single
getdents64:  1263641.093ms /   347004 calls (avg=    3.641ms, worst=58.997ms):
<      64ms:      57 (  0.02 %) 
<      32ms:     229 (  0.07 %) 
<      16ms:     431 (  0.12 %) 
<       8ms:   99541 ( 28.69 %) **************
<       4ms:  178065 ( 51.31 %) *************************
<       2ms:   55834 ( 16.09 %) ********
<       1ms:   12847 (  3.70 %) *

./rdtorture --nthreads 10 --nsecs 600 /single
getdents64: 79331212.417ms /   288124 calls (avg=  275.337ms, worst=195434.748ms):
<  262144ms:       9 (  0.00 %) 
<  131072ms:       0 (  0.00 %) 
<   65536ms:       8 (  0.00 %) 
<   32768ms:      36 (  0.01 %) 
<   16384ms:      83 (  0.03 %) 
<    8192ms:     669 (  0.23 %) 
<    4096ms:   12005 (  4.17 %) **
<    2048ms:   15890 (  5.51 %) **
<    1024ms:    9853 (  3.42 %) *
<     512ms:    9274 (  3.22 %) *
<     256ms:    3565 (  1.24 %) 
<     128ms:    1091 (  0.38 %) 
<      64ms:    3428 (  1.19 %) 
<      32ms:  125394 ( 43.52 %) *********************
<      16ms:   78037 ( 27.08 %) *************
<       8ms:    8177 (  2.84 %) *
<       4ms:    1801 (  0.63 %) 
<       2ms:   11817 (  4.10 %) **
<       1ms:    6987 (  2.42 %) *

./rdtorture --nthreads 40 --nsecs 600 /single
getdents64: 25196111.896ms /   385600 calls (avg=   65.342ms, worst=220.514ms):
<     256ms:     243 (  0.06 %) 
<     128ms:  270245 ( 70.08 %) ***********************************
<      64ms:   41784 ( 10.84 %) *****
<      32ms:    1724 (  0.45 %) 
<      16ms:    1138 (  0.30 %) 
<       8ms:    1589 (  0.41 %) 
<       4ms:    3548 (  0.92 %) 
<       2ms:   53825 ( 13.96 %) ******
<       1ms:   11504 (  2.98 %) *

This actually looks a lot better than the devel branch, but it was also run much later at night, so I expect the bricks were a lot quieter. For the many-more-folders case:

./rdtorture --nthreads 1 --nsecs 600 /more /than /one
getdents64:   501743.806ms /   125874 calls (avg=    3.986ms, worst=5974.078ms):
<    8192ms:       4 (  0.00 %) 
<    4096ms:       7 (  0.01 %) 
<    2048ms:      36 (  0.03 %) 
<    1024ms:     108 (  0.09 %) 
<     512ms:      87 (  0.07 %) 
<     256ms:     128 (  0.10 %) 
<     128ms:     458 (  0.36 %) 
<      64ms:    1025 (  0.81 %) 
<      32ms:    1660 (  1.32 %) 
<      16ms:    1448 (  1.15 %) 
<       8ms:    1092 (  0.87 %) 
<       4ms:   24290 ( 19.30 %) *********
<       2ms:   31528 ( 25.05 %) ************
<       1ms:   64003 ( 50.85 %) *************************

./rdtorture --nthreads 2 --nsecs 600 /more /than /one
getdents64:  1054552.759ms /   239113 calls (avg=    4.410ms, worst=6938.351ms):
<    8192ms:      12 (  0.01 %) 
<    4096ms:      42 (  0.02 %) 
<    2048ms:      72 (  0.03 %) 
<    1024ms:     103 (  0.04 %) 
<     512ms:     173 (  0.07 %) 
<     256ms:     551 (  0.23 %) 
<     128ms:    1141 (  0.48 %) 
<      64ms:    1806 (  0.76 %) 
<      32ms:    3176 (  1.33 %) 
<      16ms:    2663 (  1.11 %) 
<       8ms:    2085 (  0.87 %) 
<       4ms:   23822 (  9.96 %) ****
<       2ms:   79901 ( 33.42 %) ****************
<       1ms:  123566 ( 51.68 %) *************************

./rdtorture --nthreads 10 --nsecs 600 /more /than /one
getdents64:  5994154.233ms /   330316 calls (avg=   18.146ms, worst=35930.287ms):
<   65536ms:       1 (  0.00 %) 
<   32768ms:      29 (  0.01 %) 
<   16384ms:     123 (  0.04 %) 
<    8192ms:     109 (  0.03 %) 
<    4096ms:     113 (  0.03 %) 
<    2048ms:     273 (  0.08 %) 
<    1024ms:    1190 (  0.36 %) 
<     512ms:    1732 (  0.52 %) 
<     256ms:    1881 (  0.57 %) 
<     128ms:    1793 (  0.54 %) 
<      64ms:    2830 (  0.86 %) 
<      32ms:    3503 (  1.06 %) 
<      16ms:    3202 (  0.97 %) 
<       8ms:    5095 (  1.54 %) 
<       4ms:   61587 ( 18.64 %) *********
<       2ms:   84484 ( 25.58 %) ************
<       1ms:  162371 ( 49.16 %) ************************

./rdtorture --nthreads 40 --nsecs 600 /more /than /one
getdents64: 46510364.872ms /    87482 calls (avg=  531.656ms, worst=461651.669ms):
<  524288ms:      10 (  0.01 %) 
<  262144ms:       5 (  0.01 %) 
<  131072ms:       6 (  0.01 %) 
<   65536ms:      46 (  0.05 %) 
<   32768ms:      22 (  0.03 %) 
<   16384ms:      66 (  0.08 %) 
<    8192ms:    1162 (  1.33 %) 
<    4096ms:    6925 (  7.92 %) ***
<    2048ms:    6236 (  7.13 %) ***
<    1024ms:    2480 (  2.83 %) *
<     512ms:     981 (  1.12 %) 
<     256ms:     570 (  0.65 %) 
<     128ms:     397 (  0.45 %) 
<      64ms:     486 (  0.56 %) 
<      32ms:     707 (  0.81 %) 
<      16ms:    2122 (  2.43 %) *
<       8ms:   22932 ( 26.21 %) *************
<       4ms:   18175 ( 20.78 %) **********
<       2ms:   16261 ( 18.59 %) *********
<       1ms:    7893 (  9.02 %) ****

So with isolated open(), readdir/getdents64(), openat() and close() I don't see the badness nearly as badly. I'm contemplating flipping the histogram onto its long side and also outputting (but not graphing) the cumulative aggregate, e.g. (<1ms: 9.02 % / 9.02 %, <2ms: 18.59 % / 27.61 %, and so on).

Currently my thinking is that renames / removes might be causing problems for readdir, so I want to try and construct a test case completely off the production systems ... as things stand, I doubt it is readdir() in isolation ... but given what we've seen with strace on the production nodes I'm still convinced it's involved somehow.
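The rough shape of the test I have in mind (sketch only - the path and duration are placeholders, and it deliberately mimics the maildir tmp/ -> new/ rename pattern):

```
# One writer doing maildir-style renames, one reader looping over readdir,
# both against a scratch directory on a test-only mount.
dir=/mnt/gv_test/scratch          # placeholder test mount
mkdir -p "$dir/tmp" "$dir/new"
( while :; do
      f="$(date +%s.%N).$$"
      touch "$dir/tmp/$f" && mv "$dir/tmp/$f" "$dir/new/$f"
  done ) &
writer=$!
( while :; do ls -f "$dir/new" >/dev/null; done ) &
reader=$!
sleep 600
kill "$writer" "$reader"
```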

jkroonza commented 2 years ago

Ok, actions taken:

  1. Disabled global-threading again (to try and see if that makes a difference, very, very subjectively I'd say very little if any).
  2. Added background-qlen=128 to the two nodes where we're sending more traffic.
  3.  Added two new clean bricks, and initiated a remove-brick on one of the other pairs (ETA 350h, so hoping that will speed up overnight - surely it shouldn't take 400h to shunt around ~1.5TB of data; even at a line rate of 100Mbps, which is 10% of what's possible here, that should be no more than ~34h ... is there a way to speed this up? Increasing shd threads? See the sketch just below.)
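For reference, the sketch referred to in point 3 - the knobs we're considering for the remove-brick ETA (hedged - the throttle option name is from memory, please verify against `gluster volume set help`; the brick path is a placeholder):

```
# Check how far the data migration has progressed, and (possibly) allow the
# migration to run more aggressively.
gluster volume remove-brick gv_mail <host>:<brick-path> status
gluster volume set gv_mail cluster.rebal-throttle aggressive
```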

I've enabled state dumps on the FUSE processes now whenever we detect >15s; not sure how to interpret these just yet. Attached are the ones we've got so far:

glusterdump.29892.dump.1649336632.gz glusterdump.9706.dump.1649336632.gz glusterdump.28784.dump.1649336632.gz glusterdump.28784.dump.1649334276.gz glusterdump.7073.dump.1649336632.gz
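For completeness, the trigger is little more than the existing touch probe plus a signal (sketch; this assumes the standard SIGUSR1 statedump behaviour of gluster processes, with dumps landing in the statedump directory, /var/run/gluster by default):

```
# If the touch probe takes longer than 15s, ask every gv_mail fuse client on
# this host for a statedump via SIGUSR1.
mp=/var/spool/mail
start=$SECONDS
touch "$mp/.latencyprobe"
if (( SECONDS - start > 15 )); then
    pkill -USR1 -f 'glusterfs.*--volfile-id=gv_mail'
fi
```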

Reading through them the only thing that really stands out for me is this:

[global.callpool]
callpool_address=0x55c0d04e8b58
callpool.cnt=1

This in spite of:

performance.client-io-threads: on
performance.io-thread-count: 32
performance.high-prio-threads: 16
performance.normal-prio-threads: 16
performance.low-prio-threads: 16
performance.least-prio-threads: 4
performance.iot-pass-through: off

Checking the other dumps as well: we've got three state dumps with cnt=1 and one with cnt=4, which is a total of 7 operations, while according to the kernel the waiting count is much, much higher than that.

These operations are:

glusterdump.28784.dump.1649334276.gz:op=READDIRP
glusterdump.28784.dump.1649334276.gz:op=READDIRP
glusterdump.28784.dump.1649334276.gz:op=READDIRP
glusterdump.28784.dump.1649334276.gz:op=OPENDIR

glusterdump.29892.dump.1649336632.gz:op=FLUSH

glusterdump.7073.dump.1649336632.gz:op=READDIRP

glusterdump.9706.dump.1649336632.gz:op=FLUSH

So is glusterfs, in spite of those settings, still only processing a single (or a handful of) task(s) at a time? That would definitely explain things getting slow!

Perhaps lock contention somewhere is preventing the operations requested by the kernel from being queued to the bricks? Or am I just not understanding these state dumps at all?

jkroonza commented 2 years ago

Correlated kernel waiting and state dump:

/sys/fs/fuse/connections/23/waiting : 203

glusterdump.21456.dump.1649341674.gz

grep ^op= gives 305 ops, the vast majority of which are LOOKUP:

    113 op=LOOKUP
     43 op=STAT
     32 op=SETATTR
     30 op=READDIRP
     29 op=stack
     23 op=OPENDIR
     10 op=WRITE
     10 op=CREATE
      5 op=FLUSH
      4 op=UNLINK
      2 op=RENAME
      1 op=READ
      1 op=OPEN
      1 op=LINK
      1 op=FSYNC

Oh wow, was I wrong. Anyway, I would LOVE to know what you make of this, please. I would also like to understand whether a thread is dedicated to a task until it's done (i.e. it can block waiting for a response from the bricks), or whether it goes back into a work-pool pending a response from the servers; in other words, are we restricted to 32 of the above operations being executed at any point in time?

Similarly for the threads on the server - are they handling ONE IO task to completion (e.g. receive a LOOKUP from a client, handle it, respond, and then move on to the next)?

I guess the question is: what can we change to get higher throughput here? Is there any way to graph IOPs over time? Some form of SNMP integration perhaps, where we can query the values every minute or so and graph that?
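Partially answering my own question: since diagnostics.latency-measurement and count-fop-hits are already on, the volume profile interface could presumably be polled and fed to whatever grapher is handy (sketch only; the output parsing is left rough):

```
# Poll interval fop statistics once a minute and append them, timestamped,
# to a log that a grapher (munin/rrd/whatever) can consume.
gluster volume profile gv_mail start
while :; do
    {
        date +%s
        gluster volume profile gv_mail info incremental
    } >> /var/log/gluster_gv_mail_profile.log
    sleep 60
done
```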

jkroonza commented 2 years ago

We noticed the following:

At times one of the brick processes on uriel would spin out and consume all available CPU (24 cores). At some point it crashed and all latency issues were suddenly resolved. We proceeded to rename the brick folder prior to rm -rf; this repeatedly resulted in the other brick processes on the same filesystem crashing. We decided to offline all the bricks here and reformat the filesystem. While the heal is ongoing we've not seen cases of latency spikes.

We have seen other issues eg #3436 which is quite annoying.

Heal is progressing slowly (estimated around 10% currently based on disk usage).

fsck.ext4 didn't pick up anything on the filesystem at some point after this mess started, so we're fairly certain it's some glusterfs attribute or similar that was not set correctly at some point. Not sure if this is something that can be investigated from the core dumps?

jkroonza commented 2 years ago

We're writing this off as poor performance of distribute on top of AFR when doing lots and lots of renames - specifically maildir, which relies on rename(). As we move more and more data to a pure replicate glusterfs pair, overall performance is definitely improving.

Fortunately for this specific use-case distribute is not required.