gluster / glusterfs

Gluster Filesystem : Build your distributed storage in minutes
https://www.gluster.org
GNU General Public License v2.0

Slow directory listings over GlusterFS FUSE in Gluster 10.1 #3980

Open thalin opened 1 year ago

thalin commented 1 year ago

Description of problem: We have a directory with ~4300 files which takes about 15 seconds to run an ll against:

time ll /mnt/gluster/store|wc -l 
4352

real    0m14.979s
user    0m0.027s
sys 0m0.059s

This seems excessive. Each of these bricks is an SSD, so drive access should be relatively fast. Listing a single brick directly (the volume is distributed-dispersed) is much faster, on the order of 0.05s, with nearly the same number of entries:

time ll /data/glusterfs/the-volume/brick-4/store/|wc -l
4342

real    0m0.054s
user    0m0.022s
sys 0m0.033s

I've seen this take between 0.01 and 0.055s on a brick, so even with a sequential read of every brick this shouldn't take more than 2.25 seconds (and, if I understand correctly, there shouldn't be a sequential read anyway; it should happen in parallel). Here's a sequential ll against the offending directory on all of the bricks on one of the three machines:

time for brick in $(sudo gluster volume info the-volume|grep Brick|grep gluster1|awk -F: '{print $3}'); do ll $brick/store > /dev/null; done

real    0m0.774s
user    0m0.316s
sys 0m0.392s

Even when cached, the ll on the GlusterFS mount still takes ~7 seconds (that is, when I run the command a couple of times in a row in quick succession).
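
A minimal way to separate cold-cache from warm-cache timings, assuming the same mount point as above (an illustrative sketch, not from the original report; dropping the kernel caches requires root and only affects the client-side kernel caches, not Gluster's own md-cache):

# Cold cache: flush the kernel page/dentry/inode caches on the client first.
sync; echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
time ls -l /mnt/gluster/store | wc -l

# Warm cache: repeat immediately without dropping caches and compare.
time ls -l /mnt/gluster/store | wc -l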

I looked at some previous Gluster issues which seemed similar, such as this one, which mentions having to do lots of getdents calls, so I ran strace against an ls on the problem directory to see if there is similar behavior (even though that bug is for Gluster 3.x and we're on 10.1 on both client and server).

In my case I see a lot of getdents64 calls - 2241 of them in the strace I captured - and most of them return only 1 entry! This seems excessive (strace available on request, as noted below).
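
A quick way to quantify this pattern, assuming the same mount point as above (illustrative commands, not taken from the original report):

# Summary table of getdents64 calls issued by a single directory listing.
strace -f -c -e trace=getdents64 ls /mnt/gluster/store > /dev/null

# Full trace with per-call timings, for inspecting how much each call actually returns.
strace -f -T -e trace=getdents64 -o /tmp/ls-getdents.strace ls /mnt/gluster/store > /dev/null
wc -l /tmp/ls-getdents.strace   # roughly one line per getdents64 call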

The exact command to reproduce the issue: ls

The full output of the command that failed: it didn't fail, it's just slow

Expected results: I would expect that it wouldn't be this slow.

Mandatory info: - The output of the gluster volume info command:

Volume Name: the-volume
Type: Distributed-Disperse
Volume ID: 26fb3836-8ee2-4e1c-ae6a-8a90698b8bb7
Status: Started
Snapshot Count: 0
Number of Bricks: 15 x (2 + 1) = 45
Transport-type: tcp
Bricks:
Brick1: gluster4:/data/glusterfs/the-volume/brick-2
Brick2: gluster2:/data/glusterfs/the-volume/brick-2/b
Brick3: gluster1:/data/glusterfs/the-volume/brick-2
Brick4: gluster4:/data/glusterfs/the-volume/brick-3
Brick5: gluster2:/data/glusterfs/the-volume/brick-3/b
Brick6: gluster1:/data/glusterfs/the-volume/brick-3
Brick7: gluster4:/data/glusterfs/the-volume/brick-4
Brick8: gluster2:/data/glusterfs/the-volume/brick-4/b
Brick9: gluster1:/data/glusterfs/the-volume/brick-4
Brick10: gluster4:/data/glusterfs/the-volume/brick-5
Brick11: gluster2:/data/glusterfs/the-volume/brick-5/b
Brick12: gluster1:/data/glusterfs/the-volume/brick-5
Brick13: gluster4:/data/glusterfs/the-volume/brick-1
Brick14: gluster2:/data/glusterfs/the-volume/brick-1/b
Brick15: gluster1:/data/glusterfs/the-volume/brick-1
Brick16: gluster4:/data/glusterfs/the-volume/brick-6
Brick17: gluster2:/data/glusterfs/the-volume/brick-6/b
Brick18: gluster1:/data/glusterfs/the-volume/brick-6
Brick19: gluster4:/data/glusterfs/the-volume/brick-7
Brick20: gluster2:/data/glusterfs/the-volume/brick-7/b
Brick21: gluster1:/data/glusterfs/the-volume/brick-7
Brick22: gluster4:/data/glusterfs/the-volume/brick-8
Brick23: gluster2:/data/glusterfs/the-volume/brick-8/b
Brick24: gluster1:/data/glusterfs/the-volume/brick-8
Brick25: gluster4:/data/glusterfs/the-volume/brick-9
Brick26: gluster2:/data/glusterfs/the-volume/brick-9/b
Brick27: gluster1:/data/glusterfs/the-volume/brick-9
Brick28: gluster1:/data/glusterfs/the-volume/brick-10/brick
Brick29: gluster2:/data/glusterfs/the-volume/brick-10/brick
Brick30: gluster4:/data/glusterfs/the-volume/brick-10/brick
Brick31: gluster1:/data/glusterfs/the-volume/brick-11/brick
Brick32: gluster2:/data/glusterfs/the-volume/brick-11/brick
Brick33: gluster4:/data/glusterfs/the-volume/brick-11/brick
Brick34: gluster1:/data/glusterfs/the-volume/brick-12/brick
Brick35: gluster2:/data/glusterfs/the-volume/brick-12/brick
Brick36: gluster4:/data/glusterfs/the-volume/brick-12/brick
Brick37: gluster1:/data/glusterfs/the-volume/brick-13/brick
Brick38: gluster2:/data/glusterfs/the-volume/brick-13/brick
Brick39: gluster4:/data/glusterfs/the-volume/brick-13/brick
Brick40: gluster1:/data/glusterfs/the-volume/brick-14/brick
Brick41: gluster2:/data/glusterfs/the-volume/brick-14/brick
Brick42: gluster4:/data/glusterfs/the-volume/brick-14/brick
Brick43: gluster1:/data/glusterfs/the-volume/brick-15/brick
Brick44: gluster2:/data/glusterfs/the-volume/brick-15/brick
Brick45: gluster4:/data/glusterfs/the-volume/brick-15/brick
Options Reconfigured:
storage.owner-uid: 3333
storage.owner-gid: 3333
cluster.rebal-throttle: lazy
nfs.disable: on
transport.address-family: inet
storage.fips-mode-rchecksum: on
performance.client-io-threads: on
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
performance.stat-prefetch: on
performance.cache-invalidation: on
performance.md-cache-timeout: 600
network.inode-lru-limit: 200000
performance.readdir-ahead: on
performance.parallel-readdir: on
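
For reference, options like the readdir- and metadata-cache-related ones above are applied with gluster volume set; an illustrative subset (volume name taken from this report):

# Illustrative: how the readdir/metadata caching options listed above would be applied.
gluster volume set the-volume performance.readdir-ahead on
gluster volume set the-volume performance.parallel-readdir on
gluster volume set the-volume performance.md-cache-timeout 600
gluster volume set the-volume network.inode-lru-limit 200000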

- The output of the gluster volume status command:

Status of volume: the-volume
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick gluster4:/data/glusterfs/the-volu
me/brick-2                                  59323     0          Y       2079 
Brick gluster2:/data/glusterfs/the-volu
me/brick-2/b                                49945     0          Y       2072 
Brick gluster1:/data/glusterfs/the-volu
me/brick-2                                  55244     0          Y       775693
Brick gluster4:/data/glusterfs/the-volu
me/brick-3                                  52160     0          Y       2083 
Brick gluster2:/data/glusterfs/the-volu
me/brick-3/b                                59480     0          Y       2079 
Brick gluster1:/data/glusterfs/the-volu
me/brick-3                                  57257     0          Y       775881
Brick gluster4:/data/glusterfs/the-volu
me/brick-4                                  59794     0          Y       2092 
Brick gluster2:/data/glusterfs/the-volu
me/brick-4/b                                49511     0          Y       2095 
Brick gluster1:/data/glusterfs/the-volu
me/brick-4                                  55866     0          Y       776128
Brick gluster4:/data/glusterfs/the-volu
me/brick-5                                  49806     0          Y       2104 
Brick gluster2:/data/glusterfs/the-volu
me/brick-5/b                                49886     0          Y       2101 
Brick gluster1:/data/glusterfs/the-volu
me/brick-5                                  60343     0          Y       776446
Brick gluster4:/data/glusterfs/the-volu
me/brick-1                                  49911     0          Y       2109 
Brick gluster2:/data/glusterfs/the-volu
me/brick-1/b                                58524     0          Y       2116 
Brick gluster1:/data/glusterfs/the-volu
me/brick-1                                  54262     0          Y       611244
Brick gluster4:/data/glusterfs/the-volu
me/brick-6                                  49605     0          Y       2121 
Brick gluster2:/data/glusterfs/the-volu
me/brick-6/b                                60200     0          Y       2122 
Brick gluster1:/data/glusterfs/the-volu
me/brick-6                                  53176     0          Y       811671
Brick gluster4:/data/glusterfs/the-volu
me/brick-7                                  58323     0          Y       2150 
Brick gluster2:/data/glusterfs/the-volu
me/brick-7/b                                57072     0          Y       2142 
Brick gluster1:/data/glusterfs/the-volu
me/brick-7                                  56213     0          Y       811960
Brick gluster4:/data/glusterfs/the-volu
me/brick-8                                  57732     0          Y       2144 
Brick gluster2:/data/glusterfs/the-volu
me/brick-8/b                                53440     0          Y       2148 
Brick gluster1:/data/glusterfs/the-volu
me/brick-8                                  49418     0          Y       812373
Brick gluster4:/data/glusterfs/the-volu
me/brick-9                                  54655     0          Y       2146 
Brick gluster2:/data/glusterfs/the-volu
me/brick-9/b                                57784     0          Y       2160 
Brick gluster1:/data/glusterfs/the-volu
me/brick-9                                  49747     0          Y       812824
Brick gluster1:/data/glusterfs/the-volu
me/brick-10/brick                           57437     0          Y       1041056
Brick gluster2:/data/glusterfs/the-volu
me/brick-10/brick                           50079     0          Y       11329
Brick gluster4:/data/glusterfs/the-volu
me/brick-10/brick                           53452     0          Y       7141 
Brick gluster1:/data/glusterfs/the-volu
me/brick-11/brick                           52977     0          Y       1041505
Brick gluster2:/data/glusterfs/the-volu
me/brick-11/brick                           56015     0          Y       11759
Brick gluster4:/data/glusterfs/the-volu
me/brick-11/brick                           60303     0          Y       7212 
Brick gluster1:/data/glusterfs/the-volu
me/brick-12/brick                           50505     0          Y       1042267
Brick gluster2:/data/glusterfs/the-volu
me/brick-12/brick                           53435     0          Y       12372
Brick gluster4:/data/glusterfs/the-volu
me/brick-12/brick                           54059     0          Y       7292 
Brick gluster1:/data/glusterfs/the-volu
me/brick-13/brick                           60925     0          Y       1043081
Brick gluster2:/data/glusterfs/the-volu
me/brick-13/brick                           51823     0          Y       13058
Brick gluster4:/data/glusterfs/the-volu
me/brick-13/brick                           56723     0          Y       7445 
Brick gluster1:/data/glusterfs/the-volu
me/brick-14/brick                           57201     0          Y       1043815
Brick gluster2:/data/glusterfs/the-volu
me/brick-14/brick                           60582     0          Y       13811
Brick gluster4:/data/glusterfs/the-volu
me/brick-14/brick                           60141     0          Y       7662 
Brick gluster1:/data/glusterfs/the-volu
me/brick-15/brick                           59438     0          Y       1044952
Brick gluster2:/data/glusterfs/the-volu
me/brick-15/brick                           54259     0          Y       14678
Brick gluster4:/data/glusterfs/the-volu
me/brick-15/brick                           51585     0          Y       7886 
Self-heal Daemon on localhost               N/A       N/A        Y       812835
Self-heal Daemon on gluster4.     N/A       N/A        Y       2193 
Self-heal Daemon on gluster2.     N/A       N/A        Y       2190 

Task Status of Volume the-volume
------------------------------------------------------------------------------
Task                 : Rebalance           
ID                   : 4313b398-a5d7-46e4-883d-ea3147987e98
Status               : completed

- The output of the gluster volume heal command: Error : Request timed out

- Provide logs present on following locations of client and server nodes - /var/log/glusterfs/

There are a lot of these and they're pretty large. Would a sample be OK?

- Is there any crash? Provide the backtrace and coredump: I have a large strace - it's 216KB, which seems too large to include directly in a bug report.

Additional info: Some names/paths have been changed to protect the innocent

- The operating system / glusterfs version: Both client and server are running Ubuntu 22.04.1 LTS with Gluster version 10.1.

thalin commented 1 year ago

Hey folks, I was able to reproduce this behavior with a fresh set of Ubuntu 22.10 VMs and Gluster 3.10. I've made a reproduction case using Ansible and created a repo for it. I also included some of the stuff I had excluded for brevity in the original bug report (logs, strace, etc). You can find this repo here. Hopefully this helps to debug the issue!

thalin commented 1 year ago

I tested this on Gluster 11.0 (via an upgrade from 10.1 - I have not yet created a cluster from scratch on 11 and tested that) from the Gluster team PPA, and it doesn't seem to make a big difference, despite some Gluster release notes claiming improvements to possibly related code. It's still as slow as ever, as far as I can tell.

thalin commented 1 year ago

Hello folks, I just wanted to try to follow up here. We're still having this problem. I'd be happy to provide any further information requested. Thanks!

Mimikoo commented 1 year ago

Confirming the problem. I just started testing different distributed storage systems, and listing takes a lot of time. Also, gluster volume status echoes Error : Request timed out. I thought glusterfs was a promising one.

[2023-04-28 08:21:44.094407 +0000] I [MSGID: 106499] [glusterd-handler.c:4458:__glusterd_handle_status_volume] 0-management: Received status volume req for volume www1
[2023-04-28 08:22:02.704835 +0000] W [glusterd-locks.c:577:glusterd_mgmt_v3_lock] (-->/usr/lib/x86_64-linux-gnu/glusterfs/9.2/xlator/mgmt/glusterd.so(+0x9e952) [0x7f5b32fce952] -->/usr/lib/x86_64-linux-gnu/glusterfs/9.2/xlator/mgmt/glusterd.so(+0xd48f2) [0x7f5b330048f2] -->/usr/lib/x86_64-linux-gnu/glusterfs/9.2/xlator/mgmt/glusterd.so(+0xd4352) [0x7f5b33004352] ) 0-management: Lock for www1 held by 7885d813-0968-4e8d-8f1d-12a768288f62
[2023-04-28 08:22:02.704922 +0000] E [MSGID: 106118] [glusterd-syncop.c:1923:gd_sync_task_begin] 0-management: Unable to acquire lock for www1
[2023-04-28 08:22:18.694418 +0000] W [glusterd-locks.c:577:glusterd_mgmt_v3_lock] (-->/usr/lib/x86_64-linux-gnu/glusterfs/9.2/xlator/mgmt/glusterd.so(+0x9e952) [0x7f5b32fce952] -->/usr/lib/x86_64-linux-gnu/glusterfs/9.2/xlator/mgmt/glusterd.so(+0xd48f2) [0x7f5b330048f2] -->/usr/lib/x86_64-linux-gnu/glusterfs/9.2/xlator/mgmt/glusterd.so(+0xd4352) [0x7f5b33004352] ) 0-management: Lock for www1 held by 7885d813-0968-4e8d-8f1d-12a768288f62
[2023-04-28 08:22:58.732966 +0000] W [glusterd-locks.c:577:glusterd_mgmt_v3_lock] (-->/usr/lib/x86_64-linux-gnu/glusterfs/9.2/xlator/mgmt/glusterd.so(+0x9e952) [0x7f5b32fce952] -->/usr/lib/x86_64-linux-gnu/glusterfs/9.2/xlator/mgmt/glusterd.so(+0xd48f2) [0x7f5b330048f2] -->/usr/lib/x86_64-linux-gnu/glusterfs/9.2/xlator/mgmt/glusterd.so(+0xd4352) [0x7f5b33004352] ) 0-management: Lock for www1 held by 7885d813-0968-4e8d-8f1d-12a768288f62
The message "E [MSGID: 106118] [glusterd-syncop.c:1923:gd_sync_task_begin] 0-management: Unable to acquire lock for www1" repeated 2 times between [2023-04-28 08:22:02.704922 +0000] and [2023-04-28 08:22:58.733063 +0000]
mohit84 commented 1 year ago

performance.client-io-threads: on

Can you please share the below data?

  1. Enable profile gluster v profile start
  2. Clean the profile gluster v profile info clear
  3. Run the test case
  4. Run the profile again gluster v profile info

Please share the current gluster release on which you are executing a test case.
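
As a concrete sequence, assuming the volume from this report and the original reproducer (an illustrative sketch, not part of the comment itself):

# 1-2. Start profiling on the volume and clear any previously collected counters.
gluster volume profile the-volume start
gluster volume profile the-volume info clear

# 3. Run the reproducer against the FUSE mount.
time ls -l /mnt/gluster/store | wc -l

# 4. Dump the per-brick FOP latency/count statistics gathered during the test.
gluster volume profile the-volume info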

Mimikoo commented 1 year ago

My problem was resolved. It was caused by MTU 9000 being set on the server interfaces while the interconnecting switches did not have jumbo frames enabled.
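
A quick way to check for this kind of MTU mismatch (illustrative; the interface name is a placeholder, and the peer host is taken from this report):

# Show the MTU configured on the interface (eth0 is a placeholder).
ip link show eth0

# Test that jumbo frames actually pass end-to-end: 8972 = 9000 minus 28 bytes of IP/ICMP headers,
# and -M do forbids fragmentation, so this fails if any hop in between only supports a 1500 MTU.
ping -c 3 -M do -s 8972 gluster2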

anon314159 commented 10 months ago

I can confirm this problem has not been resolved. Running strace -Tc ls -hal reveals an excessive number of getdents calls, among other directory- and stat-related syscalls, on any FUSE mount associated with GlusterFS.

For additional context, I also have MooseFS running in parallel with Gluster on the exact same hardware configuration, and the slow directory listings do not occur there. Both file systems use FUSE, so this is most likely not a FUSE issue but something internal to GlusterFS. The problem isn't a small-file problem but a many-file issue.

Sadly, my servers reside in an air-gapped environment and I cannot provide logs, configuration data, or console output. However, this issue continues to exist in every version of GlusterFS (7.x - 11.x) regardless of the volume type or configuration options set. Based on the longevity of this issue, I find it highly unlikely it will ever get resolved before Gluster's EOL.

mohit84 commented 10 months ago

Can you please share the below data?

  1. Enable profile gluster v profile start
  2. Clean the profile gluster v profile info clear
  3. Run the test case
  4. Run the profile again gluster v profile info

Please share the current gluster release on which you are executing a test case.

Please also share the gluster v info output along with the profile data.

anon314159 commented 10 months ago

Sure thing, I will start profiling one of our test volumes today. Unfortunately, I work in an air-gapped environment, so providing any dumps is going to be tricky. I will attempt to transpose the data listed in your template, so please bear with me.

My environment is running RHEL 8.9 and GlusterFS 10.3 with a combination of dispersed and distributed-dispersed volumes in an 8 x (2 + 1) configuration.

I think this issue tracks back to the limited buffer size associated with readdir/readdirp syscalls creating a bottleneck. Actual volume performance is relatively good based on the hardware, but FUSE seems to only pull stat/metadata in groups of 1K files at a time.

mohit84 commented 10 months ago

After enabling readdir-ahead there should not be an issue with the buffer size for the readdir(p) syscall, because readdir-ahead increases the buffer size to 128k and saves the dentries in its cache.
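
One way to confirm the readdir-ahead settings on a volume (an illustrative sketch; performance.rda-request-size and performance.rda-cache-limit are the readdir-ahead tunables in recent releases, and option names may vary slightly between versions; the volume name is taken from this report):

# Check that readdir-ahead is enabled and inspect its request/cache sizes.
gluster volume get the-volume performance.readdir-ahead
gluster volume get the-volume performance.rda-request-size   # readdir-ahead buffer, defaults to 128KB
gluster volume get the-volume performance.rda-cache-limit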

mohit84 commented 10 months ago

Just FYI, readdir-ahead works only for readdirp, not for readdir.

anon314159 commented 10 months ago

Gotcha, thanks for the quick reply. I will verify read-ahead is enabled and working correctly.

mohit84 commented 10 months ago

It's readdir-ahead, not read-ahead, just FYI.