gluster / glusterfs

Gluster Filesystem : Build your distributed storage in minutes
https://www.gluster.org
GNU General Public License v2.0

Gluster 10.3 directory listing slow #4195

Open shexuel opened 1 year ago

shexuel commented 1 year ago

Description of problem: The Gluster volume is slow on directory listings; error messages during a slow listing:

[2023-07-05 11:15:09.570688 +0000] I [fuse-bridge.c:4992:notify_kernel_loop] 0-glusterfs-fuse: len: 71, rv: -1, errno: 20
[2023-07-05 11:15:09.570725 +0000] I [fuse-bridge.c:4992:notify_kernel_loop] 0-glusterfs-fuse: len: 71, rv: -1, errno: 20
[2023-07-05 11:15:09.570732 +0000] I [fuse-bridge.c:4992:notify_kernel_loop] 0-glusterfs-fuse: len: 71, rv: -1, errno: 20
[2023-07-05 11:15:09.570741 +0000] I [fuse-bridge.c:4992:notify_kernel_loop] 0-glusterfs-fuse: len: 71, rv: -1, errno: 20
[2023-07-05 11:15:09.570748 +0000] I [fuse-bridge.c:4992:notify_kernel_loop] 0-glusterfs-fuse: len: 71, rv: -1, errno: 20
[2023-07-05 11:15:09.570772 +0000] I [fuse-bridge.c:4992:notify_kernel_loop] 0-glusterfs-fuse: len: 71, rv: -1, errno: 20
The message "I [MSGID: 108031] [afr-common.c:3203:afr_local_discovery_cbk] 42-archive-replicate-0: selecting local read_child archive-client-0" repeated 148 times between [2023-07-05 11:13:19.417348 +0000] and [2023-07-05 11:15:18.030077 +0000]
[2023-07-05 11:15:19.044807 +0000] I [MSGID: 108031] [afr-common.c:3203:afr_local_discovery_cbk] 42-archive-replicate-0: selecting local read_child archive-client-0
[2023-07-05 11:15:20.462206 +0000] W [fuse-bridge.c:310:check_and_dump_fuse_W] 0-glusterfs-fuse: writing to fuse device yielded ENOENT 256 times
[2023-07-05 11:15:42.781829 +0000] W [fuse-bridge.c:310:check_and_dump_fuse_W] 0-glusterfs-fuse: writing to fuse device yielded ENOENT 256 times
[2023-07-05 11:15:50.475283 +0000] I [fuse-bridge.c:4992:notify_kernel_loop] 0-glusterfs-fuse: len: 71, rv: -1, errno: 20
[2023-07-05 11:15:50.475313 +0000] I [fuse-bridge.c:4992:notify_kernel_loop] 0-glusterfs-fuse: len: 71, rv: -1, errno: 20
[2023-07-05 11:15:50.475320 +0000] I [fuse-bridge.c:4992:notify_kernel_loop] 0-glusterfs-fuse: len: 71, rv: -1, errno: 20
[2023-07-05 11:15:50.475325 +0000] I [fuse-bridge.c:4992:notify_kernel_loop] 0-glusterfs-fuse: len: 71, rv: -1, errno: 20
[2023-07-05 11:15:50.629862 +0000] I [fuse-bridge.c:4992:notify_kernel_loop] 0-glusterfs-fuse: len: 71, rv: -1, errno: 20
[2023-07-05 11:15:58.359047 +0000] I [fuse-bridge.c:4992:notify_kernel_loop] 0-glusterfs-fuse: len: 71, rv: -1, errno: 20
[2023-07-05 11:15:58.359099 +0000] I [fuse-bridge.c:4992:notify_kernel_loop] 0-glusterfs-fuse: len: 71, rv: -1, errno: 20
[2023-07-05 11:16:38.511419 +0000] W [fuse-bridge.c:310:check_and_dump_fuse_W] 0-glusterfs-fuse: writing to fuse device yielded ENOENT 256 times

The exact command to reproduce the issue: `ls -la /archive`

The full output of the command that failed:

**Expected results:**

**Mandatory info:**

**- The output of the `gluster volume info` command**:

Volume Name: archive
Type: Replicate
Volume ID: 787730f4-3d05-4e37-95be-925cc3b266e6
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 7 = 7
Transport-type: tcp
Bricks:
Brick1: 1c:/mnt/archive
Brick2: 2c:/mnt/archive
Brick3: 3c:/mnt/archive
Brick4: 4c:/mnt/archive
Brick5: 5c:/mnt/archive
Brick6: 6c:/mnt/archive
Brick7: 7c:/mnt/archive
Options Reconfigured:
performance.nl-cache-timeout: 600
performance.nl-cache: on
performance.rda-cache-limit: 128MB
performance.md-cache-timeout: 600
performance.cache-invalidation: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
performance.force-readdirp: true
dht.force-readdirp: on
network.inode-lru-limit: 200000
performance.open-behind: off
cluster.shd-max-threads: 4
performance.readdir-ahead: on
cluster.entry-self-heal: on
cluster.data-self-heal: on
cluster.metadata-self-heal: on
performance.flush-behind: off
cluster.readdir-optimize: on
performance.io-thread-count: 64
client.event-threads: 8
server.event-threads: 8
cluster.self-heal-window-size: 1024
cluster.self-heal-readdir-size: 16384
cluster.background-self-heal-count: 128
performance.io-cache: on
performance.quick-read: on
performance.stat-prefetch: off
performance.read-ahead: on
performance.parallel-readdir: on
performance.strict-o-direct: off
performance.write-behind: off
cluster.granular-entry-heal: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: on

**- The output of the `gluster volume status` command**:

Status of volume: archive
Gluster process                           TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 1c:/mnt/archive                     59908     0          Y       4493
Brick 2c:/mnt/archive                     59683     0          Y       2990
Brick 3c:/mnt/archive                     58617     0          Y       3000
Brick 4c:/mnt/archive                     56390     0          Y       2966
Brick 5c:/mnt/archive                     51213     0          Y       2939
Brick 6c:/mnt/archive                     52787     0          Y       3015
Brick 7c:/mnt/archive                     52074     0          Y       2938
Self-heal Daemon on localhost             N/A       N/A        Y       4834
Self-heal Daemon on 4c                    N/A       N/A        Y       3026
Self-heal Daemon on 2c                    N/A       N/A        Y       3057
Self-heal Daemon on 3c                    N/A       N/A        Y       3071
Self-heal Daemon on 6c                    N/A       N/A        Y       3083
Self-heal Daemon on 5c                    N/A       N/A        Y       3005
Self-heal Daemon on 7c                    N/A       N/A        Y       3002

Task Status of Volume archive
------------------------------------------------------------------------------
There are no active volume tasks

**- The output of the `gluster volume heal` command**:

Brick 1c:/mnt/archive
Status: Connected
Number of entries: 0

Brick 2c:/mnt/archive
Status: Connected
Number of entries: 0

Brick 3c:/mnt/archive
Status: Connected
Number of entries: 0

Brick 4c:/mnt/archive
Status: Connected
Number of entries: 0

Brick 5c:/mnt/archive
Status: Connected
Number of entries: 0

Brick 6c:/mnt/archive
Status: Connected
Number of entries: 0

Brick 7c:/mnt/archive
Status: Connected
Number of entries: 0

**- Provide logs present on following locations of client and server nodes - /var/log/glusterfs/**

**- Is there any crash? Provide the backtrace and coredump**

**Additional info:**

- The operating system / glusterfs version: CentOS Stream 8, Gluster 10.3

shexuel commented 1 year ago

If you run `gluster volume set archive performance.client-io-threads off`, everything works normally after that. But this command (or any other volume set) only resets the volume for some time; the issue is back again after a day.
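
For completeness, a minimal sketch of applying and verifying that workaround (the volume name `archive` is from the report; `gluster volume get` is used here only to confirm the new value took effect):

```bash
# Workaround described above: turn off client-side io-threads, then check
# the option's current value on the volume.
gluster volume set archive performance.client-io-threads off
gluster volume get archive performance.client-io-threads
```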

shexuel commented 1 year ago

Updated to version 10.4; everything is the same. I noticed that when a large number of files are read simultaneously, bricks disconnect from the volume and then reconnect.

I have run a test with these settings and everything is the same:

- config.client-threads: 48
- config.brick-threads: 48
- transport.listen-backlog: 4096
- server.outstanding-rpc-limit: 512
- network.inode-lru-limit: 30000
- client.event-threads: 32
- server.event-threads: 32
- performance.io-thread-count: 64
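
If anyone wants to repeat the test, a sketch of applying the same options in one pass (option names and values are copied verbatim from the list above; no claim is made that they suit any other deployment):

```bash
# Apply the tuning options listed above to the "archive" volume.
while read -r opt val; do
    gluster volume set archive "$opt" "$val"
done <<'EOF'
config.client-threads 48
config.brick-threads 48
transport.listen-backlog 4096
server.outstanding-rpc-limit 512
network.inode-lru-limit 30000
client.event-threads 32
server.event-threads 32
performance.io-thread-count 64
EOF
```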

shexuel commented 1 year ago

I created a monitoring bash script that checks whether the directory listing is slow (above one second) and, if so, resets it with just:

    gluster volume set archive parallel-readdir on
    sleep 2
    gluster volume set archive parallel-readdir off

Everything works normally with that in place, but not without these commands. The problem is not solved, just worked around.
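
For reference, a minimal sketch of such a watchdog, assuming the volume name `archive`, the mount point `/archive`, and the one-second threshold mentioned above; the timing logic is an assumption, not the author's actual script:

```bash
#!/usr/bin/env bash
# Hypothetical watchdog based on the workaround described above: if a listing
# of the mount point takes longer than ~1 second, toggle parallel-readdir.
VOLUME="archive"        # volume name from the report
MOUNTPOINT="/archive"   # FUSE mount point from the report
THRESHOLD_MS=1000       # "above one second"

start=$(date +%s%3N)                # milliseconds (GNU date)
ls -la "$MOUNTPOINT" > /dev/null
end=$(date +%s%3N)

if (( end - start > THRESHOLD_MS )); then
    gluster volume set "$VOLUME" parallel-readdir on
    sleep 2
    gluster volume set "$VOLUME" parallel-readdir off
fi
```

Run periodically (e.g. from cron), this matches the behaviour described above: listings stay fast only while the toggle keeps being applied.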

zyfinity01 commented 1 year ago

Did you manage to resolve this? I am having the same issue.

shexuel commented 1 year ago

As I said, just run:

    gluster volume set archive parallel-readdir on
    sleep 2
    gluster volume set archive parallel-readdir off

and it will work.

anon314159 commented 10 months ago

This issue has not been resolved; I can reproduce it in nearly every version of GlusterFS going as far back as 7.x, regardless of volume configuration or tuning options. Even throwing high-end hardware at the problem does not resolve it (e.g. I built an all-NVMe flash cluster with a processor with high single-threaded IPC, and listing directories with lots of files is still incredibly slow). Additionally, I have tried several other FUSE-based distributed file systems, such as MooseFS, BeeGFS, and SeaweedFS, on the same hardware without any of the major issues that persist with GlusterFS. I guess by the time this issue gets resolved, Red Hat will have EOL'd it.

This file system is a terrible choice if you require any form of end-user interaction with it (FUSE, VFS, or re-exports as SMB/NFS), but it is more than capable of handling backend workloads where user interaction is far less of a concern. Lately, I have been migrating all of my customers away from it due to this persistent issue.

The source of the problem isn't necessarily GlusterFS itself but its implementation of the readdir/readdirp (getdents/getdents64) system calls. This has historically been problematic because the implementation uses a very small buffer size (1024 entries) to conserve memory, which does not scale well with directories containing lots of files spanning multiple servers: it trashes performance and adds an insane amount of latency. The other file systems do not experience this issue because they use dedicated metadata storage instead of GlusterFS's peer-to-peer distributed design, though that approach brings its own set of problems, such as a SPOF and performance bottlenecks.

https://linux.die.net/man/2/getdents64
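
As an illustration of the readdir behaviour described above, a hedged example of observing the getdents64 round trips from a client (the `/archive` mount point is taken from the report; the buffer sizes you actually see depend on the kernel, glibc, and `ls` implementation):

```bash
# Trace the getdents64 calls issued while listing a directory on the FUSE
# mount; each line shows the buffer size passed in and the bytes returned.
# Many small round trips over FUSE is where the listing latency accumulates.
strace -e trace=getdents64 ls -f /archive > /dev/null

# Or summarise: call count and total time spent in getdents64.
strace -c -e trace=getdents64 ls -f /archive > /dev/null
```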

shexuel commented 10 months ago

Thanks, this is terrible. What FS did you use in the end? I have tens or hundreds of millions of files.

anon314159 commented 10 months ago

> Thanks, this is terrible. What FS did you use in the end? I have tens or hundreds of millions of files.

Yes, but determining what file system will work for you should be requirements-based. That said, I have used several open-source and proprietary ones and found that both Ceph (open source) and Dell's OneFS (proprietary, and expensive as hell) offer the best performance, features, protocol support, and security for the enterprise. Ceph has a considerably steeper learning curve to set up and configure, because you need to understand the dedicated roles it requires and take them into account during the initial setup.

If you have a few dollars to spend, I believe MooseFS Pro is hands down just as easy to set up as GlusterFS and has a much better management interface (the CGI server). My reason for going with Pro would be if you need erasure coding; if a distributed, replicated, or distributed-replicated volume type is what you currently use, then the non-Pro version will work perfectly well using storage goals and classes.

One last thing: the file systems listed above do not show the same directory-listing issues, or, for that matter, the really obnoxious volume performance degradation when adding files/folders. Yes, we discovered that volume performance straight-up tanks after adding new files or folders over a given period of time, regardless of the LRU or other volume settings.

zyfinity01 commented 10 months ago

Would MooseFS free be able to be set up alongside Gluster? With my Gluster setup I'm currently just using a distributed volume on top of ZFS. So would I be able to set up MooseFS on the same bricks and test them side by side (well, at least test MooseFS for reads), so as not to interfere with GlusterFS writes?

anon314159 commented 10 months ago

Yes, you can run MooseFS in parallel with Gluster; it's just a matter of creating another directory, external to your GlusterFS bricks, on the same storage device.
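
For example, a minimal sketch of that layout on one node, using the brick path from the volume info above; the `/mnt/mfschunks` directory name, the `mfs` user, and the `/etc/mfs/mfshdd.cfg` path are assumptions and may differ per distribution:

```bash
# The existing GlusterFS brick (/mnt/archive) and a new, brick-external
# MooseFS chunkserver directory live side by side on the same filesystem.
mkdir -p /mnt/mfschunks            # separate directory for MooseFS chunks
chown mfs:mfs /mnt/mfschunks       # chunkserver typically runs as the mfs user

# Register the directory with the chunkserver (config path may vary).
echo "/mnt/mfschunks" >> /etc/mfs/mfshdd.cfg
```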

anon314159 commented 6 months ago

Reference this issue as the possible source of slow directory operations:

https://github.com/gluster/glusterfs/issues/4335

jkroonza commented 1 month ago

> Reference this issue as the possible source of slow directory operations:
>
> #4335

Confirming this, along with a kernel FUSE implementation issue for which I've got a private patch that still needs a v3. I don't have the correct link at hand, but basically the FUSE read buffer for readdir needs to be made much, much larger, as in orders of magnitude larger (256 KB vs 4 KB).