gluster / glusterfs

Gluster Filesystem : Build your distributed storage in minutes
https://www.gluster.org
GNU General Public License v2.0

dentry_hashsize #4335

Open jkroonza opened 6 months ago

jkroonza commented 6 months ago

Description of problem:

Slow performance ... as usual :).

The exact command to reproduce the issue:

None.

Mandatory info:
- The output of the gluster volume info command:

uriel [15:02:05] ~ # gluster volume info

Volume Name: mail
Type: Replicate
Volume ID: 2938a063-f53d-4a1c-a84f-c1406bdc260d
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: xxx:/mnt/gluster/mail
Brick2: yyy:/mnt/gluster/mail
Options Reconfigured:
performance.quick-read: off
performance.client-io-threads: off
transport.address-family: inet
storage.fips-mode-rchecksum: on
cluster.granular-entry-heal: enable
cluster.readdir-optimize: on
performance.least-prio-threads: 8
performance.low-prio-threads: 8
performance.normal-prio-threads: 16
performance.high-prio-threads: 32
performance.io-thread-count: 64
cluster.data-self-heal-algorithm: full
server.event-threads: 4
config.brick-threads: 64
config.global-threading: on
performance.iot-pass-through: true
performance.open-behind: off
cluster.locking-scheme: granular
performance.open-behind-pass-through: true
config.client-threads: 2
client.event-threads: 4
performance.write-behind-pass-through: true
performance.write-behind: off
network.inode-lru-limit: 1048576
performance.parallel-readdir: off
performance.readdir-ahead: off

- Is there any crash? Provide the backtrace and coredump

No crash.

Additional info:

I'd like to discuss options to increase:

network.inode-lru-limit: 1048576

Our fuse mounts are mounted with --lru-limit=2097152 --inode-table-size=2097152 (a state dump shows just over 1m entries in use), and we reckon bumping the same for the bricks should have some effect. As I understand it, network.inode-lru-limit sets the same/similar limit for the bricks. The bricks run on NL-SAS drives behind a battery-backed RAID controller with a 1TB NVMe write-through LVM dm-cache, and iostat is starting to show some pressure on the NVMe units, so we would like to sacrifice some more RAM in order to improve the overall performance.

There is also dentry_hashsize, which I believe may make a difference, in particular because the single largest bottleneck is readdir(). Even though we've applied kernel patches to read 256KB at a time rather than 4KB (in line with glibc's getdents() calls), we suspect the performance here can still be improved if glusterfs can more frequently "pick up where it left off".

Unfortunately I don't understand all the impacts, and previous issue reports had some patches that caused other problems, so we're looking for simpler, lower-impact changes, such as increasing the amount of memory used.

From gluster state dump for one of the fuse mounts:

[xlator.mount.fuse.itable]
xlator.mount.fuse.itable.dentry_hashsize=14057
xlator.mount.fuse.itable.inode_hashsize=2097152
xlator.mount.fuse.itable.name=meta-autoload/inode
xlator.mount.fuse.itable.lru_limit=2097152
xlator.mount.fuse.itable.active_size=3811
xlator.mount.fuse.itable.lru_size=1025826
xlator.mount.fuse.itable.purge_size=0
xlator.mount.fuse.itable.invalidate_size=0

So we're hoping that we can bump that 14057 to something a little larger (say 100003). Whilst we can certainly just apply this:

--- a/libglusterfs/src/inode.c
+++ b/libglusterfs/src/inode.c
@@ -1699,7 +1699,7 @@ inode_table_with_invalidator(uint32_t lru_limit, xlator_t *xl,

     if (dentry_hashsize == 0) {
         /* Prime number for uniform distribution */
-        new->dentry_hashsize = 14057;
+        new->dentry_hashsize = 100003;
     } else {
         new->dentry_hashsize = dentry_hashsize;
     }

We are not certain we understand the impact or everything that would be affected, so we don't want to blindly jump in and make random changes without understanding the potential consequences.

Kind regards, Jaco

jkroonza commented 6 months ago

Just to note: the only inode_table with a non-default dentry_hashsize is the SHD, which uses 131. This may in part explain the generally poor performance of SHD when things go badly wrong here, although it's been drastically better in 11.X.

mykaul commented 6 months ago

Which version are you using?

Do you really feel the bottleneck in posix_readdirp_fill() is inode_grep() and not posix_pstat() (which goes all the way to disk to fetch metadata, multiple times)?

jkroonza commented 6 months ago

Hi @mykaul

Currently on version 11.1. The kernel is 6.4.6 with https://lore.kernel.org/lkml/20230727081237.18217-1-jaco@uls.co.za/ applied (there is a follow-up discussion and I never got around to v3; v2 works but is apparently definitely wrong and should crash the kernel, yet it doesn't, nor does it seem to leak memory; I think I'm just pointlessly over-allocating a lot of memory).

I don't understand things well enough to give you an offhand answer. What I can state without having to think about it is that getdents() is not performing well; as to where the bottleneck is inside glusterfs, I'm guessing:

On a folder with ~10k files in it (unfortunately not that unusual; it was 70k, and we've got folders with >100k):

# strace -T find .
...
getdents64(4, 0x7efd25f42040 /* 486 entries */, 131072) = 36592 <0.819250>
brk(0x56457f7d4000)                     = 0x56457f7d4000 <0.000027>
getdents64(4, 0x7efd25f42040 /* 485 entries */, 131072) = 36864 <0.894693>
brk(0x56457f7f5000)                     = 0x56457f7f5000 <0.000019>
getdents64(4, 0x7efd25f42040 /* 485 entries */, 131072) = 36824 <0.869823>
brk(0x56457f816000)                     = 0x56457f816000 <0.000019>
getdents64(4, 0x7efd25f42040 /* 485 entries */, 131072) = 36888 <1.130885>
brk(0x56457f837000)                     = 0x56457f837000 <0.000023>
getdents64(4, 0x7efd25f42040 /* 485 entries */, 131072) = 36744 <1.144131>
brk(0x56457f858000)                     = 0x56457f858000 <0.000026>

So yea, there you go: at ~500 entries/second it takes ~20 seconds just to enumerate the file list. With larger folders this gets much, much slower very quickly. On the above folder, back when it held around 70k files, we waited several minutes for this to complete. The further into the sequence you get, the slower the calls become (it's as if glusterfs restarts the readdir on the underlying file system almost every time).

For maildir this is probably the single most important sequence of system calls. Just to give an idea (df -hi for one of the bricks):

Filesystem              Inodes IUsed IFree IUse% Mounted on
/dev/mapper/lvm-gluster   448M   14M  435M    4% /mnt/gluster

And:

/mnt/gluster/mail # find . -type d | wc
 231662  396727 7857315

So that's an average of 62.90 dentries per folder (assuming my calculations are correct). However, a great many folders, at least two thirds, generally have nothing in them, and many clients use POP3, so those folders remain small too. As stated, the larger folders can trivially have upwards of 100k entries; I'll need to make some effort to find a few of these again, but a very quick search on some users known to have largish mailboxes turns up a fair number with 40k+ entries.

Before having patched the kernel this was sometimes 3-4 seconds just to get perhaps 18 entries per getdents() call.

How can I confirm for you if the bottleneck is posix_readdirp_fill(), inode_grep() or posix_pstat()?

mykaul commented 6 months ago

How can I confirm for you if the bottleneck is posix_readdirp_fill(), inode_grep() or posix_pstat()?

I used to look at 'perf' results. But also counting time (with debug code added) around them could be useful.

I think there's also a way to dump all entries and then see how many we have on the same hash, which will help determine whether we should indeed make it larger. Other than some potentially wasted memory, I don't see any harm in making it larger.
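
For illustration, a minimal standalone sketch of what such a dump-time count could look like; the list type and field layout here are stand-ins, not the actual inode_table_t internals:

#include <stdint.h>
#include <stdio.h>

/* Stand-in for the kernel-style list used by libglusterfs; illustrative only. */
struct list_head { struct list_head *next, *prev; };

/* Walk every dentry hash bucket and report occupancy, which is the kind of
 * number a statedump could print to decide whether 14057 buckets is enough. */
static void dentry_hash_stats(struct list_head *buckets, uint32_t nbuckets)
{
    uint64_t total = 0;
    uint32_t used = 0, longest = 0;

    for (uint32_t i = 0; i < nbuckets; i++) {
        uint32_t len = 0;
        for (struct list_head *p = buckets[i].next; p != &buckets[i]; p = p->next)
            len++;                          /* length of this hash chain */
        if (len) {
            used++;
            total += len;
            if (len > longest)
                longest = len;
        }
    }
    printf("buckets used: %u/%u, avg chain: %.2f, longest chain: %u\n",
           used, nbuckets, used ? (double)total / used : 0.0, longest);
}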

mohit84 commented 6 months ago

How can I confirm for you if the bottleneck is posix_readdirp_fill(), inode_grep() or posix_pstat()?

I don't think you will get an improvement in the posix_pstat function after changing dentry_hashsize, but you can get an improvement in inode_grep. dentry_hashsize does not affect system call performance (like getdents/stat), so it won't help posix_pstat; it does improve inode_grep, so you can get an improvement in readdirp latency. You can generate a statedump and compare the readdirp latency for the posix xlator before and after changing dentry_hashsize. If the readdirp latency at the posix xlator improves, you can assume the dentry_hashsize change is helping.
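
As a rough back-of-envelope illustration (assuming the hash spreads the roughly one million cached dentries from the statedump above uniformly, with about one dentry per cached inode):

    1,048,576 / 14,057 buckets  ≈ 75 dentries per chain
    1,048,576 / 100,003 buckets ≈ 10 dentries per chain

so each inode_grep()-style lookup during readdirp would walk a far shorter list with the larger table.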

mykaul commented 6 months ago

Is the readdir-optimize option enabled, btw? It should be valuable, especially if you have many empty directories.

mohit84 commented 6 months ago

readdir-optimize is only useful for dht subvolumes; here the volume type is replicate, so it would not help.

jkroonza commented 6 months ago

It's enabled regardless.

uriel [08:16:03] ~ # gluster volume get mail cluster.readdir-optimize
Option                                   Value                                  
------                                   -----                                  
cluster.readdir-optimize                 on                                     
uriel [08:16:12] ~ # 

Yea, with the way that maildir works (create a file in tmp/ then rename() it into the appropriate location), any kind of distribute results in absolute chaos over time. We find that a simple replicate remains best. On less IO-intensive workloads (voice recordings), where we typically have three or more nodes, a 3 x (2 + 1) setup works very well, even though we tend to link() and unlink() quite a bit anyway; we've also done a LOT of work there to keep file counts down to <1000/folder as far as possible, and I believe 95%+ of cases are below 100/folder. For the maildir workload, though, we find that as files/folder goes up, the time for getdents() goes up equally badly, and at any point in time we've got upwards of 5000 connected clients, with definite hotspots during the day where sequences of readdir() (getdents in kernel speak) happen.

We're swapping courier-imap for dovecot in a couple of days, which should alleviate the pressure on readdir() + open/read/close cycles quite significantly. But the core readdir problem remains when clients select folders (this basically checks the new/ folder for new files, generates UID values for these files, renames them into cur/, then loops with readdir() over cur/ checking that all files have UID values, generating them where missing, creates a new UID list file in tmp/ and then renames it into .). If that readdir loop over cur/ takes 10 minutes, it creates a really bad experience.

So, I think immediate plan of action:

Patch two of the four fuse mounts to have a larger dentry_hashsize (100003 seems like a sensible larger value). Grab a statedump before the remount; after the remount, let it run for a few hours and then get another statedump so that the two can be compared.

From yesterday's statedump I'm assuming the value we're looking for is the READDIRP line here:

[fuse.latency]
fuse.latency.STAT=AVG:2083251.687083 CNT:70407778 TOTAL:146677122302269 MIN:935 MAX:24749436028
fuse.latency.MKDIR=AVG:89140080.089119 CNT:5891 TOTAL:525124211805 MIN:850328 MAX:8979033647
fuse.latency.UNLINK=AVG:16507060.940034 CNT:29871136 TOTAL:493084662300043 MIN:428193 MAX:29228016565
fuse.latency.RMDIR=AVG:38735277.520000 CNT:100 TOTAL:3873527752 MIN:816635 MAX:557962652
fuse.latency.RENAME=AVG:31105025.969551 CNT:1715257 TOTAL:53353113529455 MIN:640445 MAX:38148340191
fuse.latency.LINK=AVG:14307995.568227 CNT:9736946 TOTAL:139316180216063 MIN:400463 MAX:25176183006
fuse.latency.TRUNCATE=AVG:22629765.326196 CNT:9572897 TOTAL:216632412601850 MIN:633173 MAX:25565104858
fuse.latency.OPEN=AVG:3634377.834486 CNT:577238138 TOTAL:2097901493967117 MIN:152424 MAX:24749725519
fuse.latency.READ=AVG:5284918.291782 CNT:51014487 TOTAL:269607395492193 MIN:155728 MAX:24896879224
fuse.latency.WRITE=AVG:10952421.158628 CNT:105894840 TOTAL:1159804886205515 MIN:217782 MAX:25315461979
fuse.latency.STATFS=AVG:10828980.661874 CNT:73026 TOTAL:790797141814 MIN:189957 MAX:23214431379
fuse.latency.FLUSH=AVG:3156777.677991 CNT:598447687 TOTAL:1889166299766884 MIN:113512 MAX:24791392045
fuse.latency.FSYNC=AVG:298433311.251382 CNT:769654 TOTAL:229690391737871 MIN:3504781 MAX:56811598265
fuse.latency.GETXATTR=AVG:10021966.689076 CNT:119 TOTAL:1192614036 MIN:402208 MAX:341974809
fuse.latency.OPENDIR=AVG:6563794.676099 CNT:55196546 TOTAL:362298794773872 MIN:155104 MAX:24615184835
fuse.latency.CREATE=AVG:22032016.633274 CNT:21141092 TOTAL:465780890589582 MIN:516689 MAX:38192029887
fuse.latency.FSTAT=AVG:773637.410488 CNT:3638314 TOTAL:2814735821503 MIN:857 MAX:15819816798
fuse.latency.LK=AVG:1002324394.152542 CNT:118 TOTAL:118274278510 MIN:1000153928 MAX:1043588392
fuse.latency.LOOKUP=AVG:4967828.197114 CNT:817654431 TOTAL:4061966737817049 MIN:1179 MAX:24427842746
fuse.latency.SETATTR=AVG:17617919.348145 CNT:11164386 TOTAL:196693252119561 MIN:260299 MAX:19191713553
fuse.latency.READDIRP=AVG:54595966.467200 CNT:154514342 TOTAL:8435859834533502 MIN:123815 MAX:45746229900

What's the unit for these measurements? ns?

mohit84 commented 6 months ago

What's the unit for these measurements? ns?

Yes, the latency is in ns. To observe the improvement you can also generate a statedump for a brick process and compare the latency for posix.latency.READDIRP with/without the patch.

jkroonza commented 6 months ago

Is it possible that the counters are wrapping over?

othala [09:42:52] /var/run/gluster # grep fuse.latency.READDIRP glusterdump.*
glusterdump.13548.dump.1713339605:fuse.latency.READDIRP=AVG:132286753.552486 CNT:12994 TOTAL:1718934075661 MIN:154020 MAX:1780521538
glusterdump.24264.dump.1713270610:fuse.latency.READDIRP=AVG:73216287.520039 CNT:7302174 TOTAL:534638071105351 MIN:125582 MAX:14448484036
glusterdump.24264.dump.1713339425:fuse.latency.READDIRP=AVG:54004421.019549 CNT:2369029 TOTAL:127938039523521 MIN:123002 MAX:18264728775

Note the latter two lines, where the later values for CNT and TOTAL are smaller than the earlier values. Time difference = 68815s = 19h06m55s. Or is there something going on that resets the CNT and TOTAL values?

Will overflow even be detected and handled half-properly, by say resetting both values to zero? I can't imagine CNT overflowing a 64-bit unsigned. It's conceivable that with billions of operations over the lifetime of a process the TOTAL value might, but I don't see that happening here either: the larger total value is only about 0.003% of the way to overflowing a 64-bit unsigned, and given the average response time (assuming it's accurate) that's 252x10^9 readdirp operations to reach overflow, which is not going to happen in just over 19h (3.7m readdirp's/s isn't going to happen).

So what's the reset condition?

mohit84 commented 6 months ago

So what's the reset condition?

AFAIK the variables are reset when we generate a statedump.

jkroonza commented 6 months ago

Yes, the latency is in ns. To observe the improvement you can also generate a statedump for a brick process and compare the latency for posix.latency.READDIRP with/without the patch.

I've only patched two of the FUSE mounts so far, and these don't seem to have posix.latency.READDIRP. Having been patched for only the last few minutes:

othala [09:42:56] /var/run/gluster # grep latency.READDIRP glusterdump.13548.dump.1713339605 
fuse.latency.READDIRP=AVG:132286753.552486 CNT:12994 TOTAL:1718934075661 MIN:154020 MAX:1780521538
mail-md-cache.latency.READDIRP=AVG:132377454.727378 CNT:12996 TOTAL:1720377401637 MIN:152415 MAX:1780519112
mail-replicate-0.latency.READDIRP=AVG:132330962.358880 CNT:12996 TOTAL:1719773186816 MIN:149468 MAX:1780360229
mail-client-1.latency.READDIRP=AVG:225891537.299777 CNT:7182 TOTAL:1622353020887 MIN:180411 MAX:1780349049
mail-client-0.latency.READDIRP=AVG:16410607.859477 CNT:5814 TOTAL:95411274095 MIN:146693 MAX:1092824078

It looks like one of the bricks is worse than the other if I read that correctly?

Unpatched brick:

bagheera [10:03:51] /var/run/gluster # grep latency.READDIRP mnt-gluster-mail.28359.dump.1713273125 
mail-server.latency.READDIRP=AVG:63704012.276774 CNT:153468491 TOTAL:9776558634761999 MIN:10767 MAX:61524975965
GlusterFS 4.x v1.latency.READDIRP=AVG:70751660.038682 CNT:153466756 TOTAL:10858027747751353 MIN:12689 MAX:61536195268
mail-marker.latency.READDIRP=AVG:63692357.139715 CNT:153468400 TOTAL:9774764142460646 MIN:7223 MAX:61524968544
mail-upcall.latency.READDIRP=AVG:63688638.363521 CNT:153468488 TOTAL:9774199032428342 MIN:6328 MAX:61524964554
mail-locks.latency.READDIRP=AVG:63686122.480160 CNT:153468554 TOTAL:9773817126897025 MIN:5742 MAX:61524962007
mail-access-control.latency.READDIRP=AVG:63682552.916525 CNT:153468588 TOTAL:9773271476334359 MIN:4645 MAX:61524958348
mail-bitrot-stub.latency.READDIRP=AVG:63469633.148412 CNT:153468567 TOTAL:9740593647302478 MIN:2766 MAX:61524169086
mail-posix.latency.READDIRP=AVG:63467598.160403 CNT:153468597 TOTAL:9740283244636852 MIN:2262 MAX:61524166682

Am I interpreting this correctly as the bulk of the latency comes from the bricks rather than the FUSE mounts? Which honestly is NOT what I expected, but not impossible either.

So this means that a way to tune the value for inode_hashsize on the bricks (rather than using the fixed 2^16) would probably be of benefit here too? Not that (according to /proc) either brick has >1000 open file descriptors with the kernel. Or am I completely missing the boat here as I suspect?

I'll only be able to patch the brick nodes this evening (~12h from now).

mohit84 commented 6 months ago

We should generate the statedumps at the same time; the current data is not comparable. As per the current data, the avg readdirp latency on the brick side is 63ms and at the fuse level 132ms, but strace was not showing a similar level of latency; it was showing > 800ms.

jkroonza commented 6 months ago

Not all cases are as pathological. It's primarily folders with a larger number of entries in them, although other cases do bite on this too sometimes; then it's still 1s or so max and you have the full listing, whereas if you have >500 entries it gets to be around 1s per 500 entries, and that 1s goes up (I think the longest I've observed via strace is around 8s per getdents() call on a folder with 100k entries).

I've not got recent examples, but I also recall that sometimes it would sit nicely in the <500ms/call region, then jump to something like 3.5s for a call, then back down for a few calls, then up again, and back down.

It's this that first made me think it's probably some form of insufficient cache.

What I'm now starting to think (based on the fact that there aren't many thousands of open fds on the brick side) is that glusterfs restarts the underlying readdir() to the OS, and I'm wondering whether that isn't affecting performance too. That said, a find on the folder with 70k entries took a few seconds directly on the brick, whereas it took several minutes through fuse.

Based on "reset at dump time" I've now done a statedump on ALL glusterfs related processes via cssh, will post all the latency values from a similar dump of ALL processes later.

jkroonza commented 6 months ago

The underlying FS performance difference is attributed to dir_index (ext4), which was enabled on only one of the two underlying filesystems.

This is the first case in my experience where dir_index actually seems to improve performance. It's enabled on the other filesystem now too, but obviously it'll take time for existing folders to pick this up, so we'll see if we can get downtime for an e2fsck -D there to force it; I don't think that will be sufficient on its own though.

jkroonza commented 6 months ago

brick 1:

mail-server.latency.READDIRP=AVG:62908438.327520 CNT:518772 TOTAL:32635136368044 MIN:17704 MAX:20515445787
GlusterFS 4.x v1.latency.READDIRP=AVG:69772711.046489 CNT:518769 TOTAL:36195919536876 MIN:20913 MAX:20521718053
mail-marker.latency.READDIRP=AVG:62899513.061695 CNT:518779 TOTAL:32630946486633 MIN:13286 MAX:20515435727
mail-upcall.latency.READDIRP=AVG:62896289.990618 CNT:518780 TOTAL:32629337321333 MIN:11762 MAX:20515431982
mail-locks.latency.READDIRP=AVG:62894112.367854 CNT:518777 TOTAL:32628018931858 MIN:10802 MAX:20515428982
mail-access-control.latency.READDIRP=AVG:62889973.940487 CNT:518778 TOTAL:32625934900898 MIN:9131 MAX:20515425087
mail-bitrot-stub.latency.READDIRP=AVG:62669058.108445 CNT:518777 TOTAL:32511265958325 MIN:6666 MAX:20514961258
mail-posix.latency.READDIRP=AVG:62666271.128670 CNT:518776 TOTAL:32509757471047 MIN:6058 MAX:20514958950

brick 2 (previously -dir_index)

mail-server.latency.READDIRP=AVG:173454938.773763 CNT:587269 TOTAL:101864708438729 MIN:14707 MAX:9173467381
GlusterFS 4.x v1.latency.READDIRP=AVG:189910673.958458 CNT:587263 TOTAL:111527512120866 MIN:17555 MAX:9190961025
mail-marker.latency.READDIRP=AVG:173440402.336457 CNT:587263 TOTAL:101855130997315 MIN:10134 MAX:9173457622
mail-upcall.latency.READDIRP=AVG:173366521.530742 CNT:587244 TOTAL:101808449569799 MIN:8848 MAX:9173453481
mail-locks.latency.READDIRP=AVG:173192747.609798 CNT:587140 TOTAL:101688389831617 MIN:8057 MAX:9173450955
mail-access-control.latency.READDIRP=AVG:173189634.925858 CNT:587142 TOTAL:101686908629638 MIN:6594 MAX:9173447636
mail-bitrot-stub.latency.READDIRP=AVG:172928249.118110 CNT:587138 TOTAL:101532746330709 MIN:4259 MAX:9172565719
mail-posix.latency.READDIRP=AVG:172926674.990346 CNT:587142 TOTAL:101532513807182 MIN:3646 MAX:9172563368

And then the four fuse mounts:

fuse.latency.READDIRP=AVG:142837007.069393 CNT:330999 TOTAL:47278906502962 MIN:159663 MAX:18640419261
mail-md-cache.latency.READDIRP=AVG:142833382.687417 CNT:331000 TOTAL:47277849669535 MIN:158231 MAX:18640417170
mail-replicate-0.latency.READDIRP=AVG:142745121.658482 CNT:330996 TOTAL:47248064288471 MIN:154398 MAX:18640405822
mail-client-1.latency.READDIRP=AVG:210176182.369718 CNT:178988 TOTAL:37619014529991 MIN:154803 MAX:9188055527
mail-client-0.latency.READDIRP=AVG:64136175.735215 CNT:152014 TOTAL:9749596618213 MIN:149544 MAX:18640399767

fuse.latency.READDIRP=AVG:145898955.447362 CNT:285487 TOTAL:41652255093801 MIN:146481 MAX:16059080260
mail-md-cache.latency.READDIRP=AVG:145895706.609395 CNT:285488 TOTAL:41651473488503 MIN:144754 MAX:16059062840
mail-replicate-0.latency.READDIRP=AVG:145759055.076247 CNT:285482 TOTAL:41611586561277 MIN:141776 MAX:16059055710
mail-client-1.latency.READDIRP=AVG:214903639.493610 CNT:148668 TOTAL:31949294276236 MIN:153592 MAX:12110442583
mail-client-0.latency.READDIRP=AVG:70668568.258062 CNT:136816 TOTAL:9668590834795 MIN:138943 MAX:16059049243

fuse.latency.READDIRP=AVG:231156582.926632 CNT:259621 TOTAL:60013103215995 MIN:138840 MAX:9232269769
mail-md-cache.latency.READDIRP=AVG:231154786.872335 CNT:259617 TOTAL:60011712303435 MIN:135535 MAX:9232264425
mail-replicate-0.latency.READDIRP=AVG:231024003.859869 CNT:259622 TOTAL:59978913930107 MIN:129638 MAX:9231719333
mail-client-1.latency.READDIRP=AVG:230872119.728875 CNT:259615 TOTAL:59937865363412 MIN:123106 MAX:9231683928
mail-client-0.latency.READDIRP=AVG:1476876.833333 CNT:6 TOTAL:8861261 MIN:268683 MAX:6039848

fuse.latency.READDIRP=AVG:81685778.571104 CNT:229939 TOTAL:18782746238861 MIN:119917 MAX:20531674569
mail-md-cache.latency.READDIRP=AVG:81680196.738330 CNT:229935 TOTAL:18781136037028 MIN:117264 MAX:20531667406
mail-replicate-0.latency.READDIRP=AVG:81562917.728570 CNT:229938 TOTAL:18754414176672 MIN:111422 MAX:20531349141
mail-client-0.latency.READDIRP=AVG:81442053.112638 CNT:229940 TOTAL:18726785692720 MIN:105928 MAX:20531313060

One thing that bugs me here immediately: why doesn't the fourth FUSE mount also have a mail-client-1? netstat -ntap shows it connected to both bricks. The last two mounts also run on nodes where there are bricks (which I suspect explains the CNT discrepancy between the one client and the other on number 3).

Regardless, based on this it looks like dir_index is a more significant factor than initially deduced, and a fsck -D should help significantly.

It seems that the posix translator (just above the underlying filesystem?) is about the only place where we can even think of gaining any real additional performance, since that is where the bulk of the latency is.

This implies we need to:

  1. Fix the kernel to have better getdents()/readdir() on ext4; or
  2. Improve posix translator readdir performance.

Given the experience with find directly on the brick I'm not inclined to think 1 is a problem.

Looking into the posix_readdirp() and sub-calls.

Under what circumstances would GET_ANCESTRY_DENTRY_KEY on the dict be true? As far as I can tell this will be the case if quotas are in effect, but I'd rather get confirmation. If so it won't apply to me, and the get_ancestry call and the resulting loop to count the returned elements don't apply either? How do quotas affect readdirp() anyhow? This doesn't quite make sense to me.

    /* posix_fill_readdir performs multiple separate individual
       readdir() calls to fill up the buffer.

       In case of NFS where the same anonymous FD is shared between
       different applications, reading a common directory can
       result in the anonymous fd getting re-used unsafely between
       the two readdir requests (in two different io-threads).

       It would also help, in the future, to replace the loop
       around readdir() with a single large getdents() call.
    */

That comment is quite interesting and probably belongs in posix_fill_readdir rather than where it's called from. posix_fill_readdir() (a call to which is the only thing after the comment) has this in it:

entry = sys_readdir(pfd->dir, scratch);

Which I would have said is a HUGE issue, were it not for the fact that glibc already uses getdents() behind the scenes and the above is simply a wrapper for glibc's entry = readdir(pfd->dir). What remains true is that we're basically copying the same data over and over, so calling getdents() with a 256KB buffer and decoding that straight into the eventual return memory may well make sense and may well provide (probably marginal) improvements.
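
For illustration, a rough standalone sketch of that direction, assuming glibc >= 2.30 for the getdents64() wrapper; this is not the actual posix xlator code, and error handling is trimmed:

#define _GNU_SOURCE
#include <dirent.h>                         /* getdents64(), struct dirent64 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Read a directory in 256KB slabs via getdents64() instead of one entry per
 * readdir() call, decoding entries directly out of the slab. */
static int list_dir_getdents(const char *path)
{
    const size_t bufsz = 256 * 1024;
    char *buf = malloc(bufsz);
    int fd = open(path, O_RDONLY | O_DIRECTORY);

    if (!buf || fd < 0)
        return -1;

    for (;;) {
        ssize_t n = getdents64(fd, buf, bufsz);
        if (n <= 0)
            break;                          /* 0 = end of directory, <0 = error */
        for (ssize_t off = 0; off < n;) {
            struct dirent64 *d = (struct dirent64 *)(buf + off);
            /* here the entry could be decoded straight into the readdirp
             * reply instead of being copied again */
            puts(d->d_name);
            off += d->d_reclen;
        }
    }

    close(fd);
    free(buf);
    return 0;
}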

I do have a concern: is it possible for multiple posix_fill_readdir calls to be invoked concurrently for the same pfd by multiple clients? That would result in a lot of seeking, and possible races where different clients get wrong results. I don't see this protected by a lock anywhere, but I may simply be looking in the wrong place.

Otherwise, I'm busy looking through the code to see if I can spot anything obvious.

jkroonza commented 6 months ago
        errno = 0;

        entry = sys_readdir(pfd->dir, scratch);

        if (!entry || errno != 0) {
Is this worth fixing to !entry && errno? readdir will only return NULL on EOF or error; on error errno gets set (on EOF it doesn't get changed, so setting errno=0 before sys_readdir() is correct).
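
For reference, the conventional readdir() pattern that the "!entry && errno" form would express (a standalone sketch, not the gluster code itself):

#include <dirent.h>
#include <errno.h>
#include <stdio.h>

/* NULL with errno still 0 means end of directory; NULL with errno set means a
 * real error, hence errno = 0 before every call. */
static void walk(DIR *dir)
{
    struct dirent *entry;

    for (;;) {
        errno = 0;
        entry = readdir(dir);
        if (!entry) {
            if (errno)
                perror("readdir");          /* genuine error */
            break;                          /* otherwise clean EOF */
        }
        puts(entry->d_name);
    }
}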

mykaul commented 6 months ago

I never got to complete my work, but there were some low-hanging fruits in the readdir(p) paths. For example, in posix_readdirp_fill() we always copy the gfid, and we don't need to. I believe the same is in posix_pstat() - we don't need to memcpy() the stbuf, etc.

There are many other nano-improvements - none would be as useful as figuring out if there's any disk access we are doing that we can skip (re-use previous data or not needed at all). I've had some patches that were never completed in those areas.

jkroonza commented 6 months ago

I read that as "if there are micro-optimizations" please submit a patch and let's take the small wins where we can?

I don't mind taking on slightly bigger things either. It seems, for example, that the readdirp() variation does a second iteration over all the data on top of the normal readdir() pass; this would pretty much trash CPU cache coherency as well, which again is a micro-optimization, but fixing it looks like possibly a HUGE alteration overall.

I think I need to understand this code a bit better before making huge changes, so a "micro optimizations" patch whilst I'm busy figuring all of that out is not unachievable.

mykaul commented 6 months ago

@jkroonza - the community appreciates any patch, big or small. I'm just mentioning that, from my experience, the smaller improvements are all negligible, almost noise-level, when compared to the removal of a disk access (even if it's already in the cache). There are countless such opportunities if you trace the calls that readdir(p) makes.

Code that goes over the results twice (I believe fuse does it as well - see fuse_readdir_cbk() for example) probably does a first pass to see how much memory is needed for all entries and a second to actually populate it.

jkroonza commented 6 months ago

Looks like the dentry_hashsize did make some difference, only dumping the bricks though:

mail-server.latency.READDIRP=AVG:15283214.156492 CNT:738912 TOTAL:11292950338802 MIN:18181 MAX:7193343828
GlusterFS 4.x v1.latency.READDIRP=AVG:18632435.681057 CNT:738901 TOTAL:13767525357169 MIN:21305 MAX:7202006177
mail-marker.latency.READDIRP=AVG:15281446.939934 CNT:738987 TOTAL:11292790629801 MIN:13723 MAX:8809746297
mail-upcall.latency.READDIRP=AVG:15276948.136577 CNT:739019 TOTAL:11289954934945 MIN:12141 MAX:8809742548
mail-locks.latency.READDIRP=AVG:15273970.209427 CNT:739045 TOTAL:11288151313426 MIN:11166 MAX:8809739845
mail-access-control.latency.READDIRP=AVG:15270484.337266 CNT:739040 TOTAL:11285498744613 MIN:9599 MAX:8809736633
mail-bitrot-stub.latency.READDIRP=AVG:15081206.223784 CNT:739046 TOTAL:11145705134863 MIN:7006 MAX:8809481998
mail-posix.latency.READDIRP=AVG:15078848.549045 CNT:739059 TOTAL:11144158729809 MIN:6199 MAX:8809479929

The average here is now 15ms; it was previously around 63ms (with dir_index).

mail-server.latency.READDIRP=AVG:37897526.885809 CNT:901105 TOTAL:34149650964437 MIN:13348 MAX:42279213282
GlusterFS 4.x v1.latency.READDIRP=AVG:60394336.058679 CNT:901088 TOTAL:54420611490443 MIN:16051 MAX:42284910441
mail-marker.latency.READDIRP=AVG:37884734.684326 CNT:901109 TOTAL:34138275386658 MIN:9216 MAX:42279202593
mail-upcall.latency.READDIRP=AVG:37880405.609806 CNT:901110 TOTAL:34134412299052 MIN:7929 MAX:42279198478
mail-locks.latency.READDIRP=AVG:37874540.872889 CNT:901186 TOTAL:34132005991075 MIN:7164 MAX:42279195623
mail-access-control.latency.READDIRP=AVG:37870559.851467 CNT:901188 TOTAL:34128494091424 MIN:5849 MAX:42279191830
mail-bitrot-stub.latency.READDIRP=AVG:37573199.993257 CNT:901187 TOTAL:33860479382323 MIN:3704 MAX:42278385113
mail-posix.latency.READDIRP=AVG:37600557.346989 CNT:901194 TOTAL:33885396677762 MIN:3008 MAX:42278382834

The average here is now 37ms, previously 173ms. dir_index here was enabled before both "stretches"; however, I don't know how much of the filesystem has been dir_indexed in the meantime, so I think part of this could be dir_index kicking in, but I doubt all of it. Will issue an fsck -D late tonight.

Also, we perform a touch .../$(hostname) on all fuse mounts from time to time and alert if the time taken exceeds 5s. This normally went completely nuts during a find over the full filesystem; whilst not completely silent now, it's definitely much better.

Looking at the above figures, my gut feel of a ~4x improvement holds, so for ~7x the amount of "base table" memory we gain a 4x performance boost. Looking at list.h, each list_head is 16 bytes, so at 100003 entries that's 1.6MB of RAM; given that, I'm happy to x10 that at least once more if it'll provide additional gains. I thus think it's critical that this value becomes configurable.

It also seems (non-scientifically) to reduce disk IO requirements. iostat says that both the nvme and the underlying disk are quieter; however, the rrd io graphs say that overall read throughput has increased (write is ~ the same). This is conflicting, as there has been no reduction (rrd graphs) in disk writes on either medium. The iostat %util has gone down drastically though, as have load average values. It could just be the time of day, so I might be comparing apples with oranges; we do have a similar number of client connections online though.

Is there a way to see how full the hash tables are getting, and how many entries are in them (which would also allow calculating average chain length)?

IIRC the hash chains are doubly-linked lists, and a previous suggestion was that the overhead of maintaining this is higher than the gain compared to singly-linked lists (I think the big benefit is when unlinking entries; to remove from a singly-linked list you need to re-iterate to find prev, which is OK on short chains). I suspect this would be more of a memory saving than any real performance gain though.
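
A small standalone sketch of the trade-off being described (illustrative types, not list.h itself):

/* Doubly-linked (list.h style): unlink is O(1) but costs two pointers per node. */
struct dnode { struct dnode *next, *prev; };

static void dlist_unlink(struct dnode *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
}

/* Singly-linked: one pointer per node, but unlink must re-walk the chain to
 * find the predecessor, which is only cheap while chains stay short. */
struct snode { struct snode *next; };

static void slist_unlink(struct snode **head, struct snode *n)
{
    struct snode **pp = head;
    while (*pp && *pp != n)
        pp = &(*pp)->next;
    if (*pp)
        *pp = n->next;
}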

mohit84 commented 6 months ago

Is there a way to see how full the hash tables are getting, and how many entries are in them (which would also allow calculating average chain length)?

Thanks for sharing the data; it seems like a big improvement. Do we have any data on how much improvement we are getting without the dir_index change? The statedump should print the details about the hash table; if it is not printing what you need, we have to change the code. To troubleshoot further you can capture perf data and generate a flamegraph; that data can be useful to judge the bottleneck, i.e. whether posix_fill_readdir or posix_readdirp_fill is taking the time. To capture the perf data you can use the command below:

perf record -p $(pidof glusterfsd) -F 99 --call-graph dwarf -- sleep 180

To generate a flamegraph, follow these steps:

git clone https://github.com/brendangregg/FlameGraph
perf script | ./stackcollapse-perf.pl > out.perf-folded
./flamegraph.pl out.perf-folded > perf.svg

jkroonza commented 6 months ago
243M Apr 17 16:10 mnt-gluster-mail.22478.dump.1713363044
240M Apr 17 16:10 mnt-gluster-mail.6597.dump.1713363045

What am I looking for?

On the fuse mounts I can see these sections which are semi useful but also not useful enough:

[xlator.mount.fuse.itable]
xlator.mount.fuse.itable.dentry_hashsize=100003
xlator.mount.fuse.itable.inode_hashsize=2097152
xlator.mount.fuse.itable.name=meta-autoload/inode
xlator.mount.fuse.itable.lru_limit=2097152
xlator.mount.fuse.itable.active_size=4702
xlator.mount.fuse.itable.lru_size=1707093
xlator.mount.fuse.itable.purge_size=0
xlator.mount.fuse.itable.invalidate_size=0

On the bricks I'm seeing:

[GlusterFS Handshake]
GlusterFS Handshake.program-number=14398633
GlusterFS Handshake.program-version=2
GlusterFS Handshake.latency.SETVOLUME=AVG:707064.200000 CNT:5 TOTAL:3535321 MIN:362821 MAX:2294532
conn.0.bound_xl./mnt/gluster/mail.dentry_hashsize=100003
conn.0.bound_xl./mnt/gluster/mail.inode_hashsize=65536
conn.0.bound_xl./mnt/gluster/mail.name=/mnt/gluster/mail/inode
conn.0.bound_xl./mnt/gluster/mail.lru_limit=1048576
conn.0.bound_xl./mnt/gluster/mail.active_size=5597
conn.0.bound_xl./mnt/gluster/mail.lru_size=1047510
conn.0.bound_xl./mnt/gluster/mail.purge_size=0
conn.0.bound_xl./mnt/gluster/mail.invalidate_size=0

But that doesn't say anything about how utilized the dentry_hashsize buckets are, only inode_hashsize. And this again shows why I patched the fuse mount to support setting inode_hashsize larger: those chains here average a length of 8, and lowering the chain length would also improve performance. What I find interesting is that active_size is actually really tiny, but experience has shown that lowering lru_limit causes excessive disk IO, to the point that even with a 97%-effective NVMe-based dm-cache things stall badly.

So as it stands I think:

  1. Upper limit on network.inode-lru-limit should be removed (or set to something insanely stupid large like 64m).
  2. Ability to set network.inode-lru-size (same upper limit as inode-lru-limit).
  3. Ability to set the dentry_hashsize (network.dentry-hash-size ?).
  4. Output details around dentry_hash usage.

Item 1 is trivial, and I think 2+3 should be fairly easy (does 3 require a prime validator, or do we just document that the number should be prime without enforcing it?). I'm not sure about 4, since there may be no counter available to base it on.
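
If a validator is wanted for point 3, a trivial trial-division check (or a round-up-to-the-next-prime helper) would be enough; a standalone sketch, with names of my own choosing:

#include <stdbool.h>
#include <stdint.h>

/* Trial division is plenty for validating a hash-table size option: it only
 * runs when the option is set and the values involved are small. */
static bool is_prime(uint64_t n)
{
    if (n < 2)
        return false;
    if (n % 2 == 0)
        return n == 2;
    for (uint64_t d = 3; d * d <= n; d += 2)
        if (n % d == 0)
            return false;
    return true;
}

/* Alternative to rejecting non-prime input: quietly round up to the next
 * prime, so e.g. 100000 would become 100003. */
static uint64_t next_prime(uint64_t n)
{
    while (!is_prime(n))
        n++;
    return n;
}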

I'll check if I can get the perf stuff working; the package I've got is https://perf.wiki.kernel.org/ - is that the correct one? I was contemplating recompiling with gprof and working from there, but you're more familiar with perf, so I think it's more useful for me to learn a new tool than to blatantly waste your time.

mohit84 commented 6 months ago

I'm not sure about 4 (output details around dentry_hash usage), since there may be no counter available to base it on.

It is true that currently there is no counter available specific to dentry usage; I think it is not difficult to generate a new counter and save the number.

mykaul commented 6 months ago

It is true that currently there is no counter available specific to dentry usage; I think it is not difficult to generate a new counter and save the number.

Just like the other counters, I think it's the wrong approach: count when you dump. All those 'inode->table->active_size++' and 'inode->table->active_size--' updates are not needed; as we take a lock when dumping anyway, let's just count the number of entries in the purge, lru and active lists.
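
A minimal sketch of that dump-time counting (the list layout is the kernel-style one from list.h; the active/lru/purge field names are assumed from the statedump output rather than checked against inode.c):

struct list_head { struct list_head *next, *prev; };

static unsigned long count_list(const struct list_head *head)
{
    unsigned long n = 0;
    for (const struct list_head *p = head->next; p != head; p = p->next)
        n++;
    return n;
}

/* At statedump time, under the itable lock the dump already holds:
 *   active_size = count_list(&table->active);
 *   lru_size    = count_list(&table->lru);
 *   purge_size  = count_list(&table->purge);
 * so the ++/-- bookkeeping on every insert/remove can be dropped. */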

mohit84 commented 6 months ago

As we take a lock when dumping anyway, let's just count the number of entries in the purge, lru and active lists.

Yes, I was referring to the same thing: we don't need to generate a counter; during the dump we can count the entries.

jkroonza commented 6 months ago

4339 starts with the four points I mentioned. I think 1 is complete, and I've started with 2, which is as far as I'm going to get today; I would like to know if this looks OK so far at least.