gluster / glusterfs

Gluster Filesystem : Build your distributed storage in minutes
https://www.gluster.org
GNU General Public License v2.0

Consistent slow write speed regardless of settings changed, low CPU%, low disk usage, low memory usage. #3969

Open firsttimegithubuser opened 1 year ago

firsttimegithubuser commented 1 year ago

Description of problem: I'm unsure what is causing such slow write speeds. The speeds are slower than in most posts I've seen on the mailing list and other online resources. I'm using rsync to transfer 98 files totaling 7.4 MB from ServerB's OS drive to the same server, via either the glusterfs mount or nfs-ganesha.

Nothing seems to affect the transfer speeds: not the glusterfs volume settings, fuse mount options, number of threads, direct-io mode, or whether I use nfs-ganesha or the glusterfs/fuse mount (nfs-ganesha actually adds a minute to transfer times). Even after changing some kernel/sysctl options, reboots, and upgrades, the transfer speeds are about the same or slower under almost every circumstance.

I have a new 3-node replica cluster. Each node is newly provisioned with 5 disks: 1 OS drive and 4 drives in RAID-10 to store the glusterfs volume data.

Temporarily, ServerA has 4x HDD drives in RAID-10; we also tested removing that node from the replica group and running with just the 2 servers that have 4x SSD drives in RAID-10, with little improvement in transfer speed. ServerA uses an LSI MegaRAID SAS-3 3108 RAID card; ServerB and ServerC use the LSI MegaRAID SAS 9361-8i. All nodes use SuperMicro boards. CPU for all nodes: 1x Intel(R) Xeon(R) E-2236 CPU @ 3.40GHz

The RAID-10 array is created using LVM and mounted with this /etc/fstab entry:

/dev/vg_data/lv_data    /var/lib/gvol0  xfs     rw,nouuid,inode64,noatime,nodiratime,swalloc,logbsize=256k      1       2
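
For reference, the options actually applied to the brick filesystem can be confirmed with standard tools (a minimal sketch; the path follows the fstab entry above):

# show the active mount options for the brick filesystem
findmnt -no OPTIONS /var/lib/gvol0
# show the XFS geometry (log size, sector size, etc.)
xfs_info /var/lib/gvol0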

I'm using ServerB to mount ServerB's own volume. The /etc/fstab entry for the gluster mount:

10.0.2.50:/test /mnt glusterfs defaults,_netdev,negative-timeout=10,attribute-timeout=30,fopen-keep-cache,fetch-attempts=5 0 0
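
The same mount can also be made ad hoc for testing; a sketch equivalent to the fstab entry above, using the documented mount.glusterfs options:

mount -t glusterfs -o negative-timeout=10,attribute-timeout=30,fopen-keep-cache,fetch-attempts=5 10.0.2.50:/test /mnt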

I tried testing from a non-glusterfs server but speeds didn't improve much that I could see.

sent 8,177,725 bytes  received 1,989 bytes  33,049.35 bytes/sec
total size is 8,166,877  speedup is 1.00

real    4m7.201s
user    0m0.140s
sys     0m0.184s
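
To separate raw throughput from per-file overhead, a single large sequential write can be compared against a loop of small files on the same mount. This is only a sketch; the file names are hypothetical:

# one large sequential write through the fuse mount
dd if=/dev/zero of=/mnt/ddtest bs=1M count=100 conv=fsync

# 100 small files, roughly matching the rsync workload
time sh -c 'for i in $(seq 1 100); do dd if=/dev/zero of=/mnt/small_$i bs=64k count=1 conv=fsync 2>/dev/null; done'

If the large write is fast while the small-file loop is slow, per-file round trips (latency) dominate rather than disk or link bandwidth.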

Each node is in its own geographically separate data center. Each data center is connected by two network connections, one with a public IP and one with a private IP. We're using a VLAN over WAN for the private IP network. The firewall is set up to allow these three nodes to connect freely with each other over the private IPs (for now).

Sometimes the network between them may become severed for an undetermined amount of time but we're usually aware of it when it happens. There were no reports of network outages with these nodes or any of our other servers in the same data centers at the time of testing.

Each node's /etc/hosts file is updated with the hostname/internal IP of each node.

ServerA ping to ServerB avg time=93.2 ms
ServerA ping to ServerC avg time=34.7 ms

ServerB ping to ServerA avg time=93.2 ms
ServerB ping to ServerC avg time=59.8 ms

ServerC ping to ServerA avg time=34.7 ms
ServerC ping to ServerB avg time=59.7 ms

I've tried the Ubuntu 20.04 focal glusterfs package 7.2-2build1, the gluster.org repo version 7.x (by accident), and version 10.3 from the gluster.org repo. With Ubuntu focal and Gluster 7.9, some settings caused the glusterfsd process on every node to spike to 100% CPU. Sometimes the glustershd process would suddenly use 100% CPU; I assume it was trying to heal a stalled transfer. I eventually realized I wasn't running the latest version of GlusterFS, so I upgraded from 7.9 to 10.3 and from Ubuntu 20.04 focal to Ubuntu 22.04 jammy. The jammy upgrade seems to have stopped glusterd from hitting 100%; now we usually see 0.7-1.0% CPU usage on glusterfsd/glusterd.

The exact command to reproduce the issue (I also tried rsync with the -W flag and with -B=131072, with no real change in transfer speed or time):

> time rsync -av --progress /home/shares/folder/ /mnt/folder/
...
sent 7,476,798 bytes  received 1,913 bytes  37,300.30 bytes/sec
total size is 7,466,443  speedup is 1.00

real    3m20.111s
user    0m0.033s
sys     0m0.116s
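
For completeness, the variants mentioned above written out (using the long form --block-size=131072 in place of -B=131072):

time rsync -avW --progress /home/shares/folder/ /mnt/folder/
time rsync -av --block-size=131072 --progress /home/shares/folder/ /mnt/folder/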

The cp command takes approximately as long:

real    2m51.661s
user    0m0.007s
sys     0m0.060s

rsync over ssh from ServerB to the gvol0 directory itself on ServerA (the slowest server by ping), bypassing the nfs/fuse mount but using the slower RAID-10 4x HDD drives:

time rsync -av --progress /home/shares/folder/ user@ServerA:/var/lib/gvol0/folder/
sent 7,476,798 bytes  received 1,922 bytes  2,136,777.14 bytes/sec
total size is 7,466,443  speedup is 1.00

real    0m3.417s
user    0m0.115s
sys     0m0.048s

rsync from ServerB to ServerC, same test as above, over ssh instead of the nfs/fuse mount:

time rsync -av --progress /home/shares/folder/ user@ServerC:/var/lib/gvol0/folder/
sent 7,476,798 bytes  received 1,922 bytes  2,991,488.00 bytes/sec
total size is 7,466,443  speedup is 1.00

real    0m2.213s
user    0m0.068s
sys     0m0.058s

The full output of the command that failed: No failed command.

Expected results: At least 1,000,000 bytes/sec speeds

Mandatory info: - The output of the gluster volume info command:

# gluster v info test

Volume Name: test
Type: Replicate
Volume ID: 94ff825e-8204-48bc-abd1-4eff9a98d6bf
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: ServerA:/var/lib/gvol0/test
Brick2: ServerB:/var/lib/gvol0/test
Brick3: ServerC:/var/lib/gvol0/test
Options Reconfigured:
performance.io-cache-pass-through: true
performance.iot-pass-through: true
performance.nl-cache-pass-through: true
performance.write-behind-pass-through: true
performance.md-cache-pass-through: true
performance.read-after-open: yes
performance.read-ahead: off
performance.quick-read: off
performance.open-behind: on
performance.low-prio-threads: 16
network.compression: on
config.global-threading: on
features.cache-invalidation: on
network.remote-dio: enable
client.event-threads: 4
performance.strict-o-direct: on
server.event-threads: 4
cluster.granular-entry-heal: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: on
performance.write-behind: off
performance.io-thread-count: 64
network.ping-timeout: 20
performance.io-cache: off
performance.readdir-ahead: off
cluster.readdir-optimize: on
performance.least-prio-threads: 16
performance.high-prio-threads: 16
performance.normal-prio-threads: 16
performance.nl-cache-positive-entry: on
performance.aggregate-size: 256KB
features.cache-invalidation-timeout: 600
performance.nl-cache: on
performance.nl-cache-timeout: 600
network.inode-lru-limit: 200000
performance.write-behind-window-size: 4MB
server.outstanding-rpc-limit: 512
performance.cache-size: 2GB
performance.stat-prefetch: off
performance.cache-invalidation: on
performance.md-cache-timeout: 600
performance.parallel-readdir: on
performance.rda-cache-limit: 128MB
cluster.enable-shared-storage: disable
...
(This is not a complete list of every setting tested on previous test volumes.)
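
When re-testing, tuned options can be rolled back with the standard reset sub-command (a sketch; "test" is the volume name above):

# reset a single option to its default
gluster volume reset test performance.write-behind
# or reset all reconfigured options on the volume
gluster volume reset test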

- The output of the gluster volume status command:

gluster v status test

Status of volume: test
Gluster process                            TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick ServerA:/var/lib/gvol0/test          60660     0          Y       179250
Brick ServerB:/var/lib/gvol0/test          59604     0          Y       187054
Brick ServerC:/var/lib/gvol0/test          59573     0          Y       175714
Self-heal Daemon on localhost              N/A       N/A        Y       187097
Self-heal Daemon on ServerC                N/A       N/A        Y       175757
Self-heal Daemon on ServerA                N/A       N/A        Y       179293

Task Status of Volume test

There are no active volume tasks

- The output of the gluster volume heal command:

gluster v heal test info

Brick ServerA:/var/lib/gvol0/test
Status: Connected
Number of entries: 0

Brick ServerB:/var/lib/gvol0/test
Status: Connected
Number of entries: 0

Brick ServerC:/var/lib/gvol0/test
Status: Connected
Number of entries: 0

- Provide logs present on following locations of client and server nodes: /var/log/glusterfs/

glusterd.log glustershd.log mnt.log var-lib-gvol0-test.log

- Is there any crash? Provide the backtrace and coredump: No crash.

Additional info:

# ps aux | grep gluster
root      251637  0.0  0.0 547656 23044 ?        SLsl 20:24   0:00 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
root      251723  0.3  0.0 1327984 25412 ?       SLsl 20:25   0:01 /usr/sbin/glusterfsd -s ServerB --volfile-id test.ServerB.var-lib-gvol0-test -p /var/run/gluster/vols/test/ServerB-var-lib-gvol0-test.pid -S /var/run/gluster/f9414b10f5fe797d.socket --brick-name /var/lib/gvol0/test -l /var/log/glusterfs/bricks/var-lib-gvol0-test.log --xlator-option *-posix.glusterd-uuid=3ec449cd-a1e8-42dd-85d3-53b85a95bb2c --process-name brick --brick-port 50126 --xlator-option test-server.listen-port=50126
root      251757  0.0  0.0 684180 16204 ?        SLsl 20:25   0:00 /usr/sbin/glusterfs -s localhost --volfile-id shd/test -p /var/run/gluster/shd/test/test-shd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/a42bb6b1c3f28291.socket --xlator-option *replicate*.node-uuid=3ec449cd-a1e8-42dd-85d3-53b85a95bb2c --process-name glustershd --client-pid=-6
root      251906  0.5  0.0 752288 20372 ?        SLsl 20:25   0:02 /usr/sbin/glusterfs --volfile-max-fetch-attempts=5 --fopen-keep-cache=true --attribute-timeout=30 --negative-timeout=10 --process-name fuse --volfile-server=10.0.2.50 --volfile-id=/test /mnt

top -H during rsync over glusterfs mount

 232515 root      20   0  817820  21116   7804 S   0.7   0.0   0:00.35 glfs_epoll000
    555 root      rt   0  289480  27100   9072 S   0.3   0.0   1:01.41 multipathd
  26743 root      rt   0  237896 172296  60704 S   0.3   0.3  42:05.01 corosync
 187078 root      20   0 2655968  29212   9324 S   0.3   0.0   0:19.19 glfs_worker/4
 187079 root      20   0 2655968  29212   9324 S   0.3   0.0   0:00.53 glfs_epoll000
 187155 root      20   0 2655968  29212   9324 S   0.3   0.0   0:00.52 glfs_tpw12
 232516 root      20   0  817820  21116   7804 S   0.3   0.0   0:00.56 glfs_epoll001
 232520 root      20   0  817820  21116   7804 S   0.3   0.0   0:01.07 glfs_fuseproc
 251205 root      20   0   13488   4444   3372 R   0.3   0.0   0:00.28 top
      1 root      20   0  167888  13440   8312 S   0.0   0.0   0:41.52 systemd


# netstat -tapn|grep gluster
tcp        0      0 0.0.0.0:59604           0.0.0.0:*               LISTEN      187054/glusterfsd
tcp        0      0 0.0.0.0:24007           0.0.0.0:*               LISTEN      175261/glusterd
tcp        0    588 10.0.2.50:49137         10.0.4.50:60660         ESTABLISHED 232493/glusterfs
tcp        0      0 10.0.2.50:24007         10.0.2.50:49147         ESTABLISHED 175261/glusterd
tcp        0      0 10.0.2.50:49147         10.0.2.50:24007         ESTABLISHED 232493/glusterfs
tcp        0      0 10.0.2.50:24007         10.0.1.50:49148         ESTABLISHED 175261/glusterd
tcp        0      0 10.0.2.50:24007         10.0.4.50:49149         ESTABLISHED 175261/glusterd
tcp        0      0 127.0.0.1:24007         127.0.0.1:49150         ESTABLISHED 175261/glusterd
tcp        0      0 10.0.2.50:24007         10.0.2.50:49151         ESTABLISHED 175261/glusterd
tcp        0      0 10.0.2.50:59604         10.0.1.50:49143         ESTABLISHED 187054/glusterfsd
tcp        0      0 10.0.2.50:49144         10.0.1.50:59573         ESTABLISHED 187097/glusterfs
tcp        0      0 10.0.2.50:49151         10.0.2.50:24007         ESTABLISHED 187054/glusterfsd
tcp        0      0 10.0.2.50:59604         10.0.2.50:49138         ESTABLISHED 187054/glusterfsd
tcp        0      0 10.0.2.50:49143         10.0.4.50:60660         ESTABLISHED 187097/glusterfs
tcp        0    588 10.0.2.50:49139         10.0.1.50:59573         ESTABLISHED 232493/glusterfs
tcp        0      0 10.0.2.50:49149         10.0.4.50:24007         ESTABLISHED 175261/glusterd
tcp        0      0 10.0.2.50:59604         10.0.2.50:49150         ESTABLISHED 187054/glusterfsd
tcp        0      0 10.0.2.50:49148         10.0.1.50:24007         ESTABLISHED 175261/glusterd
tcp        0      0 10.0.2.50:49138         10.0.2.50:59604         ESTABLISHED 232493/glusterfs
tcp        0      0 10.0.2.50:49150         10.0.2.50:59604         ESTABLISHED 187097/glusterfs
tcp        0      0 127.0.0.1:49150         127.0.0.1:24007         ESTABLISHED 187097/glusterfs
tcp        0      0 10.0.2.50:59604         10.0.4.50:49143         ESTABLISHED 187054/glusterfsd

random snippet of iotop

 251990 be/4 root        0.00 B/s  134.17 K/s  ?unavailable?  glusterfsd -s ServerB --volfile-id test.ServerB.var-lib-gvol0-test -p /var/ru~ brick --brick-port 50126 --xlator-option test-server.listen-port=50126 [glfs_rpcrqhnd]

I also tried rsync with global-threading turned off:

sent 7,476,798 bytes  received 1,913 bytes  35,697.90 bytes/sec
total size is 7,466,443  speedup is 1.00

real    3m28.816s
user    0m0.043s
sys     0m0.104s

gluster volume profile with just the 2-node replica (the slowest server, ServerA, removed) during an rsync transfer over the glusterfs mount:

# gluster v profile test info
Brick: ServerC:/var/lib/gvol0/test
---------------------------------------
Cumulative Stats:
   Block Size:                256b+                1024b+                2048b+
 No. of Reads:                    0                     0                     0
No. of Writes:                    1                     1                     1

   Block Size:               4096b+                8192b+               16384b+
 No. of Reads:                    0                     0                     0
No. of Writes:                    7                    14                    29

   Block Size:              32768b+               65536b+
 No. of Reads:                    0                     0
No. of Writes:                   22                    52
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
      0.00       0.00 ns       0.00 ns       0.00 ns            102     RELEASE
      0.00       0.00 ns       0.00 ns       0.00 ns              4  RELEASEDIR
      0.04  102582.00 ns  101049.00 ns  104115.00 ns              2    GETXATTR
      0.14  723812.00 ns  723812.00 ns  723812.00 ns              1       MKDIR
      0.53   27459.89 ns   25033.00 ns   49705.00 ns            102       FLUSH
      1.24   31876.05 ns   14530.00 ns   44415.00 ns            204    FINODELK
      2.23   19208.52 ns    5096.00 ns   54333.00 ns            610     ENTRYLK
      3.52   61175.47 ns   16214.00 ns  113272.00 ns            303        STAT
      5.21   33529.82 ns    5736.00 ns   67552.00 ns            817     INODELK
      6.82  354995.10 ns   89002.00 ns  457720.00 ns            101      RENAME
      7.89  203277.65 ns   87804.00 ns  369912.00 ns            204    FXATTROP
     10.71  552172.53 ns  142296.00 ns 1232477.00 ns            102      CREATE
     13.16  146026.82 ns   34843.00 ns  305334.00 ns            474      LOOKUP
     18.00  232066.40 ns   62259.00 ns  344174.00 ns            408     SETATTR
     30.52 1263659.54 ns  225094.00 ns 3272255.00 ns            127       WRITE
      0.00       0.00 us       0.00 us       0.00 us              3      UPCALL
      0.00       0.00 us       0.00 us       0.00 us              3     CI_IATT

    Duration: 254 seconds
   Data Read: 0 bytes
Data Written: 7818496 bytes

Interval 5 Stats:
   Block Size:                256b+               65536b+
 No. of Reads:                    0                     0
No. of Writes:                    1                     5
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
      0.00       0.00 ns       0.00 ns       0.00 ns              3     RELEASE
      0.35   26720.67 ns   25987.00 ns   27873.00 ns              3       FLUSH
      0.79   30220.50 ns   15480.00 ns   37252.00 ns              6    FINODELK
      1.20   21250.77 ns    6207.00 ns   48147.00 ns             13     ENTRYLK
      2.64   67520.44 ns   50088.00 ns   84189.00 ns              9        STAT
      2.71   32893.11 ns   11365.00 ns   37517.00 ns             19     INODELK
      3.13  360579.00 ns  353077.00 ns  368081.00 ns              2      RENAME
      5.17  148917.75 ns   47998.00 ns  271313.00 ns              8      LOOKUP
      5.38  206541.17 ns  171719.00 ns  230029.00 ns              6    FXATTROP
      7.70  591799.00 ns  567109.00 ns  611256.00 ns              3      CREATE
      9.00  230505.89 ns  115931.00 ns  344174.00 ns              9     SETATTR
     61.94 2379513.17 ns  290192.00 ns 3245498.00 ns              6       WRITE

    Duration: 4 seconds
   Data Read: 0 bytes
Data Written: 605245 bytes

Brick: ServerB:/var/lib/gvol0/test
---------------------------------------
Cumulative Stats:
   Block Size:                256b+                1024b+                2048b+
 No. of Reads:                    0                     0                     0
No. of Writes:                    1                     1                     1

   Block Size:               4096b+                8192b+               16384b+
 No. of Reads:                    0                     0                     0
No. of Writes:                    7                    14                    29

   Block Size:              32768b+               65536b+
 No. of Reads:                    0                     0
No. of Writes:                   22                    52
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
      0.00       0.00 ns       0.00 ns       0.00 ns            102     RELEASE
      0.00       0.00 ns       0.00 ns       0.00 ns              4  RELEASEDIR
      0.04   78752.00 ns   69788.00 ns   87716.00 ns              2    GETXATTR
      0.14  620627.00 ns  620627.00 ns  620627.00 ns              1       MKDIR
      0.57   24076.24 ns   12840.00 ns   45073.00 ns            102       FLUSH
      1.30   27668.04 ns   14663.00 ns   46119.00 ns            204    FINODELK
      1.69   70325.96 ns   65241.00 ns   97603.00 ns            104        STAT
      2.47   17552.59 ns    1434.00 ns   60654.00 ns            610     ENTRYLK
      5.43   28788.65 ns    9649.00 ns   72360.00 ns            817     INODELK
      6.87  294673.65 ns  148965.00 ns  377861.00 ns            101      RENAME
      8.25  175233.37 ns   50765.00 ns  463201.00 ns            204    FXATTROP
     10.96  465471.14 ns  158565.00 ns 2408443.00 ns            102      CREATE
     14.00  127929.48 ns   50676.00 ns  281124.00 ns            474      LOOKUP
     18.50  195939.17 ns   77393.00 ns  262767.00 ns            409     SETATTR
     29.77 1015228.72 ns  217151.00 ns 2521481.00 ns            127       WRITE
      0.00       0.00 us       0.00 us       0.00 us              3      UPCALL
      0.00       0.00 us       0.00 us       0.00 us              3     CI_IATT

    Duration: 254 seconds
   Data Read: 0 bytes
Data Written: 7818496 bytes

Interval 5 Stats:
   Block Size:                256b+               65536b+
 No. of Reads:                    0                     0
No. of Writes:                    1                     5
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
      0.00       0.00 ns       0.00 ns       0.00 ns              3     RELEASE
      0.39   23135.33 ns   22375.00 ns   23577.00 ns              3       FLUSH
      0.79   69578.50 ns   69262.00 ns   69895.00 ns              2        STAT
      1.00   29467.67 ns   19494.00 ns   38894.00 ns              6    FINODELK
      1.20   16356.00 ns    6906.00 ns   22548.00 ns             13     ENTRYLK
      3.11   28968.74 ns   23521.00 ns   35407.00 ns             19     INODELK
      3.35  296084.50 ns  294079.00 ns  298090.00 ns              2      RENAME
      5.47  483866.50 ns  457122.00 ns  510611.00 ns              2      CREATE
      5.87  173101.00 ns  152088.00 ns  197443.00 ns              6    FXATTROP
      6.25  138338.25 ns   95109.00 ns  226574.00 ns              8      LOOKUP
     11.11  196662.60 ns  175508.00 ns  222490.00 ns             10     SETATTR
     61.47 1813708.83 ns  217151.00 ns 2479697.00 ns              6       WRITE

    Duration: 4 seconds
   Data Read: 0 bytes
Data Written: 605245 bytes
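
For reference, a profile run like the one above is captured with the standard profile sub-commands (sketch; volume name as above):

gluster volume profile test start
# ... run the rsync workload ...
gluster volume profile test info
gluster volume profile test stop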

Removing ServerA resulted in a slightly faster transfer, but I feel it is still abysmal. At least the transfer time is under 3 minutes, but it will take far too long for a few hundred gigabytes of small files (averaging less than 100 KB in size).

sent 8,177,725 bytes  received 1,989 bytes  50,963.95 bytes/sec
total size is 8,166,877  speedup is 1.00

real    2m40.876s
user    0m0.105s
sys     0m0.174s

- The operating system / glusterfs version: Ubuntu 22.04LTS Jammy / glusterfs 10.3

Note: Please hide any confidential data which you don't want to share in public like IP address, file name, hostname or any other configuration

stale[bot] commented 9 months ago

Thank you for your contributions. Noticed that this issue is not having any activity in last ~6 months! We are marking this issue as stale because it has not had recent activity. It will be closed in 2 weeks if no one responds with a comment here.

bertysentry commented 8 months ago

Feeling bad for @firsttimegithubuser...

krmeljalen commented 7 months ago

@firsttimegithubuser The reason you are seeing very poor performance is the latency between your servers. You will not get better performance no matter what you do. In a replica 3 volume, a write is only as fast as the slowest server: if you flush, say, 1 KB per operation and your RTT is 100 ms, you can complete about 10 writes per second, i.e. roughly 10 x 1 KB = 10 KB/s. Redesign your setup, because you are doing this fundamentally wrong.
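
A rough back-of-the-envelope sketch of that point with the numbers reported above, assuming each small file costs on the order of ten synchronous FOPs (LOOKUP, CREATE, WRITE, FXATTROP, SETATTR, RENAME, FLUSH plus locking, as seen in the profile output) and each FOP waits on the ~93 ms round trip to the slowest replica:

# 98 files x ~10 round trips x 0.093 s per round trip
echo '98 * 10 * 0.093' | bc -l    # ≈ 91 seconds of pure WAN latency

That alone is the same order of magnitude as the observed 3-4 minute transfer times, before any disk or bandwidth cost.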

krmeljalen commented 7 months ago

The reason you see better performance with just two nodes is that a write only needs confirmation from two nodes instead of three, which results in slightly better performance. But the latency is still far too high for network-attached storage; this is not how it is supposed to be used.

milkbite commented 3 months ago

I'm having the same problem. I had to remove a customer's VM disk from the Gluster storage mountpoint and place it directly on the XFS disk. Apparently I'm suffering from latency on my worst replica 3 server (an SFF PC with a single gigabit network card). Recalculating...

mykaul commented 3 months ago

> @firsttimegithubuser The reason you are seeing very poor performance is the latency between your servers. You will not get better performance no matter what you do. In a replica 3 volume, a write is only as fast as the slowest server: if you flush, say, 1 KB per operation and your RTT is 100 ms, you can complete about 10 writes per second, i.e. roughly 10 x 1 KB = 10 KB/s. Redesign your setup, because you are doing this fundamentally wrong.

You may consider enabling compression, btw. It might help a bit.

milkbite commented 1 month ago

After removing the worst server, the SFF PC, I realized that my problem was with server2's HDD. I kept a replica 2 volume using two PowerEdge T130 servers and replaced the HDD with an Intel SSD. The Gluster volume, now composed of 2 x SSD bricks, has a write rate above 35 MB/s.