As per strace, these latencies are indeed coming from XFS:
```
pranith.karampuri@PP-CNPZ5M2 ~/D/strace> grep lgetxattr strace.15* | awk '{print $NF}' | cut -f2 -d'<' | cut -f1 -d'>' | sort -n | tail -5
0.082245
0.084232
0.095902
0.109003
0.216018
pranith.karampuri@PP-CNPZ5M2 ~/D/strace> grep lstat strace.15* | awk '{print $NF}' | cut -f2 -d'<' | cut -f1 -d'>' | sort -n | tail -5
0.108165
0.111372
0.129129
0.153885
0.350215
```
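For context, the pipeline above pulls the trailing `<seconds>` duration field that strace appends to each syscall line and lists the five slowest calls. The capture command itself isn't shown in the thread; a plausible invocation (the brick PID is a placeholder) would be:

```
# Hypothetical capture: -ff writes one output file per process/thread (strace.<pid>, matching strace.15*),
# -T appends the time spent in each syscall, -tt adds timestamps.
strace -ff -T -tt -e trace=lgetxattr,lstat -p <brick_pid> -o strace
```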
We're also seeing the following distribution on the XFS layer
```
disk = 'sdb'
     usecs               : count
         0 -> 1          : 0
         2 -> 3          : 18
         4 -> 7          : 48
         8 -> 15         : 123
        16 -> 31         : 111
        32 -> 63         : 114
        64 -> 127        : 2940
       128 -> 255        : 6369
       256 -> 511        : 7613
       512 -> 1023       : 917
      1024 -> 2047       : 1128
      2048 -> 4095       : 421
      4096 -> 8191       : 370
      8192 -> 16383      : 2059
     16384 -> 32767      : 2615
     32768 -> 65535      : 108
     65536 -> 131071     : 24
    131072 -> 262143     : 49
    262144 -> 524287     : 12

disk = 'sdc'
     usecs               : count
         0 -> 1          : 0
         2 -> 3          : 25
         4 -> 7          : 50
         8 -> 15         : 16
        16 -> 31         : 3
        32 -> 63         : 78
        64 -> 127        : 2784
       128 -> 255        : 7109
       256 -> 511        : 6120
       512 -> 1023       : 1474
      1024 -> 2047       : 1715
      2048 -> 4095       : 871
      4096 -> 8191       : 504
      8192 -> 16383      : 2558
     16384 -> 32767      : 3485
     32768 -> 65535      : 161
     65536 -> 131071     : 79
    131072 -> 262143     : 61
    262144 -> 524287     : 46
```
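The histograms above resemble the per-disk output of the BCC biolatency tool; assuming that is what was used (the thread doesn't say), they can be reproduced with something like:

```
# Block I/O latency histogram per disk (-D), sampled over one 30-second interval
/usr/share/bcc/tools/biolatency -D 30 1
```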
@matclayton, did you align your HW RAID with the PVs, VGs, and XFS properly? The proper alignment is described in chapter 19: https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.5/html/administration_guide/chap-configuring_red_hat_storage_for_enhancing_performance
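For illustration only (not from the original comment): alignment is typically applied when creating the PV and when formatting XFS, using the stripe geometry of the RAID set. The values below assume a hypothetical 12-disk RAID6 (10 data disks) with a 256 KiB stripe unit:

```
# Hypothetical geometry: 10 data disks x 256 KiB stripe unit = 2560 KiB full stripe
pvcreate --dataalignment 2560k /dev/sdb
vgcreate vg_bricks /dev/sdb
lvcreate -l 100%FREE -n brick1 vg_bricks
mkfs.xfs -f -i size=512 -d su=256k,sw=10 /dev/vg_bricks/brick1
```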
@hunter86bg I debugged this along with @matclayton. After two days of digging, we think this issue is caused by https://bugzilla.redhat.com/show_bug.cgi?id=1676479. @matclayton will confirm after monitoring their cluster for a while.
@pranithk Thank you for the help on this, it was a pain to track down. I can confirm that things still appear to be stable on the read path. We've had some complaints about writes, but have yet to confirm these; I suspect it's a separate issue, or delayed reports coming in. We'll continue to monitor.
To confirm: the performance issue was resolved by running
```
gluster volume set <volname> performance.io-cache off
gluster volume set <volname> performance.read-ahead off
```
and since then it has been plain sailing. I'll close this issue, as it doesn't represent the problem we actually observed.
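(As a quick check, not part of the original thread: whether those options actually took effect on the volume can be verified with `gluster volume get`.)

```
gluster volume get <volname> performance.io-cache
gluster volume get <volname> performance.read-ahead
```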
strace.tar.gz
Description of problem: We have nginx backed by GlusterFS. We've started to see socket backlog issues on nginx and performance problems serving files; on further investigation we believe this is due to GlusterFS and probably the backend bricks. Having talked to Pranith on Slack, I'm opening this issue. We observed several bricks taking 250 ms+ to do LOOKUP and READ filesystem ops; attached is an strace of one of the backend bricks.
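As an aside, the listen-queue pressure mentioned above can be observed directly on the nginx host; a sketch, assuming nginx listens on port 80:

```
# For LISTEN sockets, Recv-Q is the current accept-queue depth and Send-Q the configured backlog
ss -ltn 'sport = :80'
```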
The backend bricks are running XFS on LVM2 on a RAID6 array using a Hardware controller + SSD cache.
The architecture is Nginx (via nginx_vod) -> glusterfs_client fuse mount -> gluster
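For completeness (again illustrative, not the reporter's exact configuration), the FUSE client mount in such a setup commonly looks like the following fstab entry, with host and volume names as placeholders:

```
# Hypothetical Gluster FUSE mount that nginx serves files from
gluster-node1:/myvol  /mnt/gluster  glusterfs  defaults,_netdev  0 0
```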
The exact command to reproduce the issue: Observed latency from nginx
The full output of the command that failed: