gluster / glusterfs

Gluster Filesystem : Build your distributed storage in minutes
https://www.gluster.org
GNU General Public License v2.0

[bug:1478411] Directory listings on fuse mount are very slow due to small number of getdents() entries #910

Closed gluster-ant closed 2 years ago

gluster-ant commented 4 years ago

URL: https://bugzilla.redhat.com/1478411 Creator: nh2-redhatbugzilla at deditus.de Time: 20170804T13:56:14

I have a GlusterFS 3.10 volume mounted with the FUSE mount (mount -t glusterfs); both client and server run Linux.

On it I have a directory with 1 million files in it.

Running find /that/directory takes very long.

Using strace, I believe I discovered (at least part of) the reason:

1501854600.235524 getdents(4, /* 20 entries */, 131072) = 1048
1501854600.235727 getdents(4, /* 20 entries */, 131072) = 1032
1501854600.235922 getdents(4, /* 20 entries */, 131072) = 1032

Despite find issuing getdents() with a large buffer size of 128K, glusterfs only ever fills in about 20 directory entries per call.

Each of those takes a network roundtrip (seemingly).

I also strace'd the brick on the server, where everything seems fine: There getdents() returns typically 631 entries, filling the 32KB buffer which the brick implementation uses for getdents().

If the find on the FUSE mount could also get ~631 entries per call, my directory listing would probably be 30x faster!

So it seems like something in gluster or fuse caps the number of getdents results per call to roughly 20.

What could that be?

gluster-ant commented 4 years ago

Time: 20170804T14:02:07 nh2-redhatbugzilla at deditus.de commented: Here's some more info that might hint at what the problem is:

Using the example program for getdents() from http://man7.org/linux/man-pages/man2/getdents.2.html and running it on my directory, I got this output (file names blacked out with "a"):

getdents(3, /* 16 entries */, 10240) = 1552
--------------- nread=1552 ---------------
inode#    file type  d_reclen  d_off   d_name
-6047993476399939220  regular  88   17156  aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-8239326137575567467  regular  88   17176  aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-9058837543795989278  regular  112  17202  aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-7474310353771725673  regular  112  17228  aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-8618906312059539401  regular  112  17254  aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-7247259159244687795  regular  112  17280  aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-5665523655409565673  regular  88   17300  aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-9046493272173795318  regular  88   17320  aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-7953905749837767518  regular  88   17340  aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-5289646910623071030  regular  88   17360  aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-6314634794173123334  regular  88   17380  aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-7954285670050863136  regular  88   17412  aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-7849401699957688376  regular  88   17432  aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-5006798607229018939  regular  88   17452  aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-8323485281078848697  regular  112  17478  aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-8119158990388255908  regular  112  17504  aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
getdents(3, /* 20 entries */, 10240) = 1056
--------------- nread=1056 ---------------
inode#    file type  d_reclen  d_off   d_name
-8208236586179861811  regular  48  17874  aaaaaaaaaaaaaaaaaaaaaaaaaa
-5119236985543845211  regular  56  17884  aaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-5836971644108853015  regular  48  17894  aaaaaaaaaaaaaaaaaaaaaaaaaa
-9155148501485991780  regular  56  17904  aaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-8305938675436910138  regular  56  17916  aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-5221102094207784962  regular  56  17928  aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-8599523819072935976  regular  48  17938  aaaaaaaaaaaaaaaaaaaaaaaaaa
-5829978250186564000  regular  56  17948  aaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-5911118020253503871  regular  48  17958  aaaaaaaaaaaaaaaaaaaaaaaaaa
-6764000214102234557  regular  56  17968  aaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-7204082494066266145  regular  56  17980  aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-4637505561517422757  regular  48  17990  aaaaaaaaaaaaaaaaaaaaaaaaaa
-9108705813292338787  regular  56  18000  aaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-6331907578899300543  regular  48  18010  aaaaaaaaaaaaaaaaaaaaaaaaaa
-6095357471175923268  regular  56  18020  aaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-6954382210669410793  regular  56  18032  aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-4974133016612012201  regular  48  18042  aaaaaaaaaaaaaaaaaaaaaaaaaa
-5903271582479185642  regular  56  18052  aaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-6924142753799783732  regular  56  18064  aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
-6781216297939192739  regular  48  18074  aaaaaaaaaaaaaaaaaaaaaaaaaa

It seems like when the file names are longer (first block), getdents() returns fewer results -- 16 in the above case instead of the usual 20.

So I wonder if there's some fuse-related buffer that gets filled up and results in getdents() returning so few entries, and whether I can adjust it somehow.
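For reference, the measurement program here is essentially the example from the getdents(2) man page; a condensed sketch of its core loop (BUF_SIZE is the value varied in later comments) looks roughly like this:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Raw getdents() needs its own record type; glibc does not expose one. */
struct linux_dirent {
    unsigned long  d_ino;
    unsigned long  d_off;
    unsigned short d_reclen;
    char           d_name[];
};

#define BUF_SIZE 10240   /* varied in the experiments: 1024, 10240, 131072 */

int main(int argc, char *argv[])
{
    int fd = open(argc > 1 ? argv[1] : ".", O_RDONLY | O_DIRECTORY);
    if (fd == -1) { perror("open"); exit(EXIT_FAILURE); }

    char buf[BUF_SIZE];
    for (;;) {
        /* One getdents() call corresponds to one "nread=..." block above. */
        long nread = syscall(SYS_getdents, fd, buf, BUF_SIZE);
        if (nread == -1) { perror("getdents"); exit(EXIT_FAILURE); }
        if (nread == 0)
            break;                          /* end of directory */

        /* Walk the variable-length records the kernel packed into buf. */
        for (long bpos = 0; bpos < nread; ) {
            struct linux_dirent *d = (struct linux_dirent *)(buf + bpos);
            printf("%ld  %d  %ld  %s\n",
                   (long)d->d_ino, d->d_reclen, (long)d->d_off, d->d_name);
            bpos += d->d_reclen;
        }
    }
    exit(EXIT_SUCCESS);
}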

gluster-ant commented 4 years ago

Time: 20170804T14:39:53 nh2-redhatbugzilla at deditus.de commented: And some more patterns I observed stracing the glusterfs fuse process, using -e 'read,writev':

[pid 18266] 1501856999.667820 read(10, "\27\3\3\0\34", 5) = 5 [pid 18266] 1501856999.668085 read(10, "\357U>\245\325n[t\3\277hq!Z\303\32\247\334\336\327N\311\317s\252\267\2\2", 28) = 28 [pid 18266] 1501856999.668317 read(10, "\27\3\3\0000", 5) = 5 [pid 18266] 1501856999.668411 read(10, "\357U>\245\325n[u8\340\250\214\305\7&/\331=\320\214\326\340\227\16\225@c\252\307\213\211V"..., 48) = 48 [pid 18266] 1501856999.668549 read(10, "\27\3\3@\30", 5) = 5 [pid 18266] 1501856999.668597 read(10, "\357U>\245\325n[v\232\225\22/Jk\237\212\363b\215\212S\255\262K\227\347\6\275V-&E"..., 16408) = 16408 [pid 18266] 1501856999.668669 read(10, "\27\3\3@\30", 5) = 5 [pid 18266] 1501856999.668719 read(10, "\357U>\245\325n[wz\237\v\377\252\236'\356\265\37Z\341\241_m\341\2612\346+Dm\224\233"..., 16408) = 16408 [pid 18266] 1501856999.668810 read(10, "\27\3\3@\30", 5) = 5 [pid 18266] 1501856999.668862 read(10, "\357U>\245\325n[x\226\222\222\274\275\332D\304\3271\335M\340\300wq\210\200suU\372\326\17"..., 16408) = 16408 [pid 18266] 1501856999.668941 read(10, "\27\3\3@\30", 5) = 5 [pid 18266] 1501856999.668988 read(10, "\357U>\245\325n[y\216i$\276\237\vA1\33\31:\312\257g\323\221\227\r^\21R/\3713"..., 16408) = 16408 [pid 18266] 1501856999.669093 read(10, "\27\3\3@\30", 5) = 5 [pid 18266] 1501856999.669140 read(10, "\357U>\245\325n[z\205\224\361D\225V%\t\0tk\274K\3\2530U\202\311\222A\335G\266"..., 16408) = 16408 [pid 18266] 1501856999.669216 read(10, "\27\3\3@\30", 5) = 5 [pid 18266] 1501856999.669265 read(10, "\357U>\245\325n[{\204\276\253\272g\354\376\207hPe\22\300\3771\30\313\336,\2729pgn"..., 16408) = 16408 [pid 18266] 1501856999.669345 read(10, "\27\3\3:L", 5) = 5 [pid 18266] 1501856999.669392 read(10, "\357U>\245\325n[|4mQ\334\227\4\206\274 \273E<?mb\334\255\210Q/\350Z\351w"..., 14924) = 14924 [pid 18266] 1501856999.673053 writev(8, [{iov_base="p\16\0\0\0\0\0\0\333B\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 1501856999.673469 writev(8, [{iov_base="p\16\0\0\0\0\0\0\334B\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 1501856999.673802 writev(8, [{iov_base="p\16\0\0\0\0\0\0\335B\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 1501856999.674173 writev(8, [{iov_base="p\16\0\0\0\0\0\0\336B\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 1501856999.674528 writev(8, [{iov_base="p\16\0\0\0\0\0\0\337B\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 1501856999.674873 writev(8, [{iov_base="p\16\0\0\0\0\0\0\340B\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 1501856999.675237 writev(8, [{iov_base="p\16\0\0\0\0\0\0\341B\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 1501856999.675539 writev(8, [{iov_base="p\16\0\0\0\0\0\0\342B\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 1501856999.675887 writev(8, [{iov_base="p\16\0\0\0\0\0\0\343B\0\0\0\0\0\0", 
iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 1501856999.676248 writev(8, [{iov_base="p\16\0\0\0\0\0\0\344B\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 1501856999.676576 writev(8, [{iov_base="p\16\0\0\0\0\0\0\345B\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 1501856999.676893 writev(8, [{iov_base="p\16\0\0\0\0\0\0\346B\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 1501856999.677159 writev(8, [{iov_base="p\16\0\0\0\0\0\0\347B\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 1501856999.677254 writev(8, [{iov_base="p\16\0\0\0\0\0\0\350B\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 1501856999.677344 writev(8, [{iov_base="p\16\0\0\0\0\0\0\351B\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 1501856999.677436 writev(8, [{iov_base="p\16\0\0\0\0\0\0\352B\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 1501856999.677533 writev(8, [{iov_base="p\16\0\0\0\0\0\0\353B\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 1501856999.677633 writev(8, [{iov_base="p\16\0\0\0\0\0\0\354B\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 1501856999.677730 writev(8, [{iov_base="p\16\0\0\0\0\0\0\355B\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 1501856999.677826 writev(8, [{iov_base="p\16\0\0\0\0\0\0\356B\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 1501856999.677940 writev(8, [{iov_base="p\16\0\0\0\0\0\0\357B\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 1501856999.678049 writev(8, [{iov_base="p\16\0\0\0\0\0\0\360B\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 1501856999.678152 writev(8, [{iov_base="p\16\0\0\0\0\0\0\361B\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 1501856999.678244 writev(8, [{iov_base="p\16\0\0\0\0\0\0\362B\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 1501856999.678338 writev(8, [{iov_base="p\16\0\0\0\0\0\0\363B\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 1501856999.678432 writev(8, [{iov_base="p\16\0\0\0\0\0\0\364B\0\0\0\0\0\0", iov_len=16}, 
{iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 1501856999.678532 writev(8, [{iov_base="p\16\0\0\0\0\0\0\365B\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 1501856999.678630 writev(8, [{iov_base="p\16\0\0\0\0\0\0\366B\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 1501856999.678726 writev(8, [{iov_base="p\16\0\0\0\0\0\0\367B\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 1501856999.678817 writev(8, [{iov_base="p\16\0\0\0\0\0\0\370B\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 1501856999.678912 writev(8, [{iov_base="p\16\0\0\0\0\0\0\371B\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 1501856999.678999 writev(8, [{iov_base="p\16\0\0\0\0\0\0\372B\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 1501856999.679099 writev(8, [{iov_base="p\16\0\0\0\0\0\0\373B\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iovlen=3680}], 2) = 3696 [pid 18266] 1501856999.936511 read(10, "\27\3\3\0\34", 5) = 5 [pid 18266] 1501856999.936797 read(10, "\357U>\245\325n[}\343\35\272\266C\323\r\226\362\275\372\355\1\275\367\177\221]\341", 28) = 28 [pid 18266] 1501856999.937045 read(10, "\27\3\3\0000", 5) = 5 [pid 18266] 1501856999.937188 read(10, "\357U>\245\325n[~]l\37\270\336 L\311~~p\t\260\200\242\275~\331%\310\26UX\210"..., 48) = 48 [pid 18266] 1501856999.937291 read(10, "\27\3\3@\30", 5) = 5 [pid 18266] 1501856999.937342 read(10, "\357U>\245\325n[\177\0U\331\214\236\344cy\10\276\266\322\3447h\2\2668\347\266\20\6JM"..., 16408) = 16408 [pid 18266] 1501856999.937401 read(10, "\27\3\3@\30", 5) = 5 [pid 18266] 1501856999.937436 read(10, "\357U>\245\325n[\200\250\372\274)\2\307\227\33_\221\3639\222\2059aI\340<r~\306rb"..., 16408) = 16408 [pid 18266] 1501856999.937498 read(10, "\27\3\3@\30", 5) = 5 [pid 18266] 1501856999.937533 read(10, "\357U>\245\325n[\2018F\0365\336D(O\211\3370!\v\235\271\0275v\231-\2339^\253"..., 16408) = 16408 [pid 18266] 1501856999.937594 read(10, "\27\3\3@\30", 5) = 5 [pid 18266] 1501856999.937628 read(10, "\357U>\245\325n[\202\353`\344\4\17-/\372\204[\277A\251\310n\2250\32S\276!|\361\333"..., 16408) = 16408 [pid 18266] 1501856999.937680 read(10, "\27\3\3@\30", 5) = 5 [pid 18266] 1501856999.937715 read(10, "\357U>\245\325n[\203qbev\336\305\3750O\307\221U\367 @\262\202[p1\347\231\305\2"..., 16408) = 16408 [pid 18266] 1501856999.937770 read(10, "\27\3\3@\30", 5) = 5 [pid 18266] 1501856999.937819 read(10, "\357U>\245\325n[\204\242\245Y\177\302\251\316u\301\354zR>3{D\6gc\365\302\277un"..., 16408) = 16408 [pid 18266] 1501856999.937871 read(10, "\27\3\3:T", 5) = 5 [pid 18266] 1501856999.937905 read(10, "\357U>\245\325n[\205y}X\3221\325\21\275\321\330\371\353\310\362\21\36}Q\352\203\321\350\1\373"..., 14932) = 14932 [pid 18266] 1501856999.940995 writev(8, [{iov_base="p\16\0\0\0\0\0\0\374B\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 
iov_len=3680}], 2) = 3696 [pid 18224] 1501856999.941429 writev(8, [{iov_base="p\16\0\0\0\0\0\0\375B\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 1501856999.941725 writev(8, [{iov_base="p\16\0\0\0\0\0\0\376B\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 1501856999.941968 writev(8, [{iov_base="p\16\0\0\0\0\0\0\377B\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 1501856999.942312 writev(8, [{iov_base="p\16\0\0\0\0\0\0\0C\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 1501856999.942604 writev(8, [{iov_base="p\16\0\0\0\0\0\0\1C\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 1501856999.942850 writev(8, [{iov_base="p\16\0\0\0\0\0\0\2C\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 1501856999.943087 writev(8, [{iov_base="p\16\0\0\0\0\0\0\3C\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 1501856999.943352 writev(8, [{iov_base="p\16\0\0\0\0\0\0\4C\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 1501856999.943618 writev(8, [{iov_base="p\16\0\0\0\0\0\0\5C\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 1501856999.943896 writev(8, [{iov_base="p\16\0\0\0\0\0\0\6C\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 1501856999.944181 writev(8, [{iov_base="p\16\0\0\0\0\0\0\7C\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 1501856999.944434 writev(8, [{iov_base="p\16\0\0\0\0\0\0\10C\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 1501856999.944665 writev(8, [{iov_base="p\16\0\0\0\0\0\0\tC\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 1501856999.944917 writev(8, [{iov_base="p\16\0\0\0\0\0\0\nC\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 1501856999.945182 writev(8, [{iov_base="p\16\0\0\0\0\0\0\vC\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 1501856999.945445 writev(8, [{iov_base="p\16\0\0\0\0\0\0\fC\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 1501856999.945695 writev(8, [{iov_base="p\16\0\0\0\0\0\0\rC\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 
1501856999.945962 writev(8, [{iov_base="p\16\0\0\0\0\0\0\16C\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 1501856999.946222 writev(8, [{iov_base="p\16\0\0\0\0\0\0\17C\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 1501856999.946467 writev(8, [{iov_base="p\16\0\0\0\0\0\0\20C\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 1501856999.946713 writev(8, [{iov_base="p\16\0\0\0\0\0\0\21C\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 1501856999.946955 writev(8, [{iov_base="p\16\0\0\0\0\0\0\22C\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 1501856999.947218 writev(8, [{iov_base="p\16\0\0\0\0\0\0\23C\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 1501856999.947446 writev(8, [{iov_base="p\16\0\0\0\0\0\0\24C\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 1501856999.947674 writev(8, [{iov_base="p\16\0\0\0\0\0\0\25C\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 1501856999.947904 writev(8, [{iov_base="p\16\0\0\0\0\0\0\26C\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 1501856999.948156 writev(8, [{iov_base="p\16\0\0\0\0\0\0\27C\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 1501856999.948446 writev(8, [{iov_base="p\16\0\0\0\0\0\0\30C\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 1501856999.948694 writev(8, [{iov_base="p\16\0\0\0\0\0\0\31C\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 1501856999.948943 writev(8, [{iov_base="p\16\0\0\0\0\0\0\32C\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 1501856999.949196 writev(8, [{iov_base="p\16\0\0\0\0\0\0\33C\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 1501856999.949442 writev(8, [{iov_base="p\16\0\0\0\0\0\0\34C\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696

The read() syscalls that get the file names over from the gluster server happen in bursts, and the bursts are about 300 ms apart. (I'm using SSL, so the file names aren't visible in the read calls.) They happen in 16 KB buffers (though there are some strange read()s of size 5 in between; not sure what those are for), and looking at the timings they don't seem to be blocked on network roundtrips, so that's good.

But the roughly 4K-sized writev() syscalls each seem to have a network roundtrip in between. That seems strange to me, given that we've just read() all the data before in a batch. What's going on there?
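As a back-of-envelope check (assuming the standard FUSE wire format, where struct fuse_out_header is 16 bytes), each 3696-byte writev() above is consistent with exactly one page-limited FUSE readdir reply:

16 bytes fuse_out_header + 3680 bytes dirent payload = 3696 bytes  (one reply per ~4 KiB readdir buffer)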

gluster-ant commented 4 years ago

Time: 20170804T14:46:38 nh2-redhatbugzilla at deditus.de commented: And for completeness, here's the unfiltered strace that shows that in between the writev()s there's only futex() and readv(). The readv() uses the same FD as the writev(), so that's probably localhost fuse communication.

[pid 18266] 1501857629.747928 read(10, "\357U>\245\325n\200#\360\22/9\370\205lL\322\226gk\233\255\2633[\10R\34j\334,'"..., 16408) = 16408 [pid 18266] 1501857629.747975 read(10, "\27\3\3@\30", 5) = 5 [pid 18266] 1501857629.748012 read(10, "\357U>\245\325n\200$^\213\37\351\tcZ\273\352\5k_\311'\345s\261\21\37:%\364\315\227"..., 16408) = 16408 [pid 18266] 1501857629.748060 read(10, "\27\3\3@\30", 5) = 5 [pid 18266] 1501857629.748091 read(10, "\357U>\245\325n\200%\211I\353\304\252\260\256\250t\257\247Z\5\215\7q3\232\236\217\277\373Y-"..., 16408) = 16408 [pid 18266] 1501857629.748137 read(10, "\27\3\3:P", 5) = 5 [pid 18266] 1501857629.748167 read(10, "\357U>\245\325n\200&\3363W\353\313\273b\344/\20\230\305\265#7\30782\371e\221\365\221\17"..., 14928) = 14928 [pid 18266] 1501857629.750987 writev(8, [{iov_base="p\16\0\0\0\0\0\0\331\311\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18269] 1501857629.751199 <... readv resumed> [{iov_base="P\0\0\0,\0\0\0\332\311\0\0\0\0\0\0\0.\0d\f\177\0\0\0\0\0\0\0\0\0\0"..., iov_len=80}, {iov_base="", iov_len=131072}], 2) = 80 [pid 18269] 1501857629.751321 futex(0x7f0d0c036ab4, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> [pid 18266] 1501857629.751346 poll([{fd=13, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=10, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}], 2, -1 <unfinished ...> [pid 18269] 1501857629.751373 <... futex resumed> ) = 1 [pid 18224] 1501857629.751387 <... futex resumed> ) = 0 [pid 18269] 1501857629.751398 readv(8, <unfinished ...> [pid 18224] 1501857629.751410 futex(0x7f0d0c036a60, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 18224] 1501857629.751459 writev(8, [{iov_base="p\16\0\0\0\0\0\0\332\311\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696 [pid 18224] 1501857629.751536 futex(0x7f0d0c036ab0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1501857749, tv_nsec=0}, 0xffffffff <unfinished ...> [pid 18269] 1501857629.751619 <... readv resumed> [{iov_base="P\0\0\0,\0\0\0\333\311\0\0\0\0\0\0\0.\0d\f\177\0\0\0\0\0\0\0\0\0\0"..., iov_len=80}, {iov_base="", iov_len=131072}], 2) = 80 [pid 18269] 1501857629.751672 futex(0x7f0d0c036ab0, FUTEX_WAKE_PRIVATE, 1) = 1 [pid 18224] 1501857629.751706 <... futex resumed> ) = 0 [pid 18269] 1501857629.751715 readv(8, <unfinished ...> [pid 18224] 1501857629.751726 futex(0x7f0d0c036a60, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 18224] 1501857629.751772 writev(8, [{iov_base="p\16\0\0\0\0\0\0\333\311\0\0\0\0\0\0", iov_len=16}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=3680}], 2) = 3696

gluster-ant commented 4 years ago

Time: 20170906T04:00:28 rgowdapp at redhat commented: getdents returning a small number of entries looks like a duplicate of bz 1356453.

gluster-ant commented 4 years ago

Time: 20170906T06:07:11 csaba at redhat commented:

This bug has been marked as a duplicate of bug 1356453

gluster-ant commented 4 years ago

Time: 20170906T23:01:18 nh2-redhatbugzilla at deditus.de commented: This bug is NOT fixed with Gluster 3.12, I just tried it:

$ gluster --version
glusterfs 3.12.0

$ for x in $(seq 1 1000); do echo $x; touch /myglustermountdir/$x; done

$ strace find /myglustermountdir
...
getdents(4, /* 24 entries */, 131072) = 576
getdents(4, /* 23 entries */, 131072) = 552
getdents(4, /* 23 entries */, 131072) = 552
getdents(4, /* 23 entries */, 131072) = 552
...

gluster-ant commented 4 years ago

Time: 20170915T07:38:42 pgurusid at redhat commented: (In reply to nh2 from comment #6)

This bug is NOT fixed with Gluster 3.12, I just tried it:

$ gluster --version
glusterfs 3.12.0

$ for x in $(seq 1 1000); do echo $x; touch /myglustermountdir/$x; done

$ strace find /myglustermountdir
...
getdents(4, /* 24 entries */, 131072) = 576
getdents(4, /* 23 entries */, 131072) = 552
getdents(4, /* 23 entries */, 131072) = 552
getdents(4, /* 23 entries */, 131072) = 552
...

Thanks for the data. So here is my analysis and possible solution:

  1. You are right, even after the BZ bug 1356453 is fixed, the getdents only fetches 20-26 entries.

  2. The reason is the FUSE kernel module's behaviour: irrespective of the buffer size that getdents passes in, FUSE limits it to PAGE_SIZE, i.e. 4k (4096 bytes), and then sends the request to gluster. Thus, because of FUSE, gluster's getdents buffer is limited to 4k. But if you set the getdents buffer size to 4K and run the same test on plain XFS, ~128 entries are returned, which is much higher than the 20-26 returned by FUSE/Gluster. The reason is that the structure FUSE uses for getdents is (struct fuse_direntplus), while the buffer size specified by the application is in terms of (struct linux_dirent). The sizeof(struct fuse_direntplus) is ~158 and sizeof(struct linux_dirent) is ~24, so the number of dentries that fit into FUSE's 4K getdents buffer is ~5 times smaller than what the application's getdents buffer could accommodate (see the back-of-envelope arithmetic after this list).

    Thus we are limited by the FUSE buffer size and structure type.

  3. But the saving grace is that each getdents call is not a network round trip, as you assumed in your initial comment. This is because of readdir-ahead caching in gluster: Gluster pre-fetches the dentries and keeps them in a cache. The pre-fetch buffer size is 128K and the cache size is 10MB by default. Thus ~99% of the getdents calls get served from this cache and don't result in a network round trip. So even if FUSE fixes the getdents buffer limitation, you may not see the 30x improvement you expected, only some percentage, since it mainly saves getdents calls and kernel-to-user context switches.
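To make the arithmetic in point 2 concrete (a back-of-envelope sketch using the record sizes quoted above and assuming typical name lengths of roughly 10-60 bytes):

FUSE readdir buffer per request:                     4096 bytes (1 page)
per entry as fuse_direntplus:  ~158 bytes + name  -> 4096 / ~190    ≈ 20-26 entries
per entry as linux_dirent:      ~24 bytes + name  -> 4096 / ~32-56  ≈ 70-128 entries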

I would suggest we close this bug on gluster, and raise it in FUSE kernel?

As an alternative, we could implement glfs_getdents in libgfapi (the system-call-equivalent API of GlusterFS, which gets rid of the FUSE interference); that could be integrated into other applications, or we could write a wrapper around it and provide it as a command.
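A rough sketch of what such a libgfapi-based listing tool could look like (the volume name, host and port are placeholders taken from this thread, and the exact signatures should be checked against glfs.h; this is an illustration, not tested code):

#include <dirent.h>
#include <stdio.h>
#include <glusterfs/api/glfs.h>

int main(void)
{
    glfs_t *fs = glfs_new("myvol");                    /* placeholder volume name */
    glfs_set_volfile_server(fs, "tcp", "10.0.0.1", 24007);
    if (glfs_init(fs) != 0) { perror("glfs_init"); return 1; }

    glfs_fd_t *dir = glfs_opendir(fs, "/that/directory");
    if (dir) {
        struct dirent de, *res;
        /* Entries come straight from the gluster client stack; no 4 KiB
         * FUSE readdir buffer is involved. */
        while (glfs_readdir_r(dir, &de, &res) == 0 && res != NULL)
            printf("%s\n", de.d_name);
        glfs_closedir(dir);
    }
    glfs_fini(fs);
    return 0;
}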

gluster-ant commented 4 years ago

Time: 20170915T13:53:27 nh2-redhatbugzilla at deditus.de commented: Thanks for your very insightful reply!

I think we should definitely bring this up as a FUSE issue, I can imagine gluster and other FUSE based software would have much better performance if they weren't limited to 4 KB per syscall.

Do you know where this PAGE_SIZE limit is implemented, or would you even be able to file this issue? I know very little about fuse internals and don't feel prepared to make a high quality issue report on this topic with them yet.

each getdent call is not a network round trip like you mentioned in initial comment

You are right, for a test I just increased latency tenfold with tc qdisc add dev eth0 root netem delay 2ms and the time of getdents() calls stayed the same (with an occasional, much slower getdents() call when it had to fetch new data).

I assumed it was a network roundtrip because the time spent per syscall is roughly my LAN network roundtrip (0.2 ms), but that just happened to be how slow the syscalls were independent of the network.

I would suggest we close this bug on gluster, and raise it in FUSE kernel?

Personally I would prefer if we could keep it open until getdents() performance is fixed; from a Gluster user's perspective, it is a Gluster problem that directory listings are slow, and the fact that FUSE plays a role in it is an implementation detail. Also, as you say, FUSE allowing larger buffer sizes may not be the only thing needed to improve the performance.

I did a couple more measurements that suggest that there are still large integer factors unexplained:

Using strace -f -T -e getdents on the example program from man getdents (http://man7.org/linux/man-pages/man2/getdents.2.html) with BUF_SIZE changed from 1024 to 10240 and 131072 (128K), running it against XFS and the fuse mount like:

$ gcc getdents-listdir.c -O2 -o listdir
$ strace -f -T -e getdents ./listdir /data/brick/.../mydir > /dev/null
$ strace -f -T -e getdents ./listdir /fuse-mount/.../mydir > /dev/null

With strace -T, the value shown in <...> is the time spent in the syscall.

Results for BUF_SIZE = 10240:

gluster fuse: getdents(3, /* 20 entries */, 10240) = 1040   <0.000199>
XFS:          getdents(3, /* 195 entries */, 10240) = 10240 <0.000054>

Results for BUF_SIZE = 131072:

gluster fuse: getdents(3, /* 20 entries */, 131072) = 1040    <0.000199>
XFS:          getdents(3, /* 2498 entries */, 131072) = 131072 <0.000620>

This shows that, almost independent of BUF_SIZE, computing bytes per time:

gluster fuse:   1040 bytes / 0.000199 s  ≈   5 MB/s
XFS:          131072 bytes / 0.000620 s  ≈ 211 MB/s

That's almost a 40x performance difference (and as you say, no networking is involved).

Even when taking into account the mentioned 5x space overhead of fuse_direntplus vs linux_dirent, and assuming that a 5x space overhead means 5x increased wall time, there's still a factor of 8x being lost.

Why might an individual getdents() call be that much slower on fuse?

gluster-ant commented 4 years ago

Time: 20170917T00:01:08 nh2-redhatbugzilla at deditus.de commented: I have now written a kernel patch for fuse that makes the readdir use 32 pages (128 KiB) instead of 1 page (4 KiB):

https://github.com/nh2/linux/commit/bedbf74a1a6d2958e719ed77a6a4f0baac79deab

(on my branch https://github.com/nh2/linux/compare/v4.9-fuse-large-readdir).

For sshfs (another FUSE program) this brings an immediate improvement:

With 1M files prepared like this:

mkdir sshfsmount-1M sshfsdir-1M
cd sshfsdir-1M
seq 1 1000000 | xargs touch
cd ..
sshfs localhost:sshfsdir-1M sshfsmount-1M

and my example program adapted from the getdents() man page (code here: https://gist.github.com/nh2/6ebd9d5befe130fd6faacd1024ead3d7) I get an immediate improvement for ./getdents-silent sshfsmount-1M/:

Without kernel patch (1 page):  0.267021
With 32-page kernel patch:      0.179651

(You have to run twice in quick succession to get these results, because sshfs discards its cache very quickly and we want to measure FUSE syscall overhead, not it fetching the data over sshfs. If you wait too long it may take a minute for a full fetch.)

That's a 1.5x speedup for the entire program run; but because sshfs does some initialisation work, we should look at the actual strace outputs instead:

Without kernel patch (1 page):

strace -tttT -f -e getdents ./listdir-silent sshfsmount-1M/

1505605898.572720 getdents(3, /* 128 entries */, 131072) = 4064 <47.414827>
1505605945.992171 getdents(3, /* 128 entries */, 131072) = 4096 <0.000036>
1505605945.992311 getdents(3, /* 128 entries */, 131072) = 4096 <0.000031>

With 32-page kernel patch:

strace -tttT -f -e getdents ./listdir-silent sshfsmount-1M/

1505605890.435250 getdents(3, /* 4096 entries */, 131072) = 130776 <60.054614>
1505605950.494406 getdents(3, /* 4096 entries */, 131072) = 130736 <0.000153>
1505605950.494693 getdents(3, /* 4096 entries */, 131072) = 130728 <0.000157>

Here you can see first the initial fetching work (which depends on what's in SSHFS's cache at that point), and then the real syscalls.

Using 32 pages has increased the bytes per syscall by 32x and the time per syscall only by ~5x, so for the cached case it's approximately 6x faster.

So landing such a patch seems beneficial to remove syscall overhead.

It would certainly help SSHFS cached directory listings.

Next, back to gluster.

gluster-ant commented 4 years ago

Time: 20170917T00:38:24 csaba at redhat commented: Nice work! I suggest submitting it to the fuse-devel mailing list.


As for glusterfs, another possible issue came up. The buffer that a given xlator fills with dirents is of a fixed size (during the handling of a given readdir[p] fop). However, various xlators operate with various dirent flavors (e.g. posix with system dirents, fuse with fuse dirents), so when the dirent-holding buffer is passed around between xlators, the converted dirents won't fill the next xlator's buffer optimally. In practice the fuse dirent is bigger, so not all of the entries received from the underlying xlator will fit into fuse's dirent buffer after conversion. The rest will be discarded, and will have to be read again on the next getdents call.

  1. But the saving grace is, each getdent call is not a network round trip like you mentioned in initial comment. This is because of readdir-ahead caching in gluster.

Alas, the phenomenon described above defeats this too: because of the re-read constraint, the dir offsets of subsequent getdents calls won't be monotonic, upon which readdir-ahead deactivates itself.

Whether, and at what rate, this occurs might depend on the configuration.
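As an illustration of that monotonicity condition (a toy model, not Gluster code):

#include <stdio.h>

/* A readahead cache can only serve a readdir whose offset continues exactly
 * where the previous reply ended. When fuse-bridge throws away entries it
 * could not fit, the application's next getdents() re-requests an offset the
 * cache has already advanced past, so the request goes to the bricks again. */
struct ra_cache { long next_expected_off; };

static const char *serve(struct ra_cache *c, long requested_off, long served_up_to)
{
    const char *how = (requested_off == c->next_expected_off)
                          ? "served from readdir-ahead cache"
                          : "offset not monotonic -> falls through to bricks";
    c->next_expected_off = served_up_to;
    return how;
}

int main(void)
{
    struct ra_cache c = { .next_expected_off = 0 };
    printf("request at 0:    %s\n", serve(&c, 0, 600));     /* monotonic */
    printf("request at 600:  %s\n", serve(&c, 600, 1200));  /* monotonic */
    /* fuse-bridge dropped entries 450..599, so the kernel re-asks at 450: */
    printf("request at 450:  %s\n", serve(&c, 450, 1050));
    return 0;
}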

@nh2: Therefore we'd like to kindly ask you to share your volume info and TRACE-level logs, from mounting up to the point of observing the small getdents() calls.

gluster-ant commented 4 years ago

Time: 20170917T01:57:39 nh2-redhatbugzilla at deditus.de commented: So for SSHFS we got an improvement from ~113 MB/s to ~854 MB/s getdents() performance.

Now running a similar test on gluster, for time's sake with only 10K files.

My gluster config:

Volume Name: myvol
Type: Replicate
Volume ID: ...
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.0.0.1:/data/brick
Brick2: 10.0.0.2:/data/brick
Brick3: 10.0.0.3:/data/brick
Options Reconfigured:
nfs.disable: on
transport.address-family: inet
client.ssl: on
server.ssl: on
storage.linux-aio: on
performance.io-thread-count: 64
performance.readdir-ahead: on
server.event-threads: 32
client.event-threads: 32
server.outstanding-rpc-limit: 64
cluster.lookup-unhashed: auto
performance.flush-behind: on
performance.strict-write-ordering: off
performance.high-prio-threads: 64
performance.normal-prio-threads: 64
performance.low-prio-threads: 64
performance.write-behind-window-size: 10MB
cluster.ensure-durability: on
performance.lazy-open: yes
cluster.use-compound-fops: off
performance.open-behind: on
features.cache-invalidation: off
performance.quick-read: off
performance.read-ahead: off
performance.stat-prefetch: off
changelog.rollover-time: 1
geo-replication.indexing: on
geo-replication.ignore-pid-check: on
changelog.changelog: on

Files created like

touch /glustermount/largenames-10k/1234567890123456789012345678901234567890-file{1..10000}

Also doing quickly repeated runs to obtain these numbers. All of these benchmarks were taken on an AWS t2.small medium machine (these machines are generally bad for benchmarking, but I've checked that the results are consistent; and importantly the numbers are relative to each other).

Without kernel patch (1 page):

strace -w -c -f -e getdents ./listdir-silent /glustermount/largenames-10k/

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.213868         384       557           getdents

With 32-page kernel patch:

strace -w -c -f -e getdents ./listdir-silent /glustermount/largenames-10k

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.211732       11763        18           getdents

Almost no improvement!

Let's look at the individual getdents() invocations:

Without kernel patch (1 page):

strace -Tttt -f -e getdents ./listdir-silent /glustermount/largenames-10k/

1505608150.612771 getdents(3, /* 19 entries */, 131072) = 1272 <0.007789>
-- notice large initial time above --
1505608150.620606 getdents(3, /* 18 entries */, 131072) = 1296 <0.000117>
1505608150.620738 getdents(3, /* 18 entries */, 131072) = 1296 <0.000090>
1505608150.620842 getdents(3, /* 18 entries */, 131072) = 1296 <0.000084>
-- 30 similar lines omitted --
1505608150.623666 getdents(3, /* 18 entries */, 131072) = 1296 <0.010920>
-- notice large time above --
1505608150.634608 getdents(3, /* 18 entries */, 131072) = 1296 <0.000079>
1505608150.634701 getdents(3, /* 18 entries */, 131072) = 1296 <0.000083>

With 32-page kernel patch:

strace -Tttt -f -e getdents ./listdir-silent /glustermount/largenames-10k/

1505608076.391872 getdents(3, /* 604 entries */, 131072) = 43392 <0.022552>
1505608076.414921 getdents(3, /* 602 entries */, 131072) = 43344 <0.001411>
1505608076.416477 getdents(3, /* 601 entries */, 131072) = 43272 <0.025793>
1505608076.442688 getdents(3, /* 599 entries */, 131072) = 43128 <0.000730>
1505608076.443521 getdents(3, /* 601 entries */, 131072) = 43272 <0.024431>
1505608076.468360 getdents(3, /* 599 entries */, 131072) = 43128 <0.001108>
-- 12 similar lines omitted --

Observations:

1) Every second getdents() in the 32-page case is super fast. I suspect that this is the effect of readdir-ahead: on (wasn't that also 128 KB, I read somewhere?). Same thing for the 1-page case, where only every ~30th getdents() hits the network to glusterfsd.

2) Even in the 32-page case, not the entire buffer provided to getdents() (131072 bytes) is filled; only approximately 1/3 is used. Same thing for the 1-page case: about 1/3 of the 4K fuse buffer is filled. When I first saw this I attributed it to the "fuse_direntplus vs linux_dirent" topic that Poornima mentioned, but SSHFS doesn't seem to have this problem and always fills its buffers, so maybe gluster is doing something strange here?

3) The performance is at a miserable ~1.7 MB/s in both cases, much worse than sshfs.

So I started investigating what it (the glusterfs FUSE mount process) is doing.

First thing I noticed: I was wrong when I said

and as you say, no networking is involved

after doing my test with tc: I had only tested that no inter-brick, or FUSE to other-brick networking was happening.

But what is certainly happening is localhost networking!

If I use

tc qdisc add dev lo root netem delay 200ms

thus making my ping localhost report ~400 ms, then all getdents() calls get significantly slower:

strace -Tttt -f -e getdents ./listdir-silent /mount/glustermount-10k

1505580070.060286 getdents(3, /* 604 entries */, 131072) = 43392 <0.824180>
1505580070.885190 getdents(3, /* 602 entries */, 131072) = 43344 <0.000909>
1505580070.886592 getdents(3, /* 601 entries */, 131072) = 43272 <0.824842>
1505580071.711986 getdents(3, /* 599 entries */, 131072) = 43128 <0.000693>
1505580071.713164 getdents(3, /* 601 entries */, 131072) = 43272 <0.826009>
1505580072.539722 getdents(3, /* 599 entries */, 131072) = 43128 <0.000702>

So there is networking happening, in my case it just happened over the lo interface instead of over eth0 because the machine on which I have the mount is also one of the 3-replica bricks!

A quick ping localhost shows that localhost latency is ~50 us; I find that to be a lot (but apparently that's just how it is), and it's probably not very relevant here (at least the following strace doesn't indicate so on the userspace level).

So I ran strace against the glusterfs FUSE process:

strace -tttT -f -p THEPID

What I could see was a FUSE request/response loop: There is a readv() against /dev/fuse started, with which the process asks fuse for the next operation (e.g. a READDIR opcode); this readv() blocks as long as the FUSE kernel side is still waiting on us, the userspace process, to send it the result of the currently running operation. Once we send it (e.g. the contents of a readdir response) with a writev(), the readv() returns and gives us the next opcode to work on.

As a result, the fulfillment of each getdents() request is bracketed between a <... readv resumed> and a final writev(8, ...) syscall.
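Schematically, the loop described above looks like this (a simplification for illustration; the real code is gluster's fuse-bridge, which uses readv() with a header/payload iovec pair, and the reply construction is elided here):

#include <linux/fuse.h>
#include <stdio.h>
#include <sys/uio.h>
#include <unistd.h>

static void fuse_loop_sketch(int fuse_fd)
{
    static char inbuf[135168];          /* large enough for any FUSE request */

    for (;;) {
        /* Blocks until the kernel has the next request for us, e.g. a
         * FUSE_READDIRPLUS issued on behalf of a blocked getdents(). */
        ssize_t n = read(fuse_fd, inbuf, sizeof(inbuf));
        if (n < (ssize_t)sizeof(struct fuse_in_header))
            break;
        struct fuse_in_header *in = (struct fuse_in_header *)inbuf;

        /* ... hand in->opcode to the translator stack, wait for the bricks'
         * answers, build the dirent payload ... */

        struct fuse_out_header out = {
            .len    = sizeof(out),      /* plus payload length in reality */
            .error  = 0,
            .unique = in->unique,       /* ties the reply to the request */
        };
        struct iovec iov[1] = { { .iov_base = &out, .iov_len = sizeof(out) } };

        /* Only after this write does the application's getdents() return,
         * and only then does the next read() yield a fresh request. */
        writev(fuse_fd, iov, 1);
    }
}

int main(void) { (void)fuse_loop_sketch; return 0; }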

I have posted one of those brackets at

https://gist.github.com/nh2/163ffea5bdc16b3a509c4b262b1d382a

Each such getdents() fulfillment takes ~29 ms. I find that quite a lot just to fill in ~40000 bytes as above, so I analysed what the major time sinks are within these 29 ms in this strace output of the glusterfs FUSE mount process. You can find that analysis here:

https://gist.github.com/nh2/163ffea5bdc16b3a509c4b262b1d382a#gistcomment-2205164

Quite some CPU time is spent (again, seemingly way too much for providing 40000 Bytes), but also significant time is spent waiting on communication with the glusterfsd socket (which is FD 10), with poll() and apparently a blocking write().

(There's also some precisely sized reads, such as those 5-byte reads, for which I wonder if they could be combined with larger reads to reduce the amount of syscalls, but they are so fast compared to the rest that's going on that they don't matter for the current investigation).

As there's lots of waiting for glusterfsd in there, I started to strace glusterfsd instead, while running ./listdir-silent against the fuse mount.

That immediately revealed why getdents() on glusterfs is so much slower than on SSHFS (sorry long lines):

[pid 972] 1505589830.074176 lstat("/data/brick/.glusterfs/00/00/00000000-0000-0000-0000-000000000001/largenames-10k/1234567890123456789012345678901234567890-file7552", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0 <0.000007>
[pid 972] 1505589830.074219 lgetxattr("/data/brick/.glusterfs/00/00/00000000-0000-0000-0000-000000000001/largenames-10k/1234567890123456789012345678901234567890-file7552", "tr"..., 0x7f5bf697e370, 255) = -1 ENODATA (No data available) <0.000008>
[pid 972] 1505589830.074262 lgetxattr("/data/brick/.glusterfs/00/00/00000000-0000-0000-0000-000000000001/largenames-10k/1234567890123456789012345678901234567890-file7552", "tr"..., 0x7f5bf697e370, 255) = -1 ENODATA (No data available) <0.000007>
[pid 972] 1505589830.074302 lgetxattr("/data/brick/.glusterfs/00/00/00000000-0000-0000-0000-000000000001/largenames-10k/1234567890123456789012345678901234567890-file7552", "tr"..., 0x7f5bf697e370, 255) = -1 ENODATA (No data available) <0.000006>
[pid 972] 1505589830.074343 lgetxattr("/data/brick/.glusterfs/00/00/00000000-0000-0000-0000-000000000001/largenames-10k/1234567890123456789012345678901234567890-file7552", "sy"..., 0x7f5bf697e370, 255) = -1 ENODATA (No data available) <0.000007>
[pid 972] 1505589830.074383 lgetxattr("/data/brick/.glusterfs/00/00/00000000-0000-0000-0000-000000000001/largenames-10k/1234567890123456789012345678901234567890-file7552", "sy"..., 0x7f5bf697e370, 255) = -1 ENODATA (No data available) <0.000007>
[pid 972] 1505589830.074423 lgetxattr("/data/brick/.glusterfs/00/00/00000000-0000-0000-0000-000000000001/largenames-10k/1234567890123456789012345678901234567890-file7552", "tr"..., 0x7f5bf697e370, 255) = -1 ENODATA (No data available) <0.000007>
[pid 972] 1505589830.074464 lstat("/data/brick/.glusterfs/00/00/00000000-0000-0000-0000-000000000001/largenames-10k/1234567890123456789012345678901234567890-file7553", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0 <0.000006>
-- same lgetxattr() calls as above repeated --

Apparently when gluster tries to fulfil a getdents() FUSE request, each individual file returned by my XFS brick file system is stat()ed afterwards and also gets 6 lgetxattr() calls (all returning ENODATA).

I guess that's just part of how gluster works (e.g. to determine based on xattrs whether a file shall actually be shown in the output or not), but it wasn't obvious to me when I started debugging this, and it certainly makes the business much slower than just forwarding some underlying XFS getdents() results.
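To get a feeling for the cost of that per-entry work in isolation, here is a small stand-alone sketch (my own illustration, not Gluster code; the xattr name is a placeholder that simply returns ENODATA) that replays the observed pattern of one lstat() plus six lgetxattr() calls per directory entry:

#include <dirent.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/xattr.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *dir = argc > 1 ? argv[1] : ".";
    if (chdir(dir) != 0) { perror("chdir"); return 1; }

    DIR *d = opendir(".");
    if (!d) { perror("opendir"); return 1; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    long files = 0;
    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        struct stat st;
        char buf[256];
        if (lstat(e->d_name, &st) != 0)
            continue;
        for (int i = 0; i < 6; i++)        /* the six ENODATA lgetxattr()s */
            lgetxattr(e->d_name, "user.does-not-exist", buf, sizeof(buf));
        files++;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%ld entries, %.3f s, %.1f us/entry\n",
           files, secs, 1e6 * secs / (files ? files : 1));
    closedir(d);
    return 0;
}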

As mentioned in my Github gist from before, in contrast to stracing glusterfs, when stracing glusterfsd the strace overhead actually becomes very noticeable (the ./listdir-silent run becomes 5x slower when strace is attached), and the above -ttt times are not accurate and incorporate this strace overhead (you see the <...> times emitted by -T, roughly 6 us each, don't add up to the ~40 us steps -ttt timestamps).

So I switched to measure with perf instead, using e.g.

perf record -e 'syscalls:sys_*' -p 918

which had 6x overhead for an entire ./listdir-silent run over the 10k files, and then, for lower overhead, specific syscalls (since then I already knew what syscalls were going on, so I couldn't miss any in between),

perf record -e 'syscalls:sys_enter_newlstat' -e 'syscalls:sys_exit_newlstat' -e 'syscalls:sys_enter_lgetxattr' -e 'syscalls:sys_exit_lgetxattr' -p 918

perf script

which shows much less overhead in the profile:

26239.494406: syscalls:sys_enter_newlstat
26239.494408: syscalls:sys_exit_newlstat
26239.494409: syscalls:sys_enter_lgetxattr
26239.494411: syscalls:sys_exit_lgetxattr
26239.494412: syscalls:sys_enter_lgetxattr
26239.494413: syscalls:sys_exit_lgetxattr
26239.494414: syscalls:sys_enter_lgetxattr
26239.494416: syscalls:sys_exit_lgetxattr
26239.494417: syscalls:sys_enter_lgetxattr
26239.494418: syscalls:sys_exit_lgetxattr
26239.494419: syscalls:sys_enter_lgetxattr
26239.494420: syscalls:sys_exit_lgetxattr
26239.494421: syscalls:sys_enter_lgetxattr
26239.494423: syscalls:sys_exit_lgetxattr
26239.494424: syscalls:sys_enter_newlstat
26239.494426: syscalls:sys_exit_newlstat

and also increased total run time only by ~2x to ~0.5 seconds.

I suspect that the profile shows that the amount of syscalls gluster makes on each file (and the fact that it has to make per-file syscalls for getdents()) is somewhat problematic: Each file eats ~18 us, that's roughly 50K files per second, or 100K per second assuming 2x profiling overhead. Even if there was no networking involved at all, just stat()ing & getfattr()ing the 10K files would take 0.1 seconds, which is half of the 0.2 seconds that the entire ./listdir-silent takes.

I wonder if some future version of Gluster should not store all its metadata info in extended attributes, as putting it there requires lots of syscalls to retrieve it? I guess this is one reason for Gluster's low small-file and directory performance?

I suspect that if it had all this info in memory, or could obtain it via a few-system-calls-large-reads method (e.g. a single index file or even mmap), it could do better at this.

gluster-ant commented 4 years ago

Time: 20170917T01:59:41 nh2-redhatbugzilla at deditus.de commented: @Csaba: FYI the above post wasn't written in reply to yours, I was still typing more stuff down (and have one more post coming), will take a look at your reply immediately afterwards! :)

gluster-ant commented 4 years ago

Time: 20170917T02:02:18 nh2-redhatbugzilla at deditus.de commented: Another weird thing that I haven't demystified yet:

strace -wc -f -e getdents,lstat ls -1U /glustermount/largenames-10k/ > /dev/null

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.208856       11603        18           getdents
------ ----------- ----------- --------- --------- ----------------
100.00    0.208856                    18           total

strace -wc -f -e getdents,lstat ls -lU /glustermount/largenames-10k/ > /dev/null

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 74.99    0.168202          17     10001           lstat
 25.01    0.056089        3116        18           getdents
------ ----------- ----------- --------- --------- ----------------
100.00    0.224291                 10019           total

Notice the difference between ls -1U and ls -lU: the latter stat()s all files and the former doesn't.

For some weird reason, when I also stat() as above with -l, getdents() magically gets faster!

It would make sense to me if the presence of getdents() made stat()s faster (e.g. as part of some stat-prefetch, though I have that off according to my posted volume config), but not the other way around.

What might be going on here?

gluster-ant commented 4 years ago

Time: 20170917T02:10:44 nh2-redhatbugzilla at deditus.de commented: (In reply to Csaba Henk from comment #10)

@nh2: Therefore we'd like to kindly ask you to share your volume info and TRACE-level logs, from mounting up to the point of observing the small getdents() calls.

@Csaba: OK, I have read your comment now.

My volume info is in the post further above; would you mind briefly describing (or linking to) how I can set the log level to TRACE, and which exact file(s) I should provide?

gluster-ant commented 4 years ago

Time: 20170918T06:20:28 pgurusid at redhat commented: I haven't yet gone through the comments in detail, that's a lot of analysis. Thank you.

Meanwhile, you mentioned that stat and getxattr calls are being performed; we have stat-prefetch, which caches the stat and xattrs of a file/dir. Please execute the following on your test system and see if you get improvements:

$ gluster vol set <volname> group metadata-cache

Also please increase the value of network.inode-lru-limit to 200000 and check if it improves further:

$ gluster vol set <volname> network.inode-lru-limit 200000

Also there are a few patches (WIP) that were the result of debugging performance issues in readdir: https://review.gluster.org/#/c/17985/ https://review.gluster.org/#/c/18242/

I suggest trying these patches if possible, together with the metadata-cache settings, to see whether the performance reaches a satisfactory level.

gluster-ant commented 4 years ago

Time: 20170918T08:17:43 rgowdapp at redhat commented: (In reply to nh2 from comment #14)

(In reply to Csaba Henk from comment #10)

@nh2: Therefore we'd like to kindly ask you to share your volume info and TRACE-level logs, from mounting up to the point of observing the small getdents() calls.

@Csaba: OK, I have read your comment now.

My volume info is in the post further above; would you mind briefly describing (or linking to) how I can set the log level to TRACE, and which exact file(s) I should provide?

Please set the following options to get logs at TRACE log level:

gluster volume set <volname> diagnostics.client-log-level TRACE

gluster volume set <volname> diagnostics.brick-log-level TRACE

After this, run your tests and attach the logfiles of the gluster mount and brick processes (usually found under /var/log/glusterfs/ and /var/log/glusterfs/bricks).

gluster-ant commented 4 years ago

Time: 20170918T10:38:47 bugzilla-bot at gluster.org commented: REVIEW: https://review.gluster.org/18312 (cluster/dht: don't overfill the buffer in readdir(p)) posted (#1) for review on master by Raghavendra G (rgowdapp@redhat.com)

gluster-ant commented 4 years ago

Time: 20170918T10:42:18 bugzilla-bot at gluster.org commented: REVIEW: https://review.gluster.org/18312 (cluster/dht: don't overfill the buffer in readdir(p)) posted (#2) for review on master by Raghavendra G (rgowdapp@redhat.com)

gluster-ant commented 4 years ago

Time: 20170918T11:20:06 rgowdapp at redhat commented: (In reply to Worker Ant from comment #18)

REVIEW: https://review.gluster.org/18312 (cluster/dht: don't overfill the buffer in readdir(p)) posted (#2) for review on master by Raghavendra G (rgowdapp@redhat.com)

Note that since the volume configuration used here has just one dht subvolume (1x3), patch #18312 won't affect the test-case used in this bz. Since I happened to notice the issue in dht while looking into this bz, I used this bz.

With hindsight, I should've realized this bug is not a duplicate of bz 1356453. But again, I couldn't find the volume info. As long as dht has a single subvolume, it's unlikely that dht affects readdir(p) performance.

gluster-ant commented 4 years ago

Time: 20170919T09:44:32 rgowdapp at redhat commented:

With 32-page kernel patch:

As you've noted earlier, this will be equal to the default readdir-ahead request size. I am wondering what the behaviour would be if we increased the readdir-ahead request size by a factor of 10. You can use the following option:

Option: performance.rda-request-size
Default Value: 131072
Description: size of buffer in readdirp calls initiated by readdir-ahead

gluster volume set <volname> performance.rda-request-size 1310720

gluster-ant commented 4 years ago

Time: 20171121T00:26:56 rgowdapp at redhat commented: Another related bug: bz 1294936

gluster-ant commented 4 years ago

Time: 20171206T09:57:18 bugzilla-bot at gluster.org commented: REVIEW: https://review.gluster.org/18947 (cluster/dht: don't overfill the buffer in readdir(p)) posted (#1) for review on release-3.12 by Raghavendra G

gluster-ant commented 4 years ago

Time: 20171206T10:09:38 bugzilla-bot at gluster.org commented: REVIEW: https://review.gluster.org/18949 (cluster/dht: populate inode in dentry for single subvolume dht) posted (#1) for review on release-3.12 by Raghavendra G

gluster-ant commented 4 years ago

Time: 20171208T05:35:27 bugzilla-bot at gluster.org commented: COMMIT: https://review.gluster.org/18949 committed in release-3.12 by "Raghavendra G" rgowdapp@redhat.com with a commit message: cluster/dht: populate inode in dentry for single subvolume dht

... in readdirp response if dentry points to a directory inode. This is a special case where the entire layout is stored in one single subvolume and hence no need for lookup to construct the layout

Change-Id: I44fd951e2393ec9dac2af120469be47081a32185 BUG: 1492625 Signed-off-by: Raghavendra G rgowdapp@redhat.com

(cherry picked from commit 59d1cc720f52357f7a6f20bb630febc6a622c99c) Change-Id: I44fd951e2393ec9dac2af120469be47081a32185 BUG: 1478411 Signed-off-by: Raghavendra G rgowdapp@redhat.com

gluster-ant commented 4 years ago

Time: 20171208T05:37:16 bugzilla-bot at gluster.org commented: COMMIT: https://review.gluster.org/18947 committed in release-3.12 by "Raghavendra G" rgowdapp@redhat.com with a commit message: cluster/dht: don't overfill the buffer in readdir(p)

Superfluous dentries that cannot fit in the buffer size provided by the kernel are thrown away by fuse-bridge. This means the effort spent fetching and converting those entries is wasted; they have to be read again on the next readdir(p).

So, the best strategy would be to fill the buffer optimally - neither overfill nor underfill.

Change-Id: Idb3d85dd4c08fdc4526b2df801d49e69e439ba84 BUG: 1492625 Signed-off-by: Raghavendra G rgowdapp@redhat.com (cherry picked from commit e785faead91f74dce7c832848f2e8f3f43bd0be5) Change-Id: Idb3d85dd4c08fdc4526b2df801d49e69e439ba84 BUG: 1478411 Signed-off-by: Raghavendra G rgowdapp@redhat.com

gluster-ant commented 4 years ago

Time: 20171219T07:17:49 jthottan at redhat commented: This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.12.4, please open a new bug report.

glusterfs-3.12.4 has been announced on the Gluster mailing lists [1], and packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/gluster-devel/2017-December/054093.html [2] https://www.gluster.org/pipermail/gluster-users/

gluster-ant commented 4 years ago

Time: 20171229T02:03:05 nh2-redhatbugzilla at deditus.de commented: Hey, just wanted to let you know that this bug seems not fixed with glusterfs-3.12.4.

I still get getdents() calls returning only ~20 entries:

getdents(3, /* 20 entries */, 131072) = 1040
getdents(3, /* 20 entries */, 131072) = 1040
getdents(3, /* 20 entries */, 131072) = 1040
getdents(3, /* 6 entries */, 131072) = 312
getdents(3, /* 0 entries */, 131072) = 0

The patch above also speaks of reduced CPU usage which I haven't tested yet.

I haven't yet tested with my kernel patch that allows FUSE readdir() to use more than 1 page; I will retest with that.

gluster-ant commented 4 years ago

Time: 20171229T02:57:24 nh2-redhatbugzilla at deditus.de commented: I have retested with my FUSE kernel patch.

The issue remains as described in my comments 11 and 13: Every second getdents() call is still very slow:

strace -ftttT -e getdents ls -lU largenames-10k/ > /dev/null

1514515933.550686 getdents(3, /* 604 entries */, 131072) = 43392 <0.021847>
1514515933.587737 getdents(3, /* 603 entries */, 131072) = 43416 <0.001550>
1514515933.599690 getdents(3, /* 601 entries */, 131072) = 43272 <0.012524>
1514515933.624767 getdents(3, /* 601 entries */, 131072) = 43272 <0.001638>
1514515933.636231 getdents(3, /* 601 entries */, 131072) = 43272 <0.020983>
1514515933.670622 getdents(3, /* 601 entries */, 131072) = 43272 <0.001606>
1514515933.682672 getdents(3, /* 601 entries */, 131072) = 43272 <0.023609>
1514515933.721695 getdents(3, /* 601 entries */, 131072) = 43272 <0.001725>
1514515933.731581 getdents(3, /* 601 entries */, 131072) = 43272 <0.019112>
1514515933.764764 getdents(3, /* 601 entries */, 131072) = 43272 <0.001663>
1514515933.777859 getdents(3, /* 601 entries */, 131072) = 43272 <0.024405>
1514515933.816322 getdents(3, /* 601 entries */, 131072) = 43272 <0.003288>
1514515933.830961 getdents(3, /* 601 entries */, 131072) = 43272 <0.033194>
1514515933.877653 getdents(3, /* 601 entries */, 131072) = 43272 <0.001487>
1514515933.887763 getdents(3, /* 601 entries */, 131072) = 43272 <0.021309>
1514515933.923075 getdents(3, /* 601 entries */, 131072) = 43272 <0.001484>
1514515933.934226 getdents(3, /* 381 entries */, 131072) = 27432 <0.001453>
1514515933.943821 getdents(3, /* 0 entries */, 131072) = 0 <0.000040>
1514515933.967422 +++ exited with 0 +++

And ls -lU remains faster than ls -1U:

strace -fcw -e getdents ls -1U largenames-10k/ > /dev/null

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.313788       17433        18           getdents
------ ----------- ----------- --------- --------- ----------------
100.00    0.313788                    18           total

strace -fcw -e getdents ls -lU largenames-10k/ > /dev/null

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.120517        6695        18           getdents
------ ----------- ----------- --------- --------- ----------------
100.00    0.120517                    18           total

gluster-ant commented 4 years ago

Time: 20171229T03:35:31 nh2-redhatbugzilla at deditus.de commented: Created attachment 1373661 brick trace log

gluster-ant commented 4 years ago

Time: 20171229T03:36:07 nh2-redhatbugzilla at deditus.de commented: Created attachment 1373662 mount trace log

gluster-ant commented 4 years ago

Time: 20171229T03:38:57 nh2-redhatbugzilla at deditus.de commented: Above I have attached the TRACE logs as requested in comment 16.

This is during the ls invocation:

strace -ftttT -e getdents ls -1U largenames-10k/ > /dev/null

1514518214.946014 getdents(3, /* 604 entries */, 131072) = 43392 <0.220178>
1514518215.166640 getdents(3, /* 603 entries */, 131072) = 43416 <0.066448>
1514518215.233425 getdents(3, /* 601 entries */, 131072) = 43272 <0.195806>
1514518215.429612 getdents(3, /* 601 entries */, 131072) = 43272 <0.000559>
1514518215.430538 getdents(3, /* 601 entries */, 131072) = 43272 <0.139956>
1514518215.602667 getdents(3, /* 601 entries */, 131072) = 43272 <0.008902>
1514518215.611953 getdents(3, /* 601 entries */, 131072) = 43272 <0.050002>
1514518215.662359 getdents(3, /* 601 entries */, 131072) = 43272 <0.000514>
1514518215.663226 getdents(3, /* 601 entries */, 131072) = 43272 <0.117737>
1514518215.781354 getdents(3, /* 601 entries */, 131072) = 43272 <0.000590>
1514518215.782331 getdents(3, /* 601 entries */, 131072) = 43272 <0.107279>
1514518215.890031 getdents(3, /* 601 entries */, 131072) = 43272 <0.000557>
1514518215.890959 getdents(3, /* 601 entries */, 131072) = 43272 <0.093604>
1514518215.984941 getdents(3, /* 601 entries */, 131072) = 43272 <0.000542>
1514518215.985842 getdents(3, /* 601 entries */, 131072) = 43272 <0.126207>
1514518216.112435 getdents(3, /* 601 entries */, 131072) = 43272 <0.000568>
1514518216.113365 getdents(3, /* 381 entries */, 131072) = 27432 <0.035200>
1514518216.148826 getdents(3, /* 0 entries */, 131072) = 0 <0.000178>
1514518216.149234 +++ exited with 0 +++

This operation is bracketed by setting the logging level from INFO to TRACE beforehand, and back from TRACE to INFO afterwards.

gluster-ant commented 4 years ago

Time: 20171229T04:41:20 nh2-redhatbugzilla at deditus.de commented: I have now tried all the things suggested in comment 15 (those patches I tested by using 3.13.1).

None of them reduce the total amount of time spent in getdents() to less than what's shown in the strace above.

gluster-ant commented 4 years ago

Time: 20180102T06:43:31 rgowdapp at redhat commented: (In reply to nh2 from comment #32)

I have now tried all the things suggested in comment 15 (those patches I tested by using 3.13.1).

None of them reduce the total amount of time spent in getdents() to less than what's shown in the strace above.

Thanks for your persistent efforts. We do acknowledge that readdirplus on bricks is a bottleneck. But the effort required to improve that is significant, and hence no major work is currently in progress.

However, the readdir-ahead translator tries to hide the readdirplus latency on bricks from the application by initiating a stream of readdirplus calls (starting when the application does an opendir) even before the application has done a readdir. So, I am very much interested in your results from tests done in a setup with both of the following suggestions:

gluster-ant commented 4 years ago

Time: 20180102T10:49:03 nh2-redhatbugzilla at deditus.de commented: @ Raghavendra G:

I can't set rda-request-size that high, because:

volume set: failed: '1310720' in 'option rda-request-size 1310720' is out of range [4096 - 131072]

Can I just bump that limit (the .max = 131072 in readdir-ahead.c), or do I need to make other modifications to be able to use a higher value?

gluster-ant commented 4 years ago

Time: 20180102T12:20:17 nh2-redhatbugzilla at deditus.de commented: @ Raghavendra G:

OK, so with the rda-request-size limit bumped in readdir-ahead.c, and set to 1310720 on the volume, I get:

# gluster volume get myvol performance.rda-request-size
performance.rda-request-size            1310720

# strace -Te getdents ls -1U largenames-10k/ > /dev/null
getdents(3, /* 604 entries */, 131072)  = 43392 <0.102401>
getdents(3, /* 603 entries */, 131072)  = 43416 <0.001757>
getdents(3, /* 601 entries */, 131072)  = 43272 <0.001937>
getdents(3, /* 601 entries */, 131072)  = 43272 <0.007563>
getdents(3, /* 601 entries */, 131072)  = 43272 <0.001913>
getdents(3, /* 601 entries */, 131072)  = 43272 <0.003381>
getdents(3, /* 601 entries */, 131072)  = 43272 <0.011234>
getdents(3, /* 601 entries */, 131072)  = 43272 <0.018437>
getdents(3, /* 601 entries */, 131072)  = 43272 <0.003767>
getdents(3, /* 601 entries */, 131072)  = 43272 <0.002458>
getdents(3, /* 591 entries */, 131072)  = 42552 <0.011717>
getdents(3, /* 601 entries */, 131072)  = 43272 <0.020970>
getdents(3, /* 601 entries */, 131072)  = 43272 <0.001548>
getdents(3, /* 601 entries */, 131072)  = 43272 <0.001877>
getdents(3, /* 601 entries */, 131072)  = 43272 <0.002279>
getdents(3, /* 601 entries */, 131072)  = 43272 <0.001776>
getdents(3, /* 391 entries */, 131072)  = 28152 <0.001263>
getdents(3, /* 0 entries */, 131072)    = 0 <0.000211>

That seems to do something!

The getdents() calls still return the same ~40k bytes as before, but the durations of the calls have changed:

Where before every ~2nd getdents() took 0.1 seconds, a long call now happens only at the beginning for largenames-10k/.

However, that initial call now takes much longer, so that the total time for all getdents() calls together remains the same as in comment #28 and there is no time improvement overall (see another strace further down this post that shows this).

With largenames-100k:

getdents(3, /* 604 entries */, 131072)  = 43392 <0.163892>
getdents(3, /* 602 entries */, 131072)  = 43344 <0.003254>
getdents(3, /* 599 entries */, 131072)  = 43128 <0.003669>
getdents(3, /* 599 entries */, 131072)  = 43128 <0.010538>
getdents(3, /* 599 entries */, 131072)  = 43128 <0.007300>
getdents(3, /* 599 entries */, 131072)  = 43128 <0.017373>
getdents(3, /* 599 entries */, 131072)  = 43128 <0.003721>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.008421>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.016774>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.004317>
getdents(3, /* 593 entries */, 131072)  = 42696 <0.006771>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.132223>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.003357>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.003823>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.008572>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.006946>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.005080>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.135234>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.008050>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.017798>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.015384>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.006461>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.079356>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.003682>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.008991>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.009290>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.005059>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.017135>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.009058>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.012989>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.007093>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.006234>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.011052>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.117044>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.003382>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.003615>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.010157>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.009356>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.053361>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.005857>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.018655>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.007252>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.015048>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.005514>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.099730>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.056949>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.003871>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.018724>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.003966>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.017172>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.003817>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.013045>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.011726>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.007532>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.016091>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.086207>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.003886>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.042044>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.005362>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.003997>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.017465>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.009604>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.030751>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.007936>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.015482>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.007543>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.114697>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.003654>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.003635>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.010172>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.007981>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.016444>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.003645>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.005491>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.018165>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.003652>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.014853>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.109726>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.007251>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.009851>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.007128>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.018354>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.129458>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.013488>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.006543>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.006349>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.013647>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.007909>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.052205>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.003597>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.003964>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.012631>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.003580>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.020874>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.003798>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.003759>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.014521>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.007281>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.015407>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.117141>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.003797>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.003647>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.012896>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.005515>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.015374>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.007836>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.013366>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.008682>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.144207>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.016566>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.052704>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.003120>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.004119>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.013704>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.003501>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.003531>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.016954>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.005842>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.014438>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.007029>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.014823>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.114327>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.003237>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.011245>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.003815>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.006510>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.005199>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.019505>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.008927>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.003749>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.015448>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.007190>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.179178>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.044503>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.013554>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.007634>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.005732>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.016517>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.005310>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.015247>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.006601>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.010271>
getdents(3, /* 595 entries */, 131072)  = 42840 <0.008384>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.096027>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.003394>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.003681>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.011489>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.005808>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.018045>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.005568>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.013506>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.007447>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.006223>
getdents(3, /* 587 entries */, 131072)  = 42264 <0.005347>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.219264>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.003878>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.014202>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.004581>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.012905>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.030729>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.041575>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.007530>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.005656>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.005534>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.008494>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.005954>
getdents(3, /* 598 entries */, 131072)  = 43056 <0.020728>
getdents(3, /* 140 entries */, 131072)  = 10080 <0.002299>
getdents(3, /* 0 entries */, 131072)    = 0 <0.001825>

That is changed too: only around every 10th call takes 0.1 seconds. But in sum, this still takes as much time as it did before; it's just spent in larger clusters.

Another thing I notice is that performance degrades when the directory is being written to in parallel with the getdents() calls (e.g. with a touch largenames-100k/{1..N} still running in the background): then every single getdents() call takes ~0.1 seconds.

I have another question about what you said:

You mentioned that the "translator readdir-ahead tries to hide the readdirplus latency on bricks". Is readdir-latency on bricks really an issue? For example, ls -1U /data/glusterfs/myvol/brick1/brick/largenames-100k/ takes only 0.15 seconds real time on the brick when the directory is currently being written to, and 0.06 seconds if it's not being written to. On the largenames-10k on the brick, it's another 10x faster than for 100k, as expected. An individual getdents(3, /* 455 entries */, 32768) = 32760 call on the brick takes between 40 and 400 micro-seconds (average 80 us) on the brick when the directory is being written to, and 4x less when it's not being written to. So I would find it very surprising if readdir latency on the brick FS itself could make any noticeable impact; can you explain it a bit?
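For completeness, a sketch of how the brick-side numbers above can be reproduced (paths and directory names follow the examples used earlier in this thread; run directly on the brick host):

# raw getdents() latency on the brick filesystem, bypassing the gluster client stack
strace -Te getdents ls -1U /data/glusterfs/myvol/brick1/brick/largenames-100k/ > /dev/null

# wall-clock time for the same listing; compare with and without a parallel writer
time ls -1U /data/glusterfs/myvol/brick1/brick/largenames-100k/ > /dev/null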

Comparing directly the impact of rda-request-size 1310720 vs 131072:

With rda-request-size 1310720 (10x higher than default):

# gluster vol set myvol performance.rda-request-size 1310720
volume set: success

# strace -Te getdents ls -1U largenames-10k/ > /dev/null
getdents(3, /* 604 entries */, 131072)  = 43392 <0.128277>
getdents(3, /* 603 entries */, 131072)  = 43416 <0.001492>
getdents(3, /* 601 entries */, 131072)  = 43272 <0.001961>
getdents(3, /* 601 entries */, 131072)  = 43272 <0.001765>
getdents(3, /* 601 entries */, 131072)  = 43272 <0.006883>
getdents(3, /* 601 entries */, 131072)  = 43272 <0.003683>
getdents(3, /* 601 entries */, 131072)  = 43272 <0.003105>
getdents(3, /* 601 entries */, 131072)  = 43272 <0.015737>
getdents(3, /* 601 entries */, 131072)  = 43272 <0.001787>
getdents(3, /* 601 entries */, 131072)  = 43272 <0.001689>
getdents(3, /* 591 entries */, 131072)  = 42552 <0.012255>
getdents(3, /* 601 entries */, 131072)  = 43272 <0.033856>
getdents(3, /* 601 entries */, 131072)  = 43272 <0.001385>
getdents(3, /* 601 entries */, 131072)  = 43272 <0.001721>
getdents(3, /* 601 entries */, 131072)  = 43272 <0.001668>
getdents(3, /* 601 entries */, 131072)  = 43272 <0.002190>
getdents(3, /* 391 entries */, 131072)  = 28152 <0.001552>
getdents(3, /* 0 entries */, 131072)    = 0 <0.000254>
+++ exited with 0 +++

# strace -wcfe getdents ls -1U largenames-10k/ > /dev/null
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.210427       11690        18           getdents
------ ----------- ----------- --------- --------- ----------------
100.00    0.210427                    18           total

With rda-request-size 131072 (default):

# gluster vol set myvol performance.rda-request-size 131072
volume set: success

# strace -Te getdents ls -1U largenames-10k/ > /dev/null
getdents(3, /* 604 entries */, 131072)  = 43392 <0.021472>
getdents(3, /* 602 entries */, 131072)  = 43344 <0.001393>
getdents(3, /* 601 entries */, 131072)  = 43272 <0.025918>
getdents(3, /* 599 entries */, 131072)  = 43128 <0.001342>
getdents(3, /* 601 entries */, 131072)  = 43272 <0.027952>
getdents(3, /* 599 entries */, 131072)  = 43128 <0.001415>
getdents(3, /* 601 entries */, 131072)  = 43272 <0.024628>
getdents(3, /* 599 entries */, 131072)  = 43128 <0.001430>
getdents(3, /* 601 entries */, 131072)  = 43272 <0.025017>
getdents(3, /* 599 entries */, 131072)  = 43128 <0.001329>
getdents(3, /* 601 entries */, 131072)  = 43272 <0.033111>
getdents(3, /* 599 entries */, 131072)  = 43128 <0.001497>
getdents(3, /* 601 entries */, 131072)  = 43272 <0.033382>
getdents(3, /* 599 entries */, 131072)  = 43128 <0.001328>
getdents(3, /* 601 entries */, 131072)  = 43272 <0.034641>
getdents(3, /* 599 entries */, 131072)  = 43128 <0.001393>
getdents(3, /* 396 entries */, 131072)  = 28512 <0.013024>
getdents(3, /* 0 entries */, 131072)    = 0 <0.000068>
+++ exited with 0 +++

# strace -wcfe getdents ls -1U largenames-10k/ > /dev/null
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.218633       12146        18           getdents
------ ----------- ----------- --------- --------- ----------------
100.00    0.218633                    18           total

In the above we can see: The same total time is spent, but with the larger rda-request-size 1310720 it's spent mostly upfront.

That begs the question: What is gluster doing that makes the first getdents() call take an entire 0.12 seconds? As mentioned before, time ls -1U /data/glusterfs/benacofs-staging/brick1/brick/largenames-10k/ > /dev/null on the brick takes 0.007 seconds, so it can't be brick FS latency. It also can't be network latency, as the ping between the machines is 0.0004 seconds. And it can't be a network throughput bottleneck, as the total throughput here is 6 MB/s (comparable to the 5 MB/s I reported earlier), and my network has a much faster link.

My suspicion: Gluster is burning this time as CPU cycles.

The current TIME+ shown in htop for glusterfs and glusterfsd are:

5:47.62 glusterfs 4:52.75 glusterfsd

and not moving.

After I run `time ls -1U largenames-100k/ > /dev/null`, which takes `real 0m3.284s`, they are:

5:48.51 glusterfs 4:54.56 glusterfsd

In relative times, the increase is:

0.89 glusterfs 1.81 glusterfsd

This is a very large chunk of the 3.28 seconds that the ls takes.

Also consider: the total amount of data returned by getdents() for largenames-100k/ is around 7 MB. glusterfsd spends 1.81 seconds of CPU time to produce these 7 MB. That makes ~3.8 MB/s of CPU throughput, which is not a reasonable number for CPU processing. Gluster must be doing something extremely inefficient to spend so much CPU time on so little data.
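A minimal sketch of the same CPU-time attribution without htop, assuming a standard procps ps (process names as above):

# snapshot the accumulated CPU time of the gluster client and brick daemons,
# run the listing, then snapshot again and compare the deltas
ps -o comm,cputime -C glusterfs,glusterfsd
time ls -1U largenames-100k/ > /dev/null
ps -o comm,cputime -C glusterfs,glusterfsd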

gluster-ant commented 4 years ago

Time: 20180117T08:00:42 rgowdapp at redhat commented:

I have another question about what you said:

You mentioned that the "translator readdir-ahead tries to hide the readdirplus latency on bricks". Is readdir-latency on bricks really an issue? For example, ls -1U /data/glusterfs/myvol/brick1/brick/largenames-100k/ takes only 0.15 seconds real time on the brick when the directory is currently being written to, and 0.06 seconds if it's not being written to. On the largenames-10k on the brick, it's another 10x faster than for 100k, as expected. An individual getdents(3, /* 455 entries */, 32768) = 32760 call on the brick takes between 40 and 400 micro-seconds (average 80 us) on the brick when the directory is being written to, and 4x less when it's not being written to. So I would find it very surprising if readdir latency on the brick FS itself could make any noticeable impact; can you explain it a bit?

Glusterfs does the following things on a brick during a readdirplus call (a readdirplus - like NFS readdirplus - is a readdir plus collection of metadata such as xattrs and stat for each dentry, so that the kernel doesn't have to do a lookup/stat for each dentry; by default every readdir is converted to a readdirplus):

So, there is list processing involved here. Note also that this list processing happens more than once at different layers. This list processing could be the reason for cpu usage. Note that we can try some more experiments:

Experiment 1 - Isolate the overhead required to access metadata of each dentry:

Experiment 2 - Isolate the overhead of iterating through list of dentries:

In this experiment we'll fetch the metadata of each dentry, but without using readdirplus. As I mentioned above, different xlators traverse the list within the scope of a single readdirplus call, so there are multiple iterations. However, we can still access the metadata of each dentry without using readdirplus: in this case the kernel initiates lookup/stat calls for each dentry it fetched through readdir. The crucial difference is that list processing is minimal in glusterfs, as most of the xlators are not interested in the dentries themselves (they are interested in the inode pointed to by the dentry; since readdir doesn't carry inode/stat information, they just pass the list to higher layers without processing it).

We can compare results from these two experiments with the information provided in comment #35.

[1] http://lists.gluster.org/pipermail/gluster-users/2017-March/030148.html

gluster-ant commented 4 years ago

Time: 20180117T08:02:47 rgowdapp at redhat commented: (In reply to Raghavendra G from comment #36)

I have another question about what you said:

You mentioned that the "translator readdir-ahead tries to hide the readdirplus latency on bricks". Is readdir-latency on bricks really an issue? For example, ls -1U /data/glusterfs/myvol/brick1/brick/largenames-100k/ takes only 0.15 seconds real time on the brick when the directory is currently being written to, and 0.06 seconds if it's not being written to. On the largenames-10k on the brick, it's another 10x faster than for 100k, as expected. An individual getdents(3, /* 455 entries */, 32768) = 32760 call on the brick takes between 40 and 400 micro-seconds (average 80 us) on the brick when the directory is being written to, and 4x less when it's not being written to. So I would find it very surprising if readdir latency on the brick FS itself could make any noticeable impact; can you explain it a bit?

Glusterfs does the following things on a brick during a readdirplus call (a readdirplus - like NFS readdirplus - is a readdir plus collection of metadata such as xattrs and stat for each dentry, so that the kernel doesn't have to do a lookup/stat for each dentry; by default every readdir is converted to a readdirplus):

  • getdents
  • For each dentry returned by getdents do:
    • lstat (path pointed out by dentry)
    • multiple getxattr calls to fetch various xattrs storing glusterfs metadata on the file.
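A rough bash illustration of that per-dentry work (illustration only; the brick xlators do this internally in C, and the brick path here just follows the examples used earlier in this thread):

cd /data/glusterfs/myvol/brick1/brick/largenames-10k
time for f in *; do
    stat "$f" > /dev/null                            # roughly the per-dentry lstat
    getfattr -d -m . -e hex "$f" > /dev/null 2>&1    # the xattr fetches holding gluster metadata
done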

So, there is list processing involved here. Note also that this list processing happens more than once at different layers. This list processing could be the reason for cpu usage. Note that we can try some more experiments:

Experiment 1 - Isolate the overhead required to access metadata of each dentry:

==

  • Disable readdirplus entirely in glusterfs stack. Refer to [1] for details on how to do it.
  • Run a crawler like find or ls, but without accessing each dentry itself. IOW, we just list the contents of the directory. This will help us find out how much of the overhead is spent in collecting metadata for the contents of the directory.

We can use "find . > /dev/null" to do this.

Experiment 2 - Isolate the overhead of iterating through list of dentries:

In this experiment we'll fetch the metadata of each dentry, but without using readdirplus. As I mentioned above, different xlators traverse the list within the scope of a single readdirplus call, so there are multiple iterations. However, we can still access the metadata of each dentry without using readdirplus: in this case the kernel initiates lookup/stat calls for each dentry it fetched through readdir. The crucial difference is that list processing is minimal in glusterfs, as most of the xlators are not interested in the dentries themselves (they are interested in the inode pointed to by the dentry; since readdir doesn't carry inode/stat information, they just pass the list to higher layers without processing it).

  • Disable readdirplus entirely in glusterfs stack. Refer to [1] for details on how to do it.
  • Run a crawler like find or ls, but this time let it access each file. We can use commands like "ls -lR" or find . -exec stat {} \;

We can compare results from these two experiments with the information provided in comment #35.

[1] http://lists.gluster.org/pipermail/gluster-users/2017-March/030148.html
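A hedged sketch of how the two experiments might be driven from a client; the mount point, server name and the use of the use-readdirp mount option are assumptions on my part, and [1] describes the full set of toggles for disabling readdirplus:

# remount the volume with readdirplus disabled on the fuse client
umount /mnt/glustervol
mount -t glusterfs -o use-readdirp=no server:/myvol /mnt/glustervol

# Experiment 1: list only, no per-entry metadata access
time find /mnt/glustervol/largenames-10k > /dev/null

# Experiment 2: list and explicitly stat every entry
time find /mnt/glustervol/largenames-10k -exec stat {} \; > /dev/null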

gluster-ant commented 4 years ago

Time: 20180117T08:19:44 rgowdapp at redhat commented: (In reply to Raghavendra G from comment #36)

I have another question about what you said:

You mentioned that the "translator readdir-ahead tries to hide the readdirplus latency on bricks". Is readdir-latency on bricks really an issue? For example, ls -1U /data/glusterfs/myvol/brick1/brick/largenames-100k/ takes only 0.15 seconds real time on the brick when the directory is currently being written to, and 0.06 seconds if it's not being written to. On the largenames-10k on the brick, it's another 10x faster than for 100k, as expected. An individual getdents(3, /* 455 entries */, 32768) = 32760 call on the brick takes between 40 and 400 micro-seconds (average 80 us) on the brick when the directory is being written to, and 4x less when it's not being written to. So I would find it very surprising if readdir latency on the brick FS itself could make any noticeable impact; can you explain it a bit?

Glusterfs does the following things on a brick during a readdirplus call (a readdirplus - like NFS readdirplus - is a readdir plus collection of metadata such as xattrs and stat for each dentry, so that the kernel doesn't have to do a lookup/stat for each dentry; by default every readdir is converted to a readdirplus):

  • getdents
  • For each dentry returned by getdents do:
    • lstat (path pointed out by dentry)
    • multiple getxattr calls to fetch various xattrs storing glusterfs metadata on the file.

So, there is list processing involved here. Note also that this list processing happens more than once at different layers. This list processing could be the reason for cpu usage. Note that we can try some more experiments:

Experiment 1 - Isolate the overhead required to access metadata of each dentry:

==

  • Disable readdirplus entirely in glusterfs stack. Refer to [1] for details on how to do it.
  • Run a crawler like find or ls, but without accessing each dentry itself. IOW, we just list the contents of the directory. This will help us find out how much of the overhead is spent in collecting metadata for the contents of the directory.

Experiment 2 - Isolate the overhead of iterating through list of dentries:

In this experiment we'll fetch the metadata of each dentry, but without using readdirplus. As I mentioned above, different xlators traverse the list within the scope of a single readdirplus call, so there are multiple iterations. However, we can still access the metadata of each dentry without using readdirplus: in this case the kernel initiates lookup/stat calls for each dentry it fetched through readdir. The crucial difference is that list processing is minimal in glusterfs, as most of the xlators are not interested in the dentries themselves (they are interested in the inode pointed to by the dentry; since readdir doesn't carry inode/stat information, they just pass the list to higher layers without processing it).

But this adds an extra latency for a lookup call/response to traverse the glusterfs stack and network for each dentry. Hence this skews the results. So, the delta (as compared to the numbers we saw in comment #35) that we actually see is:

delta (D) = latency of readdirplus (R) + total lookup latency (L) - time spent processing the dentry list in various xlators (I)

where L = n * (latency of a single lookup) and n is the number of dentries in the list.

So, I = R + L - D

We can deduce the total lookup latency L by comparing the two experiments:

L = total time taken in experiment 2 (T2) - total time taken in experiment 1 (T1)

So,

I = R + (T2 - T1) - D

  • Disable readdirplus entirely in glusterfs stack. Refer to [1] for details on how to do it.
  • Run a crawler like find or ls, but this time let it access each file. We can use commands like "ls -lR" or find . -exec stat {} \;

We can compare results from these two experiments with the information provided in comment #35.

[1] http://lists.gluster.org/pipermail/gluster-users/2017-March/030148.html
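To make the bookkeeping above concrete, a toy calculation with purely invented numbers (not measurements from this thread):

# I = R + (T2 - T1) - D, all values in seconds and entirely hypothetical
R=3.3    # readdirplus-based crawl time (the comment #35 style measurement)
T1=2.0   # experiment 1 total (readdir only, readdirplus disabled)
T2=6.0   # experiment 2 total (readdir + per-entry stat, readdirplus disabled)
D=4.5    # observed delta versus the readdirplus numbers
echo "list-processing time I = $(echo "$R + ($T2 - $T1) - $D" | bc) seconds"
# prints: list-processing time I = 2.8 seconds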

gluster-ant commented 4 years ago

Time: 20181023T14:54:32 srangana at redhat commented: Release 3.12 has been EOL'd and this bug was still found to be in the NEW state, hence moving the version to mainline to triage it and take appropriate action.

stale[bot] commented 4 years ago

Thank you for your contributions. Noticed that this issue is not having any activity in last ~6 months! We are marking this issue as stale because it has not had recent activity. It will be closed in 2 weeks if no one responds with a comment here.

nh2 commented 4 years ago

It will be closed in 2 weeks if no one responds with a comment here.

I'm no longer using gluster, so I cannot tell whether this bug I originally filed still exists, but in the interest of helping the project (given that I used it for a while and it brought me great value):

Auto-closing bots are dumb, an anti-productivity feature, and an admission of not having a proper bug triage process. Design problems like this do not disappear because nobody has commented on them, and nobody will bother commenting every 6 months to keep them alive.

Using bots like this you will lose accurate information about what the unsolved problems are, and frustrate community contributors (I've seen it many times; it is common for active Github users to be subscribed to 1000s of issues, and if a bot pings each of them every 6 months, this creates tens of useless emails per day).

stale[bot] commented 4 years ago

Closing this issue as there was no update since my last update on issue. If this is an issue which is still valid, feel free to open it.

nh2 commented 4 years ago

Closing this issue as there was no update since my last update on issue.

That's not true, I posted...

If this is an issue which is still valid, feel free to open it.

Normal people cannot reopen issues, so I cannot. Whoever decided to use the stale bot could have considered how Github works first.

amarts commented 4 years ago

Normal people cannot reopen issues, so I cannot. Whoever decided to use the stale bot could have considered how Github works first.

Not sure why stalebot didn't consider your comment! Will surely fix that part!

pranithk commented 4 years ago

It will be closed in 2 weeks if no one responds with a comment here.

I'm no longer using gluster, so I cannot tell whether this bug I originally filed still exists, but in the interest of helping the project (given that I used it for a while and it brought me great value):

Auto-closing bots are dumb, an anti-productivity feature, and an admission of not having a proper bug triage process. Design problems like this do not disappear because nobody has commented on them, and nobody will bother commenting every 6 months to keep them alive.

Using bots like this you will lose accurate information about what the unsolved problems are, and frustrate community contributors (I've seen it many times; it is common for active Github users to be subscribed to 1000s of issues, and if a bot pings each of them every 6 months, this creates tens of useless emails per day).

Thanks for the feedback. I will raise this in the community meeting.

Nottt commented 3 years ago

I still have this issue, any hope of fixing it?

stale[bot] commented 2 years ago

Thank you for your contributions. Noticed that this issue is not having any activity in last ~6 months! We are marking this issue as stale because it has not had recent activity. It will be closed in 2 weeks if no one responds with a comment here.

stale[bot] commented 2 years ago

Closing this issue as there was no update since my last update on issue. If this is an issue which is still valid, feel free to open it.