Open LillianZ opened 4 years ago
There's a middleware on SFR using Dir#[]
. The microbenchmark shows that it's ~1.6x slower than MRI.
I can reproduce the difference locally on Linux:
ruby 2.7.3p183 (2021-04-05 revision 6847ee089d) [x86_64-linux]
0.000000 0.000860 0.000860 ( 0.000862)
Warming up --------------------------------------
Dir#[] 126.000 i/100ms
Calculating -------------------------------------
Dir#[] 1.288k (± 1.1%) i/s - 6.552k in 5.089233s
0.000000 0.000807 0.000807 ( 0.000809)
truffleruby 21.2.0-dev-a28e3da1, like ruby 2.7.3, GraalVM CE Native [x86_64-linux]
0.374936 0.016222 0.391158 ( 0.063269)
Warming up --------------------------------------
Dir#[] 20.000 i/100ms
Calculating -------------------------------------
Dir#[] 206.467 (±16.5%) i/s - 1.000k in 5.017073s
0.031048 0.002053 0.033101 ( 0.004951)
$ ruby --cpusampler --cpusampler.Mode=roots bench_dir_glob.rb
----------------------------------------------------------------------------------------------------------------------------------------------------
Sampling Histogram. Recorded 12653 samples with period 1ms.
Self Time: Time spent on the top of the stack.
Total Time: Time spent somewhere on the stack.
Opt %: Percent of time spent in compiled and therefore non-interpreted code.
----------------------------------------------------------------------------------------------------------------------------------------------------
Thread: Thread[main,5,main]
Name | Total Time | Opt % || Self Time | Opt % | Location
----------------------------------------------------------------------------------------------------------------------------------------------------
Truffle::POSIX.truffleposix_readdir | 4584ms 36.2% | 99.4% || 2405ms 19.0% | 99.3% | resource:/truffleruby/core/posix.rb~134-168:3839-5070
Truffle::POSIX.truffleposix_lstat_mode | 3111ms 24.6% | 99.5% || 2386ms 18.9% | 99.5% | resource:/truffleruby/core/posix.rb~134-168:3839-5070
FFI::Pointer#get_string | 1682ms 13.3% | 99.8% || 1679ms 13.3% | 99.8% | resource:/truffleruby/core/truffle/ffi/pointer.rb~144-146:4775-4899
String#<=> | 1532ms 12.1% | 99.5% || 1532ms 12.1% | 99.5% | resource:/truffleruby/core/string.rb~1525-1547:40019-40731
Truffle::POSIX.opendir | 665ms 5.3% | 97.7% || 607ms 4.8% | 98.5% | resource:/truffleruby/core/posix.rb~134-168:3839-5070
block (2 levels) in POSIX.attach_function_eagerly | 578ms 4.6% | 98.6% || 565ms 4.5% | 98.8% | resource:/truffleruby/core/posix.rb~135-138:3897-4131
Array#bottom_up_merge | 1056ms 8.3% | 98.2% || 466ms 3.7% | 96.8% | resource:/truffleruby/core/array.rb~1365-1384:31295-31780
Symbol#== | 359ms 2.8% | 100.0% || 358ms 2.8% | 100.0% | (core)~1:0
Dir::Glob::RecursiveDirectories#call | 10310ms 81.5% | 96.4% || 344ms 2.7% | 74.7% | resource:/truffleruby/core/dir_glob.rb~93-127:3143-4310
Truffle::POSIX.closedir | 361ms 2.9% | 99.4% || 340ms 2.7% | 99.4% | resource:/truffleruby/core/posix.rb~134-168:3839-5070
FFI::Pointer#read_string_to_null | 1943ms 15.4% | 99.8% || 259ms 2.0% | 99.6% | resource:/truffleruby/core/truffle/ffi/pointer_extra.rb~85-87:3357-3456
Truffle::POSIX.truffleposix_stat_mode | 257ms 2.0% | 98.1% || 211ms 1.7% | 97.6% | resource:/truffleruby/core/posix.rb~134-168:3839-5070
Array#isort! | 1150ms 9.1% | 99.2% || 208ms 1.6% | 97.6% | resource:/truffleruby/core/array.rb~1452-1477:33316-33746
Dir::Glob::EntryMatch#call | 3775ms 29.8% | 99.2% || 172ms 1.4% | 98.3% | resource:/truffleruby/core/dir_glob.rb~215-231:6804-7322
File.fnmatch | 322ms 2.5% | 98.8% || 121ms 1.0% | 98.3% | resource:/truffleruby/core/file.rb~668-681:21612-22069
Dir#read | 4714ms 37.3% | 99.3% || 99ms 0.8% | 90.9% | resource:/truffleruby/core/dir.rb~99-114:3110-3527
Array#each | 614ms 4.9% | 99.3% || 93ms 0.7% | 100.0% | (core)~1:0
Kernel#StringValue | 106ms 0.8% | 100.0% || 84ms 0.7% | 100.0% | resource:/truffleruby/core/kernel.rb~170-172:5816-5898
String#include? | 176ms 1.4% | 100.0% || 74ms 0.6% | 100.0% | resource:/truffleruby/core/string.rb~150-152:4643-4748
Dir::Glob.path_join | 69ms 0.5% | 97.1% || 69ms 0.5% | 97.1% | resource:/truffleruby/core/dir_glob.rb~337-345:10022-10221
Array#<< | 55ms 0.4% | 98.2% || 55ms 0.4% | 98.2% | (core)~1:0
Dir::Glob::Match#match? | 374ms 3.0% | 98.9% || 54ms 0.4% | 100.0% | resource:/truffleruby/core/dir_glob.rb~185-187:6066-6136
Dir::Glob::Node#path_join | 137ms 1.1% | 89.8% || 45ms 0.4% | 75.6% | resource:/truffleruby/core/dir_glob.rb~51-53:2110-2203
Dir#initialize_internal | 706ms 5.6% | 97.9% || 33ms 0.3% | 90.9% | resource:/truffleruby/core/dir.rb~48-63:2166-2582
String#encoding | 27ms 0.2% | 100.0% || 27ms 0.2% | 100.0% | resource:/truffleruby/core/string.rb~1521-1523:39949-40016
Dir::Glob::Node#separator | 23ms 0.2% | 95.7% || 23ms 0.2% | 95.7% | resource:/truffleruby/core/dir_glob.rb~47-49:2053-2107
Truffle::Type.rb_convert_type | 22ms 0.2% | 100.0% || 22ms 0.2% | 100.0% | resource:/truffleruby/core/type.rb~263-273:9204-9581
String#force_encoding | 21ms 0.2% | 100.0% || 21ms 0.2% | 100.0% | (core)~1:0
Dir#close | 382ms 3.0% | 98.7% || 19ms 0.2% | 89.5% | resource:/truffleruby/core/dir.rb~116-122:3530-3664
Quite some time spent in native calls, which is expected. Maybe we can group some of them (e.g., do we need Truffle::POSIX.truffleposix_lstat_mode
?)
Something else we see in the profile is the cost of sorting the returned entries.
I'm wondering if it really makes the same native calls as MRI - like I can imagine it accidentally being written to stat each component of every path each time, rather than stating a component once and then stating its children once etc. Or something else like that.
We could potentially compare with strace for system calls.
One way to avoid the extra lstat
(seems used to find out if a given entry is a directory) would be to use readdir's d_type
field, but:
unsigned char d_type; /* Type of file; not supported
by all filesystem types */
Currently, only some filesystems (among them: Btrfs, ext2, ext3, and ext4) have full support for returning the file type in d_type. All applications must properly handle a return of DT_UNKNOWN.
So we'd need a fallback to lstat. And know whether that works on macOS too.
@eregon I suspect d_type
works on macOS after seeing some straces from MRI. There appears to be 2 extra lstat in TruffleRuby for every file path currently. I believe this is due to not using d_type
.
We also appear to have some duplicated calls to POSIX.opendir
on some directories. On mac I captured this CPU sample:
Name | Total Time | Opt % || Self Time | Opt % | Location
---------------------------------------------------------------------------------------------------------------------------------------------------------
Truffle::POSIX.opendir | 4327ms 43.8% | 96.4% || 4285ms 43.4% | 96.8% | src/main/ruby/truffleruby/core/posix.rb~134-168:3839-5070
Truffle::POSIX.truffleposix_readdir | 2040ms 20.7% | 89.7% || 1605ms 16.3% | 92.0% | src/main/ruby/truffleruby/core/posix.rb~134-168:3839-5070
Truffle::POSIX.truffleposix_lstat_mode | 1198ms 12.1% | 90.2% || 1083ms 11.0% | 93.4% | src/main/ruby/truffleruby/core/posix.rb~134-168:3839-5070
Truffle::POSIX.closedir | 473ms 4.8% | 91.8% || 468ms 4.7% | 92.1% | src/main/ruby/truffleruby/core/posix.rb~134-168:3839-5070
sample from strace output: CRuby:
openat(AT_FDCWD, "src/main/java/org/truffleruby/interop", O_RDONLY|O_CLOEXEC|O_DIRECTORY) = 8
fstat(8, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
fcntl(8, F_GETFL) = 0x18000 (flags O_RDONLY|O_LARGEFILE|O_DIRECTORY)
fcntl(8, F_SETFD, FD_CLOEXEC) = 0
getdents64(8, 0xd9e450 /* 19 entries */, 32768) = 880
getdents64(8, 0xd9e450 /* 0 entries */, 32768) = 0
close(8) = 0
TruffleRuby:
openat(AT_FDCWD, "src/main/java/org/truffleruby/interop", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 7
fstat(7, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
getdents64(7, 0x559ac7ad1780 /* 19 entries */, 32768) = 880
lstat("src/main/java/org/truffleruby/interop/InteropNodes.java", {st_mode=S_IFREG|0644, st_size=117806, ...}) = 0
lstat("src/main/java/org/truffleruby/interop/BoxedValue.java", {st_mode=S_IFREG|0644, st_size=1187, ...}) = 0
lstat("src/main/java/org/truffleruby/interop/TranslateInteropExceptionNode.java", {st_mode=S_IFREG|0644, st_size=6267, ...}) = 0
lstat("src/main/java/org/truffleruby/interop/IsStringLikeNode.java", {st_mode=S_IFREG|0644, st_size=1589, ...}) = 0
lstat("src/main/java/org/truffleruby/interop/RubyToForeignArgumentsNode.java", {st_mode=S_IFREG|0644, st_size=2043, ...}) = 0
lstat("src/main/java/org/truffleruby/interop/SymbolToByteOrderNode.java", {st_mode=S_IFREG|0644, st_size=1793, ...}) = 0
lstat("src/main/java/org/truffleruby/interop/InteropManager.java", {st_mode=S_IFREG|0644, st_size=1093, ...}) = 0
lstat("src/main/java/org/truffleruby/interop/PolyglotNodes.java", {st_mode=S_IFREG|0644, st_size=7073, ...}) = 0
lstat("src/main/java/org/truffleruby/interop/TranslateInteropRubyExceptionNode.java", {st_mode=S_IFREG|0644, st_size=9477, ...}) = 0
lstat("src/main/java/org/truffleruby/interop/ToJavaStringWithDefaultNode.java", {st_mode=S_IFREG|0644, st_size=1404, ...}) = 0
lstat("src/main/java/org/truffleruby/interop/ToJavaStringNode.java", {st_mode=S_IFREG|0644, st_size=4141, ...}) = 0
lstat("src/main/java/org/truffleruby/interop/ForeignToRubyArgumentsNode.java", {st_mode=S_IFREG|0644, st_size=2418, ...}) = 0
lstat("src/main/java/org/truffleruby/interop/ForeignToRubyNode.java", {st_mode=S_IFREG|0644, st_size=1588, ...}) = 0
lstat("src/main/java/org/truffleruby/interop/FromJavaStringNode.java", {st_mode=S_IFREG|0644, st_size=2235, ...}) = 0
lstat("src/main/java/org/truffleruby/interop/ProxyForeignObject.java", {st_mode=S_IFREG|0644, st_size=3029, ...}) = 0
lstat("src/main/java/org/truffleruby/interop/OutgoingForeignCallNode.java", {st_mode=S_IFREG|0644, st_size=26132, ...}) = 0
lstat("src/main/java/org/truffleruby/interop/RubyToForeignNode.java", {st_mode=S_IFREG|0644, st_size=1368, ...}) = 0
getdents64(7, 0x559ac7ad1780 /* 0 entries */, 32768) = 0
close(7) = 0
So CRuby seems to use the trick I described in https://github.com/oracle/truffleruby/issues/2092#issuecomment-841286446, and that's probably a significant gain.
Probably we need some extra private method to do something like Dir#read
+ return whether it's a directory/the file type.
We could also add a variant for efficiency which also automatically skip .
and ..
where that's the correct semantics.
I've taken a bit of a look at this, and we'll need an alternative way to read directory entries (we currently return the name from libtruffleposix
when we really want the directory entry structs. I think we're also opening and reading every directory twice, once to handle the recursive * part of the pattern, and then a second time to perform the leaf name matching `` part.
The cases where we do need to stat entries can probably be sped up a little by using fstataf
and passing in the fd of the directory, but we'd need to test that to see how much of a difference that makes.
Going between Ruby and POSIX calls means that they're doing a lot of overhead to allocate structures and function calls just to read fields and things. Maybe a richer function in libtruffleposix could do most of the globbing work staying in C?
Possibly, but I'm not sure I want to be the one to write that implementation. :-) I'll have a think about what we could do to divide the work up in a smarter way, and I'll try benchmarking fstatat and see if that would make a difference, I think it might be sensible to switch anyway because we likely have some horrible race conditions if anybody changes current working directory in another thread while we're in the middle of globbing.
Is this a case where we can copy and paste from MRI?
Have you looked at the MRI implementation? Also, they definitely suffer from the race condition problem (because although they use fstatat they use AT_FDCWD
as the base, so they are always working relative to the process's current working directory). Maybe the don't release the GVL during a glob and therefore know Ruby code can't change the cwd? I haven't checked that but yet.
This seems to be the inner-most function, for reference.
https://github.com/ruby/ruby/blob/86b4c2fc3fb0ec815e212628f6708c5b777f523e/dir.c#L2255-L2599
Quite a lot of helpers around this.
Okay, I've got a test implementation for using readdir_r
and filling in a dirent
structure to get the type along with the directory entry, bu on my machine it only helps a little bit as XFS
returns DT_UNKNOWN
for regular files so I have to stat them anyway, but it does tell me that directories are DT_DIR
s so I does save some calls to lstat. On other file systems this won't be necessary (btrfs, ext3, and ext4 return the type information for every directory entry, I don't know about zfs or Apple's apfs). Overall this doesn't make a huge difference on my machine, but might help considerably on other systems.
Using fstatat
makes a small difference, and is probably safer than lstat.
A bit more work to eliminate reading directories twice has more of an effect, getting us up to about 65% of MRI's performance.
Disabling sorting of the glob results speeds everything up by about 20% so might be worth doing. MRI doesn't sort, and the original justification was that Apple did sort and other platforms didn't. That was true for HFS+ volumes, but isn't true for APFS volumes, and I'm not sure if it's true for ISO volumes either.
There is still some more work we can do. We're ultimately using File.fnmatch
to check for leaf matches, but this function can handle the full glob syntax, is behind a TruffleBoundary
, and is interpreted on every call. We also spend a quite a bit of time checking paths etc. when creating directory objects, We should also work to collapse successive **
patterns etc.
@aardvark179 I have a ZFS on Linux system running if you need me to test/verify anything.
We have to stat anything that is returned with a DT_UNKNOWN
type, and having dug through file system documentation I now understand why I saw a different number of stat calls to others.
This is faster but still not perfect.
ruby 2.7.3p183 (2021-04-05 revision 6847ee089d) [x86_64-darwin20]
Warming up --------------------------------------
Dir#[] 19.000 i/100ms
Calculating -------------------------------------
Dir#[] 206.360 (± 9.7%) i/s - 1.026k in 5.019304s
truffleruby 21.2.0-dev-a50c8e52, like ruby 2.7.3, GraalVM CE JVM [x86_64-darwin]
Warming up --------------------------------------
Dir#[] 12.000 i/100ms
Calculating -------------------------------------
Dir#[] 175.075 (± 9.7%) i/s - 876.000 in 5.052381s
Is that what you saw as well, Duncan?
I've taken another look at this, and it looks like most of the remaining difference is from sorting the array. I've just done a quick micro benchmark of sorting large string arrays, and MRI is over twice as fast, so it's probably worth taking a look at that.
Improving the string comparison used in Array#sort!
gets us closer to MRI's performance, and we're within 10% if sorting is disabled, but sorting is still more expensive on TruffleRuby. That's mostly because we sort all the entries once they have been matched, but MRI is doing the sort as it goes. I've tried refactoring our code to do something similar, but it cost us as much as we gained, so I'd need to do some additional work.
Even when hot (confirmed hot by Thermometer), Dir[] is almost 4x slower than MRI. Code snippet below was run in this directory, and MRI and Truffleruby output below.