oracle / truffleruby

A high performance implementation of the Ruby programming language, built on GraalVM.
https://www.graalvm.org/ruby/
Other
3.02k stars 185 forks source link

Dir[] slow, hot & cold #2092

Open LillianZ opened 4 years ago

LillianZ commented 4 years ago

Even when hot (confirmed hot by Thermometer), Dir[] is almost 4x slower than MRI. Code snippet below was run in this directory, and MRI and Truffleruby output below.

Cold Hot
MRI 0.003747s 304.699 ips
Truffleruby 0.122712s 82.758 ips
require 'benchmark'
require 'benchmark/ips'
puts  Benchmark.measure {Dir["src/main/java/**/*"]}

Benchmark.ips do |x|
  x.config(:time => 5, :warmup => 10)
  x.report('Dir#[]') {Dir["src/main/java/**/*"]}
end

puts Benchmark.measure {Dir["src/main/java/**/*"]}
ruby 2.7.1p83 (2020-03-31 revision a0c7c23c9c) [x86_64-darwin19]
  0.000946   0.002803   0.003749 (  0.003747)
Warming up --------------------------------------
          Dir#[]    31.000  i/100ms
Calculating -------------------------------------
          Dir#[]    304.699  (± 5.6%) i/s -      1.519k in   5.000986s
  0.001018   0.002456   0.003474 (  0.003476)
truffleruby (Shopify) 20.3.0-dev-151cfcb0, like ruby 2.6.6, GraalVM CE JVM [x86_64-darwin]
  0.733384   0.017023   0.750407 (  0.122712)
Warming up --------------------------------------
          Dir#[]     6.000  i/100ms
Calculating -------------------------------------
          Dir#[]     82.758  (± 7.3%) i/s -    414.000  in   5.027133s
  0.047519   0.009553   0.057072 (  0.015891)
wildmaples commented 3 years ago

There's a middleware on SFR using Dir#[]. The microbenchmark shows that it's ~1.6x slower than MRI.

eregon commented 3 years ago

I can reproduce the difference locally on Linux:

ruby 2.7.3p183 (2021-04-05 revision 6847ee089d) [x86_64-linux]
  0.000000   0.000860   0.000860 (  0.000862)
Warming up --------------------------------------
              Dir#[]   126.000  i/100ms
Calculating -------------------------------------
              Dir#[]      1.288k (± 1.1%) i/s -      6.552k in   5.089233s
  0.000000   0.000807   0.000807 (  0.000809)

truffleruby 21.2.0-dev-a28e3da1, like ruby 2.7.3, GraalVM CE Native [x86_64-linux]
  0.374936   0.016222   0.391158 (  0.063269)
Warming up --------------------------------------
              Dir#[]    20.000  i/100ms
Calculating -------------------------------------
              Dir#[]    206.467  (±16.5%) i/s -      1.000k in   5.017073s
  0.031048   0.002053   0.033101 (  0.004951)
eregon commented 3 years ago
$ ruby --cpusampler --cpusampler.Mode=roots bench_dir_glob.rb
----------------------------------------------------------------------------------------------------------------------------------------------------
Sampling Histogram. Recorded 12653 samples with period 1ms.
  Self Time: Time spent on the top of the stack.
  Total Time: Time spent somewhere on the stack.
  Opt %: Percent of time spent in compiled and therefore non-interpreted code.
----------------------------------------------------------------------------------------------------------------------------------------------------
 Thread: Thread[main,5,main]
 Name                                                         |      Total Time     |  Opt % ||       Self Time     |  Opt % | Location             
----------------------------------------------------------------------------------------------------------------------------------------------------
 Truffle::POSIX.truffleposix_readdir                          |       4584ms  36.2% |  99.4% ||       2405ms  19.0% |  99.3% | resource:/truffleruby/core/posix.rb~134-168:3839-5070 
 Truffle::POSIX.truffleposix_lstat_mode                       |       3111ms  24.6% |  99.5% ||       2386ms  18.9% |  99.5% | resource:/truffleruby/core/posix.rb~134-168:3839-5070 
 FFI::Pointer#get_string                                      |       1682ms  13.3% |  99.8% ||       1679ms  13.3% |  99.8% | resource:/truffleruby/core/truffle/ffi/pointer.rb~144-146:4775-4899 
 String#<=>                                                   |       1532ms  12.1% |  99.5% ||       1532ms  12.1% |  99.5% | resource:/truffleruby/core/string.rb~1525-1547:40019-40731 
 Truffle::POSIX.opendir                                       |        665ms   5.3% |  97.7% ||        607ms   4.8% |  98.5% | resource:/truffleruby/core/posix.rb~134-168:3839-5070 
 block (2 levels) in POSIX.attach_function_eagerly            |        578ms   4.6% |  98.6% ||        565ms   4.5% |  98.8% | resource:/truffleruby/core/posix.rb~135-138:3897-4131 
 Array#bottom_up_merge                                        |       1056ms   8.3% |  98.2% ||        466ms   3.7% |  96.8% | resource:/truffleruby/core/array.rb~1365-1384:31295-31780 
 Symbol#==                                                    |        359ms   2.8% | 100.0% ||        358ms   2.8% | 100.0% | (core)~1:0 
 Dir::Glob::RecursiveDirectories#call                         |      10310ms  81.5% |  96.4% ||        344ms   2.7% |  74.7% | resource:/truffleruby/core/dir_glob.rb~93-127:3143-4310 
 Truffle::POSIX.closedir                                      |        361ms   2.9% |  99.4% ||        340ms   2.7% |  99.4% | resource:/truffleruby/core/posix.rb~134-168:3839-5070 
 FFI::Pointer#read_string_to_null                             |       1943ms  15.4% |  99.8% ||        259ms   2.0% |  99.6% | resource:/truffleruby/core/truffle/ffi/pointer_extra.rb~85-87:3357-3456 
 Truffle::POSIX.truffleposix_stat_mode                        |        257ms   2.0% |  98.1% ||        211ms   1.7% |  97.6% | resource:/truffleruby/core/posix.rb~134-168:3839-5070 
 Array#isort!                                                 |       1150ms   9.1% |  99.2% ||        208ms   1.6% |  97.6% | resource:/truffleruby/core/array.rb~1452-1477:33316-33746 
 Dir::Glob::EntryMatch#call                                   |       3775ms  29.8% |  99.2% ||        172ms   1.4% |  98.3% | resource:/truffleruby/core/dir_glob.rb~215-231:6804-7322 
 File.fnmatch                                                 |        322ms   2.5% |  98.8% ||        121ms   1.0% |  98.3% | resource:/truffleruby/core/file.rb~668-681:21612-22069 
 Dir#read                                                     |       4714ms  37.3% |  99.3% ||         99ms   0.8% |  90.9% | resource:/truffleruby/core/dir.rb~99-114:3110-3527 
 Array#each                                                   |        614ms   4.9% |  99.3% ||         93ms   0.7% | 100.0% | (core)~1:0 
 Kernel#StringValue                                           |        106ms   0.8% | 100.0% ||         84ms   0.7% | 100.0% | resource:/truffleruby/core/kernel.rb~170-172:5816-5898 
 String#include?                                              |        176ms   1.4% | 100.0% ||         74ms   0.6% | 100.0% | resource:/truffleruby/core/string.rb~150-152:4643-4748 
 Dir::Glob.path_join                                          |         69ms   0.5% |  97.1% ||         69ms   0.5% |  97.1% | resource:/truffleruby/core/dir_glob.rb~337-345:10022-10221 
 Array#<<                                                     |         55ms   0.4% |  98.2% ||         55ms   0.4% |  98.2% | (core)~1:0 
 Dir::Glob::Match#match?                                      |        374ms   3.0% |  98.9% ||         54ms   0.4% | 100.0% | resource:/truffleruby/core/dir_glob.rb~185-187:6066-6136 
 Dir::Glob::Node#path_join                                    |        137ms   1.1% |  89.8% ||         45ms   0.4% |  75.6% | resource:/truffleruby/core/dir_glob.rb~51-53:2110-2203 
 Dir#initialize_internal                                      |        706ms   5.6% |  97.9% ||         33ms   0.3% |  90.9% | resource:/truffleruby/core/dir.rb~48-63:2166-2582 
 String#encoding                                              |         27ms   0.2% | 100.0% ||         27ms   0.2% | 100.0% | resource:/truffleruby/core/string.rb~1521-1523:39949-40016 
 Dir::Glob::Node#separator                                    |         23ms   0.2% |  95.7% ||         23ms   0.2% |  95.7% | resource:/truffleruby/core/dir_glob.rb~47-49:2053-2107 
 Truffle::Type.rb_convert_type                                |         22ms   0.2% | 100.0% ||         22ms   0.2% | 100.0% | resource:/truffleruby/core/type.rb~263-273:9204-9581 
 String#force_encoding                                        |         21ms   0.2% | 100.0% ||         21ms   0.2% | 100.0% | (core)~1:0 
 Dir#close                                                    |        382ms   3.0% |  98.7% ||         19ms   0.2% |  89.5% | resource:/truffleruby/core/dir.rb~116-122:3530-3664 

Quite some time spent in native calls, which is expected. Maybe we can group some of them (e.g., do we need Truffle::POSIX.truffleposix_lstat_mode?)

Something else we see in the profile is the cost of sorting the returned entries.

chrisseaton commented 3 years ago

I'm wondering if it really makes the same native calls as MRI - like I can imagine it accidentally being written to stat each component of every path each time, rather than stating a component once and then stating its children once etc. Or something else like that.

eregon commented 3 years ago

We could potentially compare with strace for system calls. One way to avoid the extra lstat (seems used to find out if a given entry is a directory) would be to use readdir's d_type field, but:

               unsigned char  d_type;      /* Type of file; not supported
                                              by all filesystem types */

              Currently, only some filesystems (among them: Btrfs, ext2, ext3, and ext4) have full support for returning the file type in d_type.  All applications must properly handle a return of DT_UNKNOWN.

So we'd need a fallback to lstat. And know whether that works on macOS too.

bjfish commented 3 years ago

@eregon I suspect d_type works on macOS after seeing some straces from MRI. There appears to be 2 extra lstat in TruffleRuby for every file path currently. I believe this is due to not using d_type.

We also appear to have some duplicated calls to POSIX.opendir on some directories. On mac I captured this CPU sample:

 Name                                                              |      Total Time     |  Opt % ||       Self Time     |  Opt % | Location             
---------------------------------------------------------------------------------------------------------------------------------------------------------
 Truffle::POSIX.opendir                                            |       4327ms  43.8% |  96.4% ||       4285ms  43.4% |  96.8% | src/main/ruby/truffleruby/core/posix.rb~134-168:3839-5070 
 Truffle::POSIX.truffleposix_readdir                               |       2040ms  20.7% |  89.7% ||       1605ms  16.3% |  92.0% | src/main/ruby/truffleruby/core/posix.rb~134-168:3839-5070 
 Truffle::POSIX.truffleposix_lstat_mode                            |       1198ms  12.1% |  90.2% ||       1083ms  11.0% |  93.4% | src/main/ruby/truffleruby/core/posix.rb~134-168:3839-5070 
 Truffle::POSIX.closedir                                           |        473ms   4.8% |  91.8% ||        468ms   4.7% |  92.1% | src/main/ruby/truffleruby/core/posix.rb~134-168:3839-5070 
eregon commented 3 years ago

sample from strace output: CRuby:

openat(AT_FDCWD, "src/main/java/org/truffleruby/interop", O_RDONLY|O_CLOEXEC|O_DIRECTORY) = 8
fstat(8, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
fcntl(8, F_GETFL)                       = 0x18000 (flags O_RDONLY|O_LARGEFILE|O_DIRECTORY)
fcntl(8, F_SETFD, FD_CLOEXEC)           = 0
getdents64(8, 0xd9e450 /* 19 entries */, 32768) = 880
getdents64(8, 0xd9e450 /* 0 entries */, 32768) = 0
close(8)                                = 0

TruffleRuby:

openat(AT_FDCWD, "src/main/java/org/truffleruby/interop", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 7
fstat(7, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
getdents64(7, 0x559ac7ad1780 /* 19 entries */, 32768) = 880
lstat("src/main/java/org/truffleruby/interop/InteropNodes.java", {st_mode=S_IFREG|0644, st_size=117806, ...}) = 0
lstat("src/main/java/org/truffleruby/interop/BoxedValue.java", {st_mode=S_IFREG|0644, st_size=1187, ...}) = 0
lstat("src/main/java/org/truffleruby/interop/TranslateInteropExceptionNode.java", {st_mode=S_IFREG|0644, st_size=6267, ...}) = 0
lstat("src/main/java/org/truffleruby/interop/IsStringLikeNode.java", {st_mode=S_IFREG|0644, st_size=1589, ...}) = 0
lstat("src/main/java/org/truffleruby/interop/RubyToForeignArgumentsNode.java", {st_mode=S_IFREG|0644, st_size=2043, ...}) = 0
lstat("src/main/java/org/truffleruby/interop/SymbolToByteOrderNode.java", {st_mode=S_IFREG|0644, st_size=1793, ...}) = 0
lstat("src/main/java/org/truffleruby/interop/InteropManager.java", {st_mode=S_IFREG|0644, st_size=1093, ...}) = 0
lstat("src/main/java/org/truffleruby/interop/PolyglotNodes.java", {st_mode=S_IFREG|0644, st_size=7073, ...}) = 0
lstat("src/main/java/org/truffleruby/interop/TranslateInteropRubyExceptionNode.java", {st_mode=S_IFREG|0644, st_size=9477, ...}) = 0
lstat("src/main/java/org/truffleruby/interop/ToJavaStringWithDefaultNode.java", {st_mode=S_IFREG|0644, st_size=1404, ...}) = 0
lstat("src/main/java/org/truffleruby/interop/ToJavaStringNode.java", {st_mode=S_IFREG|0644, st_size=4141, ...}) = 0
lstat("src/main/java/org/truffleruby/interop/ForeignToRubyArgumentsNode.java", {st_mode=S_IFREG|0644, st_size=2418, ...}) = 0
lstat("src/main/java/org/truffleruby/interop/ForeignToRubyNode.java", {st_mode=S_IFREG|0644, st_size=1588, ...}) = 0
lstat("src/main/java/org/truffleruby/interop/FromJavaStringNode.java", {st_mode=S_IFREG|0644, st_size=2235, ...}) = 0
lstat("src/main/java/org/truffleruby/interop/ProxyForeignObject.java", {st_mode=S_IFREG|0644, st_size=3029, ...}) = 0
lstat("src/main/java/org/truffleruby/interop/OutgoingForeignCallNode.java", {st_mode=S_IFREG|0644, st_size=26132, ...}) = 0
lstat("src/main/java/org/truffleruby/interop/RubyToForeignNode.java", {st_mode=S_IFREG|0644, st_size=1368, ...}) = 0
getdents64(7, 0x559ac7ad1780 /* 0 entries */, 32768) = 0
close(7)                                = 0

So CRuby seems to use the trick I described in https://github.com/oracle/truffleruby/issues/2092#issuecomment-841286446, and that's probably a significant gain.

Probably we need some extra private method to do something like Dir#read + return whether it's a directory/the file type. We could also add a variant for efficiency which also automatically skip . and .. where that's the correct semantics.

aardvark179 commented 3 years ago

I've taken a bit of a look at this, and we'll need an alternative way to read directory entries (we currently return the name from libtruffleposix when we really want the directory entry structs. I think we're also opening and reading every directory twice, once to handle the recursive * part of the pattern, and then a second time to perform the leaf name matching `` part.

The cases where we do need to stat entries can probably be sped up a little by using fstataf and passing in the fd of the directory, but we'd need to test that to see how much of a difference that makes.

chrisseaton commented 3 years ago

Going between Ruby and POSIX calls means that they're doing a lot of overhead to allocate structures and function calls just to read fields and things. Maybe a richer function in libtruffleposix could do most of the globbing work staying in C?

aardvark179 commented 3 years ago

Possibly, but I'm not sure I want to be the one to write that implementation. :-) I'll have a think about what we could do to divide the work up in a smarter way, and I'll try benchmarking fstatat and see if that would make a difference, I think it might be sensible to switch anyway because we likely have some horrible race conditions if anybody changes current working directory in another thread while we're in the middle of globbing.

chrisseaton commented 3 years ago

Is this a case where we can copy and paste from MRI?

aardvark179 commented 3 years ago

Have you looked at the MRI implementation? Also, they definitely suffer from the race condition problem (because although they use fstatat they use AT_FDCWD as the base, so they are always working relative to the process's current working directory). Maybe the don't release the GVL during a glob and therefore know Ruby code can't change the cwd? I haven't checked that but yet.

chrisseaton commented 3 years ago

This seems to be the inner-most function, for reference.

https://github.com/ruby/ruby/blob/86b4c2fc3fb0ec815e212628f6708c5b777f523e/dir.c#L2255-L2599

Quite a lot of helpers around this.

aardvark179 commented 3 years ago

Okay, I've got a test implementation for using readdir_r and filling in a dirent structure to get the type along with the directory entry, bu on my machine it only helps a little bit as XFS returns DT_UNKNOWN for regular files so I have to stat them anyway, but it does tell me that directories are DT_DIRs so I does save some calls to lstat. On other file systems this won't be necessary (btrfs, ext3, and ext4 return the type information for every directory entry, I don't know about zfs or Apple's apfs). Overall this doesn't make a huge difference on my machine, but might help considerably on other systems.

Using fstatat makes a small difference, and is probably safer than lstat.

A bit more work to eliminate reading directories twice has more of an effect, getting us up to about 65% of MRI's performance.

Disabling sorting of the glob results speeds everything up by about 20% so might be worth doing. MRI doesn't sort, and the original justification was that Apple did sort and other platforms didn't. That was true for HFS+ volumes, but isn't true for APFS volumes, and I'm not sure if it's true for ISO volumes either.

There is still some more work we can do. We're ultimately using File.fnmatch to check for leaf matches, but this function can handle the full glob syntax, is behind a TruffleBoundary, and is interpreted on every call. We also spend a quite a bit of time checking paths etc. when creating directory objects, We should also work to collapse successive ** patterns etc.

nirvdrum commented 3 years ago

@aardvark179 I have a ZFS on Linux system running if you need me to test/verify anything.

aardvark179 commented 3 years ago

We have to stat anything that is returned with a DT_UNKNOWN type, and having dug through file system documentation I now understand why I saw a different number of stat calls to others.

chrisseaton commented 3 years ago

This is faster but still not perfect.

ruby 2.7.3p183 (2021-04-05 revision 6847ee089d) [x86_64-darwin20]
Warming up --------------------------------------
              Dir#[]    19.000  i/100ms
Calculating -------------------------------------
              Dir#[]    206.360  (± 9.7%) i/s -      1.026k in   5.019304s

truffleruby 21.2.0-dev-a50c8e52, like ruby 2.7.3, GraalVM CE JVM [x86_64-darwin]
Warming up --------------------------------------
              Dir#[]    12.000  i/100ms
Calculating -------------------------------------
              Dir#[]    175.075  (± 9.7%) i/s -    876.000  in   5.052381s

Is that what you saw as well, Duncan?

aardvark179 commented 2 years ago

I've taken another look at this, and it looks like most of the remaining difference is from sorting the array. I've just done a quick micro benchmark of sorting large string arrays, and MRI is over twice as fast, so it's probably worth taking a look at that.

aardvark179 commented 2 years ago

Improving the string comparison used in Array#sort! gets us closer to MRI's performance, and we're within 10% if sorting is disabled, but sorting is still more expensive on TruffleRuby. That's mostly because we sort all the entries once they have been matched, but MRI is doing the sort as it goes. I've tried refactoring our code to do something similar, but it cost us as much as we gained, so I'd need to do some additional work.