Closed splattael closed 1 year ago
I'm not opposed to this change, but:
Keeping in mind my comment that this may or may not be a particularly realistic test, I'm getting this on my MacBook Air (M2):
```
Warming up --------------------------------------
        counter main    26.202k i/100ms
Calculating -------------------------------------
        counter main    261.357k (± 0.4%) i/s - 1.310M in 5.012766s

Comparison:
    counter perf-inc:   265807.5 i/s
        counter main:   261356.7 i/s - 1.02x slower

Warming up --------------------------------------
          gauge main    37.286k i/100ms
Calculating -------------------------------------
          gauge main    336.417k (±10.3%) i/s - 1.678M in 5.040241s

Comparison:
          gauge main:   336417.0 i/s
      gauge perf-inc:   320784.3 i/s - same-ish: difference falls within error

Warming up --------------------------------------
      histogram main     9.862k i/100ms
Calculating -------------------------------------
      histogram main    98.633k (± 1.2%) i/s - 493.100k in 5.000117s

Comparison:
  histogram perf-inc:   101441.5 i/s
      histogram main:    98633.1 i/s - 1.03x slower
```
So: 3% faster on histograms, 2% faster on counters, same on gauges. Second run: 2% on counters, same on gauges and histograms. On a third run, I got a freakishly high result for gauges (23%) 😳
I'm not sure how much to trust this benchmark... :/
Doing:

```ruby
st = internal_store
value = st.read_value(key)
st.write_value(key, value + by.to_f)
```

(so, sidestepping the double call to `pid`)
I get more or less the same result (so, either "same", or a 1 to 4% improvement in performance).
I'm going to try to see if I remember how my "heavy" benchmarking code worked, and see what results I get, but again, keeping in mind that testing on a MacBook is probably not super representative.
In the meantime, I'd recommend doing just the change to sidestep the double call to `internal_store`, which adds almost zero complexity, and doesn't hurt, really :)
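For what it's worth, the "cache the store in a local variable" change can be sketched with a toy store. All the names below are hypothetical stand-ins, not the gem's real API; the point is just to show why the refactor halves the store lookups per increment:

```ruby
# Toy stand-in for a metric: every internal_store call is where the
# real code would hit Process.pid, so we count how often it happens.
class ToyMetric
  attr_reader :store_calls

  def initialize
    @values = Hash.new(0.0)
    @store_calls = 0
  end

  def internal_store
    @store_calls += 1 # in the real gem, this lookup involves Process.pid
    @values
  end

  # Before: two internal_store calls per increment.
  def increment_double(key, by = 1)
    value = internal_store[key]
    internal_store[key] = value + by.to_f
  end

  # After: cache the store in a local, so only one call per increment.
  def increment_cached(key, by = 1)
    st = internal_store
    value = st[key]
    st[key] = value + by.to_f
  end
end
```

Both methods produce the same values; the cached version just does half the lookups.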
Ok, I've changed my mind :)
I've run the datastores benchmarks, on a somewhat random m6i.2xlarge on EC2 that I have at hand, and they seem to be consistently faster. About 6% indeed.
Doing the caching of the variable as I was suggesting is also faster than baseline, but only like 2% (ish)
These seem pretty repeatable, and they consistently give me those results.
So I'm convinced: we should merge this.
@Sinjo, any objections to this?
Also, sorry, what I should've started with.
Thank you for looking into this and for the PR @splattael! :D
@dmagliola Thanks a ton for reviewing and merging :bow:
You are right regarding benchmarks - they were flaky for me locally too (on Linux), but in the end they showed an overall speed improvement after several runs. This improvement makes sense, not only because of side-stepping `Process.pid` but also because there are fewer calls overall.
BTW, I am planning to reduce the number of `getpid` syscalls (via `Process.pid`) by setting the PID on startup and overriding `Kernel.fork` (or `Process._fork` in Ruby 3.1+). See https://bugs.ruby-lang.org/issues/5446 and https://bugs.ruby-lang.org/issues/17795 for more context.
:wave: @SuperQ :heart:
> In the meantime, I'd recommend doing just the change to sidestep the double call to `internal_store`, which adds almost zero complexity, and doesn't hurt, really :)
I had this very refactoring ready in another branch but saw less speed improvements. :sweat_smile:
I should have provided more context before, sorry about that :bow:
No worries, and thanks for the PR! I'm going to wait on @Sinjo to confirm he's cool with this, but this should go out in the next release :)
> I am planning to reduce the number of `getpid` syscalls (via `Process.pid`) by setting the PID on startup and overriding `Kernel.fork` (or `Process._fork` in Ruby 3.1+).
I may be misreading the docs here, but is this a method intended to be monkeypatched? 😅
That makes me a bit uncomfortable, but it seems like it's designed precisely to do that... It's unfortunate the underscore in the name makes it hard to google for examples, all I can find is the test case where it gets introduced...
So I guess we could do that `prepend` in the constructor of `DirectFileStore`.
Would that look like:

```ruby
def _fork
  super
  DirectFileStore.process_pid = Process.pid
end
```
Am I understanding that correctly?
@dmagliola Yes, that is the way I understood it as well :+1:
It seems that Rails, for example, already does exactly this for Ruby 3.1+. See https://github.com/rails/rails/blob/v7.0.2/activesupport/lib/active_support/fork_tracker.rb#L50
Ah, that's amazing! You found exactly the kind of example I was hoping for! 🙌 That sounds like a sensible thing to do, and I'm looking forward to the PR!
NOTE: I would do "half" of what Rails is doing. I'd prepend if `_fork` is defined, but if not (if Ruby < 3.1), I'd leave it alone and continue checking for `Process.pid` on every increment. So you only get the ~4% perf boost if you're on a modern Ruby version.
The logic here being that `_fork` is intended for you to do this, but the other 3 aren't. I'm open-minded about this. The argument "look, Rails does it and here's exactly how to do it safely" is pretty compelling; I just think we should limit how much we pollute the global namespace with "surprising" stuff.
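For reference, the "prepend only when `_fork` exists" approach could be sketched roughly like this. The `DirectFileStore` below is a minimal stand-in for illustration, not the gem's real class, and the hook module name is made up:

```ruby
# Stand-in for the real store class: it caches the PID once at load time.
class DirectFileStore
  class << self
    attr_accessor :process_pid
  end
  self.process_pid = Process.pid
end

# Process._fork (Ruby 3.1+) is explicitly designed to be overridden via
# prepend. It runs in both the parent and the child right after fork,
# so both sides end up with a fresh cached PID.
module PidCacheForkHook
  def _fork
    child_pid = super
    DirectFileStore.process_pid = Process.pid
    child_pid # preserve super's return value: child PID in parent, 0 in child
  end
end

# Only hook on Rubies that expose _fork; older Rubies keep checking
# Process.pid on every increment, as discussed above.
Process.singleton_class.prepend(PidCacheForkHook) if Process.respond_to?(:_fork)
```

This mirrors the guarded half of Rails' `ActiveSupport::ForkTracker`: no monkey patching of `Kernel.fork` itself on older Rubies, just the hook Ruby provides for exactly this purpose.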
Thank you @splattael for the perf improvement, and looking forward to future PRs! 🙌
@dmagliola @Sinjo I agree! We should use `_fork` automatically if available :+1:
I wonder, though, if we could provide a way for users to opt in to using cached process IDs (via :monkey: patching `Process.fork`)? :thinking:
So: use the `_fork` hook. WDYT?
`FileMappedDict#increment_value` (`Process.pid` is slow on Linux)

Benchmarks

TL;DR: Speed up counter and histogram by ~6-7%.