Closed splattael closed 1 year ago
I'm not opposed to this change, but:
Keeping in mind my comment that this may or may not be a particularly realistic test, I'm getting this on my MacBook Air (M2):
```
Warming up --------------------------------------
        counter main    26.202k i/100ms
Calculating -------------------------------------
        counter main    261.357k (± 0.4%) i/s - 1.310M in 5.012766s

Comparison:
    counter perf-inc:   265807.5 i/s
        counter main:   261356.7 i/s - 1.02x slower

Warming up --------------------------------------
          gauge main    37.286k i/100ms
Calculating -------------------------------------
          gauge main    336.417k (±10.3%) i/s - 1.678M in 5.040241s

Comparison:
          gauge main:   336417.0 i/s
      gauge perf-inc:   320784.3 i/s - same-ish: difference falls within error

Warming up --------------------------------------
      histogram main     9.862k i/100ms
Calculating -------------------------------------
      histogram main    98.633k (± 1.2%) i/s - 493.100k in 5.000117s

Comparison:
  histogram perf-inc:   101441.5 i/s
      histogram main:    98633.1 i/s - 1.03x slower
```
So: 3% faster on histograms, 2% faster on counters, same on gauges. Second run: 2% on counters, same on gauges and histograms. On a third run, I got a freakishly high result for gauges (23%) 😳
I'm not sure how much to trust this benchmark... :/
Doing:

```ruby
st = internal_store
value = st.read_value(key)
st.write_value(key, value + by.to_f)
```

(so, sidestepping the double call to `pid`)
I get more or less the same result (so, either "same", or a 1 to 4% improvement in performance).
I'm going to try to see if I remember how my "heavy" benchmarking code worked, and see what results I get, but again, keeping in mind that testing on a MacBook is probably not super representative.
In the meantime, I'd recommend doing just the change to sidestep the double call to `internal_store`, which adds almost zero complexity, and doesn't hurt, really :)
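For what it's worth, the "cache the store in a local variable" change can be sketched with a toy store. All the names below are hypothetical stand-ins, not the gem's real API; the point is just to show why the refactor halves the store lookups per increment:

```ruby
# Toy stand-in for a metric: every internal_store call is where the
# real code would hit Process.pid, so we count how often it happens.
class ToyMetric
  attr_reader :store_calls

  def initialize
    @values = Hash.new(0.0)
    @store_calls = 0
  end

  def internal_store
    @store_calls += 1 # in the real gem, this lookup involves Process.pid
    @values
  end

  # Before: two internal_store calls per increment.
  def increment_double(key, by = 1)
    value = internal_store[key]
    internal_store[key] = value + by.to_f
  end

  # After: cache the store in a local, so only one call per increment.
  def increment_cached(key, by = 1)
    st = internal_store
    value = st[key]
    st[key] = value + by.to_f
  end
end
```

Both methods produce the same values; the cached version just does half the lookups.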
Ok, I've changed my mind :)
I've run the datastores benchmarks, on a somewhat random m6i.2xlarge on EC2 that I have at hand, and they seem to be consistently faster. About 6% indeed.
Doing the caching of the variable as I was suggesting is also faster than baseline, but only like 2% (ish)
These seem pretty repeatable, and they consistently give me those results.
So I'm convinced: we should merge this.
@Sinjo, any objections to this?
Also, sorry, what I should've started with.
Thank you for looking into this and for the PR @splattael! :D
@dmagliola Thanks a ton for reviewing and merging :bow:
You are right regarding benchmarks - they were flaky for me locally too (on Linux), but in the end they showed an overall speed improvement after several runs. This improvement makes sense, not only because of side-stepping `Process.pid` but also because there are fewer calls overall.
BTW, I am planning to reduce the number of `getpid` syscalls (via `Process.pid`) by setting the PID on startup and overriding `Kernel.fork` (or `Process._fork` in Ruby 3.1+). See https://bugs.ruby-lang.org/issues/5446 and https://bugs.ruby-lang.org/issues/17795 for more context.
:wave: @SuperQ :heart:
> In the meantime, I'd recommend doing just the change to sidestep the double call to `internal_store`, which adds almost zero complexity, and doesn't hurt, really :)
I had this very refactoring ready in another branch but saw less speed improvements. :sweat_smile:
I should have provided more context before, sorry about that :bow:
No worries, and thanks for the PR! I'm going to wait on @Sinjo to confirm he's cool with this, but this should go out in the next release :)
> I am planning to reduce the number of `getpid` syscalls (via `Process.pid`) by setting the PID on startup and overriding `Kernel.fork` (or `Process._fork` in Ruby 3.1+).
I may be misreading the docs here, but is this a method intended to be monkeypatched? 😅
That makes me a bit uncomfortable, but it seems like it's designed precisely to do that... It's unfortunate the underscore in the name makes it hard to google for examples, all I can find is the test case where it gets introduced...
So I guess we could do that `prepend` in the constructor of `DirectFileStore`.
Would that look like:

```ruby
def _fork
  super
  DirectFileStore.process_pid = Process.pid
end
```
Am I understanding that correctly?
@dmagliola Yes, that is the way I understood it as well :+1:
It seems that Rails, for example, already does exactly this for Ruby 3.1+. See https://github.com/rails/rails/blob/v7.0.2/activesupport/lib/active_support/fork_tracker.rb#L50
Ah, that's amazing! You found exactly the kind of example I was hoping for! 🙌 That sounds like a sensible thing to do, and I'm looking forward to the PR!
NOTE: I would do "half" of what Rails is doing. I'd prepend if `_fork` is defined, but if not (if Ruby < 3.1), I'd leave it alone and continue checking for `Process.pid` on every increment. So you only get the ~4% perf boost if you're on a modern Ruby version.
The logic here being that `_fork` is intended for you to do this, but the other 3 aren't. I'm open-minded about this. The argument "look, Rails does it and here's exactly how to do it safely" is pretty compelling; I just think we should limit how much we pollute the global namespace with "surprising" stuff.
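For reference, the "prepend only when `_fork` exists" approach could be sketched roughly like this. The `DirectFileStore` below is a minimal stand-in for illustration, not the gem's real class, and the hook module name is made up:

```ruby
# Stand-in for the real store class: it caches the PID once at load time.
class DirectFileStore
  class << self
    attr_accessor :process_pid
  end
  self.process_pid = Process.pid
end

# Process._fork (Ruby 3.1+) is explicitly designed to be overridden via
# prepend. It runs in both the parent and the child right after fork,
# so both sides end up with a fresh cached PID.
module PidCacheForkHook
  def _fork
    child_pid = super
    DirectFileStore.process_pid = Process.pid
    child_pid # preserve super's return value: child PID in parent, 0 in child
  end
end

# Only hook on Rubies that expose _fork; older Rubies keep checking
# Process.pid on every increment, as discussed above.
Process.singleton_class.prepend(PidCacheForkHook) if Process.respond_to?(:_fork)
```

This mirrors the guarded half of Rails' `ActiveSupport::ForkTracker`: no monkey patching of `Kernel.fork` itself on older Rubies, just the hook Ruby provides for exactly this purpose.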
Thank you @splattael for the perf improvement, and looking forward to future PRs! 🙌
@dmagliola @Sinjo I agree! We should use `_fork` automatically if available :+1:
I wonder, though, if we could provide a way for users to opt in to using cached process IDs (via :monkey: patching `Process.fork`)? :thinking:
So: use the `_fork` hook. WDYT?
`FileMappedDict#increment_value` (`Process.pid` is slow on Linux)

Benchmarks

TL;DR: Speed up counter and histogram by ~6-7%.