prometheus / client_ruby

Prometheus instrumentation library for Ruby applications
Apache License 2.0

Support pre-fork servers #9

Closed atombender closed 5 years ago

atombender commented 9 years ago

If you use this gem with a multi-process Rack server such as Unicorn, surely each worker will be returning just a percentage of the correct results (e.g., number of requests served, total time), thus making the exposed metrics fairly meaningless?

To solve this, the easiest solution is to create a block of shared memory in the master process that all workers share, instead of using instance variables.
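
For illustration, here is a minimal sketch of the setup being described: a plain Rack app instrumented with the collector/exporter middlewares that come up later in this thread, served by Unicorn. This is not code from the issue; the middleware names are taken from later comments and the exact API varies between client_ruby versions.

# config.ru -- minimal sketch of the problematic setup (assumed APIs)
require 'prometheus/middleware/collector'
require 'prometheus/middleware/exporter'

use Prometheus::Middleware::Collector   # records request counts/durations in this process only
use Prometheus::Middleware::Exporter    # serves /metrics from the same in-process registry

run ->(env) { [200, { 'Content-Type' => 'text/plain' }, ['OK']] }

# Started with several Unicorn workers (e.g. `unicorn -c unicorn.conf.rb config.ru`),
# each worker keeps its own in-memory counters, so a scrape of /metrics only
# reflects the requests handled by whichever worker happened to answer it.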

brian-brazil commented 7 years ago

The push gateway is not appropriate for this use case at all. The statsd exporter would be the right architecture for that.

christoph-buente commented 7 years ago

Thanks @brian-brazil, will have a look into that.

mcfilib commented 7 years ago

@SamSaffron did you ever get round to looking into writing that exporter based off raindrops?

SamSaffron commented 7 years ago

@filib I did, but it's running on internal code. It's a tiny amount of code actually:

something along the lines of:

require 'raindrops'

net_stats = Raindrops::Linux.tcp_listener_stats("0.0.0.0:3000")["0.0.0.0:3000"]
if net_stats
  puts "web_active:#{net_stats.active}"
  puts "web_queued:#{net_stats.queued}"
end

Then we also have monitors that alert us if we are sitting at sustained queued for longer than a minute or so.

lyda commented 7 years ago

Note that @juliusv has pushed a branch that solves this. I think it needs more work to support all metric types.

grobie commented 7 years ago

Another note: I'll work on a complete rewrite of client_ruby, likely to start in the next few weeks. It will address several of the library's architectural shortcomings and include the option to change the value handling (lock-free for single-process programs, multi-threaded, multi-process). client_ruby is the oldest client library still in its original state; we've learned a lot in the meantime, and similar rewrites have already been done in client_golang and client_java.

lyda commented 7 years ago

When doing this, note that the exchangeable-value-types branch does multiprocess but has some flaws:

  • Not all metric types are supported. Specifically the summary type isn't supported.
  • This means that the Collector for rack isn't supported either.

It'd be nice to have fixes for that as well.

grobie commented 7 years ago

Please note that v0.7.0.pre.rc.1 has been available for some time with a rewrite of the Collector middleware, which uses Histograms instead of Summaries.

brian-brazil commented 7 years ago

What doesn't work about Summaries? From a stats standpoint, it should be possible to get something out of them.

grobie commented 7 years ago

How do you aggregate per-process quantiles?

juliusv commented 7 years ago

This is the branch where I put up (really hacky and experimental!) multiprocess support. It's in large part a transliteration of the Python client's multiprocess support, with tests hacked up to "pass". It's alpha-usable, but I'm hoping for @grobie's rewrite to do this all properly :) See https://github.com/prometheus/client_ruby/blob/multiprocess/example.rb

SamSaffron commented 7 years ago

wanted to share some pics so you can see what you get with raindrops:

[screenshot: requests per app server]

Reqs are very evenly distributed between the 5 boxes (which is to be expected with haproxy) - we run about 8 workers per machine, everything is counted.

We see some queueing when there are spikes, which indicates we may need an extra process or two here.

[screenshot: queued requests during spikes]

brian-brazil commented 7 years ago

How do you aggregate per-process quantiles?

You can't. But you can report them all.

The _sum and _count work fine in any case.
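
To make that concrete, here is what the aggregation looks like on the Prometheus side. The metric name is hypothetical and this PromQL is my illustration, not something from this thread: per-process quantile series from a summary cannot be meaningfully combined, but the _sum and _count series can be summed across processes (and instances) to recover an average.

# Average request duration across all processes, from a hypothetical
# summary metric http_request_duration_seconds:
  sum(rate(http_request_duration_seconds_sum[5m]))
/
  sum(rate(http_request_duration_seconds_count[5m]))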

lyda commented 7 years ago

We've released this (https://gitlab.com/gitlab-org/prometheus-client-mmap/commits/exchangeable-value-types), which is what we're using for GitLab. Hopefully it will be of some use for @grobie. It's multiproc mode only, so not ideal for the general case, but hopefully it will be of use for future work.

grobie commented 7 years ago

Thanks!

mrgordon commented 7 years ago

I'm guessing there is no progress on this, which is sad, as it basically makes metrics collection useless in any Rails environment with a forking web server.

I'm trying to switch to the commits that @juliusv and @lyda linked but the examples don't show how to use their changes with the Rack collector / exporter.

Is there anything needed other than:

Prometheus::Client::MmapedValue.set_pid(1)
use Prometheus::Middleware::Collector
use Prometheus::Middleware::Exporter

I tried doing that but hit errors where it was trying to call total on a hash at lib/prometheus/client/formats/text.rb:171:in 'histogram'

Full trace:

app error: undefined method `total' for #<Hash:0x0055d849b5d520> (NoMethodError)
/usr/local/bundle/bundler/gems/client_ruby-b4adef5a39fd/lib/prometheus/client/formats/text.rb:171:in `histogram'
/usr/local/bundle/bundler/gems/client_ruby-b4adef5a39fd/lib/prometheus/client/formats/text.rb:150:in `representation'
/usr/local/bundle/bundler/gems/client_ruby-b4adef5a39fd/lib/prometheus/client/formats/text.rb:33:in `block (2 levels) in marshal'
/usr/local/bundle/bundler/gems/client_ruby-b4adef5a39fd/lib/prometheus/client/formats/text.rb:32:in `each'
/usr/local/bundle/bundler/gems/client_ruby-b4adef5a39fd/lib/prometheus/client/formats/text.rb:32:in `block in marshal'
/usr/local/bundle/bundler/gems/client_ruby-b4adef5a39fd/lib/prometheus/client/formats/text.rb:28:in `each'
/usr/local/bundle/bundler/gems/client_ruby-b4adef5a39fd/lib/prometheus/client/formats/text.rb:28:in `marshal'
/usr/local/bundle/bundler/gems/client_ruby-b4adef5a39fd/lib/prometheus/middleware/exporter.rb:69:in `respond_with'
/usr/local/bundle/bundler/gems/client_ruby-b4adef5a39fd/lib/prometheus/middleware/exporter.rb:30:in `call'
/usr/local/bundle/bundler/gems/client_ruby-b4adef5a39fd/lib/prometheus/middleware/collector.rb:37:in `block in call'
/usr/local/bundle/bundler/gems/client_ruby-b4adef5a39fd/lib/prometheus/middleware/collector.rb:77:in `trace'
/usr/local/bundle/bundler/gems/client_ruby-b4adef5a39fd/lib/prometheus/middleware/collector.rb:37:in `call'
/usr/local/bundle/gems/rack-1.4.7/lib/rack/content_length.rb:14:in `call'
/usr/local/bundle/gems/railties-3.2.22.5/lib/rails/rack/log_tailer.rb:17:in `call'
/usr/local/bundle/gems/unicorn-4.8.2/lib/unicorn/http_server.rb:572:in `process_client'

I reverted some value.get.total calls to value.total and the app loads now, but it looks like the metrics are still per-fork: if I hit /metrics repeatedly I get different values.

https://github.com/mrgordon/client_ruby/commit/68ca030ee903a312627b451674c0562727f6fce4

I'm guessing due to my changes I also get some metrics back like this:

http_server_request_duration_seconds_count{method="get",path="/metrics"} #<Prometheus::Client::SimpleValue:0x005555afd28278>

I'll look more into that issue but I'm guessing the SimpleValue indicates the mmap stuff isn't working.

mrgordon commented 7 years ago

Quick update: I added prometheus_multiproc_dir=/tmp to the environment, as I see the code checks for this even though none of the examples set it. My output from /metrics now shows mmap values, but http_server_requests_total still varies depending on which Unicorn worker I talk to.

e.g.

$ curl http://myapp/metrics
http_server_requests_total{code="200",method="get",path="/metrics"} 4.0
...
http_server_request_duration_seconds_bucket{method="get",path="/metrics",le="10"} 4.0
http_server_request_duration_seconds_bucket{method="get",path="/metrics",le="+Inf"} #<Prometheus::Client::MmapedValue:0x0055ba313dfc70>

a few seconds later

$ curl http://myapp/metrics
http_server_requests_total{code="200",method="get",path="/metrics"} 3.0
...
http_server_request_duration_seconds_bucket{method="get",path="/metrics",le="10"} 3.0
http_server_request_duration_seconds_bucket{method="get",path="/metrics",le="+Inf"} #<Prometheus::Client::MmapedValue:0x0055ba313da720>

lzap commented 6 years ago

What would be the best workaround for a Rails app running in Passenger? I am considering statsd_exporter and the pushgateway. I expect the latter to be easier to integrate with, but slower.

SamSaffron commented 6 years ago

Discourse uses https://github.com/discourse/discourse-prometheus which is fork-safe, but it would take quite a bit of wrangling to turn it into a reusable gem. The internal concepts are good, though, and this is deployed in production.

kvitajakub commented 6 years ago

Hi, is there any progress on this please?

Looks like it should have an issue or feature label in addition to the question label.

SamSaffron commented 6 years ago

I started my Discourse extraction here:

https://github.com/discourse/prometheus_exporter

It supports multiple processes with optional custom transport, recovery and chunked encoding. It should be released properly in a few weeks.

It works fine with Passenger, Unicorn and Puma in clustered mode. You can also aggregate Sidekiq metrics.
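
For anyone evaluating it, a rough usage sketch based on the gem's README (treat the exact class and method names as assumptions and check the current README): the gem runs a separate collector process that every web worker reports to over HTTP.

# Gemfile: gem 'prometheus_exporter'
# Start the collector process (listens on port 9394 by default):
#   bundle exec prometheus_exporter

# In a Rails initializer: report web metrics from every worker process.
require 'prometheus_exporter/middleware'
Rails.application.middleware.unshift PrometheusExporter::Middleware

# Custom metrics are sent to the collector over a persistent HTTP connection.
require 'prometheus_exporter/client'
queued = PrometheusExporter::Client.default.register(:gauge, 'web_queued', 'queued requests')
queued.observe(12)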

lzap commented 6 years ago

Thanks. How many metrics do you send for each request in production? I am a little bit concerned about that HTTP/JSON transport.

SamSaffron commented 6 years ago

@lzap I will update the readme. I can send about 10k strings over in 200ms over a single request; if you add a round trip of to_json and JSON.parse, it's around 600ms for 10k messages.

I persist the HTTP connection for 30 seconds; this is configurable.

SamSaffron commented 6 years ago

Just putting this out here, in case anyone is still facing the problem

https://samsaffron.com/archive/2018/02/02/instrumenting-rails-with-prometheus

fajpunk commented 6 years ago

Hi! I notice the Help wanted tag on this issue, but it's unclear from reading the comments exactly what help is wanted; all I see is a few proposed solutions and alternate libraries. Has a solution been chosen, and if so, is there a branch where work is being done? I'd like to help if I can.

dmagliola commented 5 years ago

An update on this issue: We (GoCardless) are currently working on this. In the process, we're also planning to bring the Prometheus Client more in sync with the current best practices as recommended here.

We've opened an issue with an RFC explaining what we're planning to do. And we have a PR with the first steps.

This PR doesn't solve everything, and specifically it doesn't deal with the multi-process issue, but it's the first step in allowing that to happen, by introducing "pluggable backends" where the metric data is stored, allowing consumers of the library to pick which is best for their needs, and to write their own if necessary.

We appreciate comments and feedback on these proposals and look forward to working with you on this!

dmagliola commented 5 years ago

Another update: we've updated PR #95 to (among a lot of other things) add a "backend" that does work with pre-fork servers, sharing data between processes on a scrape. We've also done plenty of benchmarking and performance improvements to it, and would love everyone's comments on it!

The new backend we're proposing is pretty simple. It basically uses files to keep the data, but thanks to the wonders of file system caching, we're not really blocking on disks, and it's actually surprisingly fast. A counter increment with this backend takes about 9μs. For context:

We still haven't abandoned the idea of using MMaps in the near future, but we've found some stability issues with that solution, whereas the File-based backend is very stable and reliable, so we're proposing to go with that for the time being.

NOTE: It's a big PR, it was a big refactoring, and there's a lot going on. However, the commits are coherent and have good descriptions, so we recommend reading it commit-by-commit rather than the full diff all at once, which will not be very parseable.

philandstuff commented 5 years ago

Is this fixed by #95 now?

dmagliola commented 5 years ago

@philandstuff In short, yes, but the latest release still doesn't reflect that. There are still a few changes planned before cutting a new release, some of which are breaking changes, so we'd like to get them done before that. But... soon :-)

dmagliola commented 5 years ago

We're happy to report, this is now fixed! 🎉 (Sorry for the delay, we should've updated this a while ago)

We haven't yet launched a 1.0 (coming soon!), but you can start using it with our recent v0.10.0.pre.alpha.2 release.

NOTE: We expect there might still be a couple of small breaking changes in the API between now and the 1.0, but we've been using this code in production for months and it works great, so if you are OK with potentially having to do minor adjustments once we release 1.0, you can already use this.
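
For reference, enabling the multi-process support in that release line looks roughly like this. This is a sketch against the 0.10.x data-store API as described in the project README; the require path and directory are assumptions, so check the README of the release you install.

# config.ru or an initializer, before any metrics are registered
require 'prometheus/client'
require 'prometheus/client/data_stores/direct_file_store'  # require path assumed

Prometheus::Client.config.data_store =
  Prometheus::Client::DataStores::DirectFileStore.new(dir: '/tmp/prometheus_metrics')

# Each process writes to its own files under `dir`; the exporter reads and
# aggregates all of them at scrape time, so /metrics is consistent no matter
# which Unicorn/Puma worker answers the request.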

wshihadeh commented 4 years ago

@dmagliola does the fix also cover deploying the service in HA mode? In that case, the running processes do not have access to the shared memory :(

dmagliola commented 4 years ago

@wshihadeh I'm not sure I understand the question fully... HA mode would be different servers, correct? In that case, each server would have separate time series in Prometheus anyway, and the aggregation happens when querying Prometheus.

Or are you talking about something different?

wshihadeh commented 4 years ago

Yes, I am talking about that. To explain it more: suppose the application is a web server that is deployed in multiple containers or servers and exposes the metrics endpoint. I don't see why we need two Prometheus tasks for such a setup (one is enough to scrape metrics from the load balancer or service IP). In that case, the metrics will not be totally accurate, and that is why I think there is a need for shared storage between the containers or servers, something like Redis.

philandstuff commented 4 years ago

I don't see why we need two Prometheus tasks for such setup

Prometheus really really wants to scrape individual instances. There are a number of reasons for this but the main one is that this is the overwhelming convention in the Prometheus world. If you want to break this convention you will find yourself fighting the tooling instead of working with it.

You should use some sort of service discovery to ensure that Prometheus automatically detects new instances when they are created. That way, you only need a single scrape_config but Prometheus will still scrape each instance separately.
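
As a concrete (hypothetical) illustration, a single scrape_config combined with service discovery gives Prometheus one target per instance. File-based SD is shown here, but DNS, Consul, Kubernetes and other mechanisms work the same way; the job name and file path are made up:

scrape_configs:
  - job_name: myapp
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/myapp/*.json   # one target entry per instance, maintained by your deploy tooling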