mdawar / rq-exporter

Prometheus metrics exporter for Python RQ (Redis Queue).
MIT License

Export job start/end times (#3) #4

Open kevinkle opened 3 years ago

kevinkle commented 3 years ago

Per #3, I thought I'd open a PR so we can move the specific discussions here. Consider this an early draft; here's what the added panels look like at the moment.

(Screenshot: runtime panels)

A short-running job was added to show the difference.

Some decisions that were made:

- Using SummaryMetricFamily: the other options would have been GaugeMetricFamily or HistogramMetricFamily. Histograms require pre-defined buckets, which I don't think would be a good fit since the runtimes wouldn't be generic, whereas a summary metric uses rolling time windows. The data is added individually per scraped job, so the count_value is 1 and the sum_value is only that job's calculated runtime; this is a bit strange, so gauge metrics may work instead (see the sketch after this list).

- Currently it scrapes the 3 most recently completed jobs per queue, which gives the panels above. I'll look at adding option flags for this.

- Timestamps are specified when the job runtime data is added, taken from job.ended_at. Since Prometheus is append-only, jobs that completed prior to the last scrape won't be added; Prometheus will throw an error on ingesting out-of-order samples and drop them. I believe this means it never stores duplicate jobs and only displays the latest data. This could also use a better approach.

- Data labels are the job.func_name and the queue.
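For reference, here's a rough sketch of how these decisions fit together. This isn't the PR code; the metric name rq_job_runtime_seconds, the collect_job_runtimes helper and the jobs_per_queue parameter are made up for illustration:

```python
from datetime import timezone

from prometheus_client.core import SummaryMetricFamily
from rq import Queue
from rq.job import Job
from rq.registry import FinishedJobRegistry


def collect_job_runtimes(connection, queue_names, jobs_per_queue=3):
    """Yield one summary sample per recently finished job (illustrative only)."""
    runtime = SummaryMetricFamily(
        'rq_job_runtime_seconds',  # hypothetical metric name
        'Runtime of recently finished RQ jobs',
        labels=['queue', 'func_name'],
    )

    for name in queue_names:
        queue = Queue(name, connection=connection)
        registry = FinishedJobRegistry(queue=queue)

        # The highest-scored registry entries are the most recently finished
        # jobs (assuming a uniform result TTL), hence the negative start index.
        for job_id in registry.get_job_ids(-jobs_per_queue, -1):
            job = Job.fetch(job_id, connection=connection)
            if not (job.started_at and job.ended_at):
                continue

            seconds = (job.ended_at - job.started_at).total_seconds()

            # One sample per job: count_value=1, sum_value=<that job's runtime>.
            # The sample timestamp comes from job.ended_at (stored by RQ in UTC),
            # so Prometheus drops anything older than what it already ingested.
            runtime.add_metric(
                [queue.name, job.func_name],
                count_value=1,
                sum_value=seconds,
                timestamp=job.ended_at.replace(tzinfo=timezone.utc).timestamp(),
            )

    yield runtime
```

A custom collector's collect() method would yield this metric family alongside the existing ones.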

Still left to do:

Other things:

To your earlier point (https://github.com/mdawar/rq-exporter/issues/3#issuecomment-667305691): if the job completes and is removed before a scrape, there is still no information about it. So the scrape interval has to be within the job's TTL, which should be the case unless the TTL was manually set to something very short.

I would say I'm still uncertain about using a summary metric. I think it's meant for when the exporter can calculate something like the total response time for a whole group of responses, i.e. when count_value is > 1 and sum_value is the sum of those individual samples. Running avg() on this metric seems to make logical sense for now: long_running_jobs complete within random.randint(2, 10) as displayed, and short_running_jobs complete within a tenth of that.
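For contrast, here's roughly how a summary is normally used with the Python client, where a single metric accumulates many observations (the metric name is illustrative):

```python
from prometheus_client import Summary

# A Summary normally aggregates many observations inside the exporting process,
# so at scrape time _count > 1 and _sum is the total of all observed values.
JOB_RUNTIME = Summary('job_runtime_seconds', 'Job runtime in seconds')

for seconds in (2.3, 7.8, 4.1):  # e.g. three jobs observed by the same process
    JOB_RUNTIME.observe(seconds)

# The scrape output would then contain:
#   job_runtime_seconds_count 3.0
#   job_runtime_seconds_sum 14.2
# and the average runtime over these observations is _sum / _count.
```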

I think using timestamps gets around issues with the data being inaccurate, since only the latest jobs completed since the last scrape are kept. If that works, it should mean we don't have to scrape all jobs from the finished and failed registries and can instead focus on answering the question of what the performance is like right now.

Anyways, let me know your thoughts and can go from there.

mdawar commented 3 years ago

Hi,

Thank you for this pull request.

About your decisions: I really can't comment on them, as I have very little experience with metrics/Prometheus and this was just a hobby project, but I don't mind merging this pull request if these new metrics are disabled by default.

> I would say I'm still uncertain about using a Summary Metric

I'm also uncertain about using a Summary in this situation.

About the flag: we need to add a configuration option in the config.py module and have it default to False. We also need to add an argument in __main__.parse_args, maybe called --job-runtime, to enable the job runtime metrics, and then pass this configuration value to the RQCollector class in the __main__.main and exporter.create_app functions.
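Roughly something along these lines (an outline only, not the project's actual code; everything except the parse_args, create_app and RQCollector names mentioned above is an assumption):

```python
import argparse

# config.py -- new option, disabled by default
RQ_EXPORTER_JOB_RUNTIME = False


# __main__.py -- expose the option as a CLI flag
def parse_args():
    parser = argparse.ArgumentParser(description='RQ metrics exporter')
    # ... existing arguments ...
    parser.add_argument(
        '--job-runtime',
        action='store_true',
        default=RQ_EXPORTER_JOB_RUNTIME,
        help='Enable the job runtime metrics (disabled by default)',
    )
    return parser.parse_args()

# The parsed value would then be passed down, e.g.:
#   collector = RQCollector(connection, job_runtime=args.job_runtime)
# and the collector would skip the runtime metrics when it's False.
```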

Please also check any failing unit tests and add tests for the new functions.

Thank you.

xlr-8 commented 11 months ago

Any plans on reconsidering this, @mdawar / @kevinkle?

mdawar commented 11 months ago

@xlr-8 I'm personally no longer using the exporter, just maintaining it, and I have no plans to add this feature myself. We'd need a fully working PR with tests, and preferably a flag to enable these metrics, especially if they're expensive, so we don't affect users that don't need this feature.