kevinkle opened 3 years ago
Hi,
Thank you for this pull request.
About your decisions: I really can't comment on them, as I have very little experience with metrics/Prometheus and this was just a hobby project, but I don't mind merging this pull request if these new metrics are disabled by default.
> I would say I'm still uncertain about using a Summary Metric

Me too, I'm uncertain about using a Summary in this situation.
About the flag: we need to add a configuration option in the `config.py` module that is `False` by default, and add an argument in `__main__.parse_args` (maybe we can call it `--job-runtime`) to enable the job runtime metrics. Then we can pass this configuration value to the `RQCollector` class in the `__main__.main` and `exporter.create_app` functions.
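The wiring described above could look roughly like this. This is only a sketch with assumed names (the config constant, the `RQCollector` signature, and `main` are simplified stand-ins, not the actual rq-exporter code):

```python
import argparse

# config.py equivalent: job runtime metrics disabled by default (assumed name).
RQ_JOB_RUNTIME_METRICS = False


def parse_args(argv=None):
    """Parse CLI arguments; --job-runtime enables the new metrics."""
    parser = argparse.ArgumentParser(description="RQ exporter")
    parser.add_argument(
        "--job-runtime",
        action="store_true",
        default=RQ_JOB_RUNTIME_METRICS,
        help="Enable the job runtime metrics (disabled by default)",
    )
    return parser.parse_args(argv)


class RQCollector:
    """Simplified stand-in for the collector; only the flag matters here."""

    def __init__(self, connection=None, job_runtime=False):
        self.connection = connection
        self.job_runtime = job_runtime


def main(argv=None):
    args = parse_args(argv)
    # The flag value is passed down to the collector.
    return RQCollector(job_runtime=args.job_runtime)
```

The same `job_runtime` value would also be threaded through `exporter.create_app` in the real code.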
Please also check any failing unit tests and add tests for the new functions.
Thank you.
Any plan on reconsidering this @mdawar / @kevinkle ?
@xlr-8 I'm personally no longer using the exporter, just maintaining it. I have no plans to add this feature unless we have a fully working PR with tests, preferably behind a flag to enable these metrics (especially if they're expensive), so we don't affect users who don't need this feature.
Per #3, I thought I'd open a PR so we can move specific discussions here. Consider this an early draft; here's what the added panels look like at the moment.
A short-running job was added to show the difference.
Some decisions that were made:
- Using `SummaryMetricFamily` - the other options would have been `GaugeMetricFamily` or `HistogramMetricFamily`. Histograms require pre-defined buckets, which I don't think would be a good fit since the runtimes wouldn't be generic; a Summary metric instead uses rolling time windows. It adds the data individually per scraped job, so the `count_value` is `1` and the `sum_value` is only what that job's calculated runtime was; this is a bit strange - Gauge metrics may work instead.
- Currently it scrapes the 3 most recently completed jobs per queue, which gives the above panels. Will look at option flags for this.
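The per-job sample described above can be sketched as plain data. This is a stand-in mirroring the `SummaryMetricFamily.add_metric(labels, count_value, sum_value, timestamp)` call from `prometheus_client`; the helper name, job name, and queue name are invented for illustration:

```python
from datetime import datetime, timezone


def job_runtime_sample(func_name, queue, started_at, ended_at):
    """One summary observation per scraped job: count_value is 1 and
    sum_value is only this job's runtime, as described above."""
    runtime = (ended_at - started_at).total_seconds()
    return {
        "labels": {"name": func_name, "queue": queue},
        "count_value": 1,          # one observation per scraped job
        "sum_value": runtime,      # only this job's runtime
        # Timestamp taken from job.ended_at so Prometheus can order samples.
        "timestamp": ended_at.timestamp(),
    }


start = datetime(2021, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
end = datetime(2021, 1, 1, 12, 0, 7, tzinfo=timezone.utc)
sample = job_runtime_sample("long_running_job", "default", start, end)
```

Here `sample["sum_value"]` is `7.0` with `count_value` of `1`, which is what makes the Summary usage feel odd.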
- Timestamps are specified when the job runtime data is added; this is taken from `job.ended_at`. Since Prometheus is append-only, when we scrape jobs that completed prior to the last scrape, those old jobs will not be added; instead Prometheus will throw an "Error on ingesting out-of-order samples" and drop them. I believe this should mean that it never stores jobs as duplicates and only displays the latest data. Could also use a better approach for this.
- Data labels are the `job.func_name` and the queue.

Still left to do:

- `enqueued_at` in a different metric

Other things:
To your earlier point (https://github.com/mdawar/rq-exporter/issues/3#issuecomment-667305691): if the job completes and is removed before a scrape, there is still no information about it. So scrape intervals have to be within the job's TTL, which should be the case unless the TTL was manually set to something pretty short.
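The TTL constraint above can be stated as a trivial check. This is purely illustrative (the helper is not part of the exporter); note that RQ's default `result_ttl` is 500 seconds, so a typical 15-second scrape interval is comfortably within it:

```python
def job_visible_to_scrape(scrape_interval_s, result_ttl_s):
    """True if every finished job survives at least one scrape cycle,
    i.e. the scrape interval fits within the job's result TTL."""
    return scrape_interval_s <= result_ttl_s


# RQ's default result_ttl is 500 seconds; a 15s scrape interval fits.
ok = job_visible_to_scrape(15, 500)
# A manually shortened TTL can fall below the scrape interval.
too_short = job_visible_to_scrape(15, 5)
```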
I would say I'm still uncertain about using a Summary metric - I think it's meant for when the exporter is able to calculate something like total response time for a whole group of responses? In other words, when the `count_value` is > 1 and the `sum_value` would be the sum of those individual samples. Running `avg()` on this metric seems to make logical sense for now: `long_running_job`s complete within `random.randint(2, 10)` seconds as displayed, and `short_running_job`s complete within that divided by 10.

I think using timestamps does get around the issue of the data being inaccurate, as it should have only the latest jobs completed since the last scrape. If that works, it should mean not having to scrape all jobs from the finished and failed registries and instead focusing on answering a "what's the performance like right now" question.
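To make the `avg()` intuition concrete: with `count_value == 1` per job, the summary's aggregate `sum / count` reduces to the plain mean of the individual runtimes. A toy check with invented numbers (these are not real scrape values):

```python
# Invented per-job summary samples, one observation each (count_value == 1).
samples = [
    {"job": "long_running_job", "sum_value": 6.0, "count_value": 1},
    {"job": "long_running_job", "sum_value": 9.0, "count_value": 1},
    {"job": "short_running_job", "sum_value": 0.6, "count_value": 1},
]

total = sum(s["sum_value"] for s in samples)
count = sum(s["count_value"] for s in samples)

# With one observation per sample, sum/count is just the mean runtime,
# which is what avg() over these series effectively computes.
mean_runtime = total / count
```

This is why `avg()` "seems to make logical sense" despite the unusual one-observation-per-sample shape.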
Anyway, let me know your thoughts and we can go from there.