rdoering closed this issue 2 years ago.
@rdoering, I guess that you can try something like:
count(count(last_over_time(rundeck_project_execution_status[$__range])) by (project_name, execution_id)) by (project_name)
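To make the nested count concrete, here is a minimal Python sketch of the same grouping logic with made-up label sets (the project and execution IDs are illustrative, not from the exporter): the inner `count ... by (project_name, execution_id)` collapses duplicate series per execution, and the outer `count ... by (project_name)` tallies distinct executions per project.

```python
from collections import defaultdict

# Made-up series label sets: the metric may expose one series per
# execution/status combination.
samples = [
    {"project_name": "proj-a", "execution_id": "101", "status": "succeeded"},
    {"project_name": "proj-a", "execution_id": "101", "status": "failed"},
    {"project_name": "proj-a", "execution_id": "102", "status": "succeeded"},
    {"project_name": "proj-b", "execution_id": "201", "status": "succeeded"},
]

# Inner count(...) by (project_name, execution_id): each execution
# counts once, regardless of how many status series it produced.
inner = {(s["project_name"], s["execution_id"]) for s in samples}

# Outer count(...) by (project_name): distinct executions per project.
executions_per_project = defaultdict(int)
for project, _execution in inner:
    executions_per_project[project] += 1

print(sorted(executions_per_project.items()))  # → [('proj-a', 2), ('proj-b', 1)]
```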
I'm not sure why exactly, but my query returns `invalid parameter "query": 1:13: parse error: unknown function with name "last_over_time"`.
But according to the documentation the function should exist, so I will try to update Prometheus.
Oh, I see. In that case you can try any other `*_over_time` function, for example `max_over_time`.
wow, that was fast. I updated prometheus to v2.36.2 and now, it is working.
Currently the results don't match the Rundeck figures on the overview page for the last day. But I will verify it.
Yeah, it's not so accurate, but I guess it's close to the way to get the expected result.
1471 vs 369
I will discover the reason later, because I have to leave :-)
Thanks for your support @phsmith
Oh man, that's too much! No worries, @rdoering, I'll keep you updated if I figure out a better query.
Here is the result of my investigation.
I am using the query `sum(last_over_time(rundeck_project_execution_status{status=~"(succeeded|failed)", project_name="hosted_enterprise_siemens_prod"}[1h])) by (project_name, instance, execution_id)`
to see the executions,
and `sum(last_over_time(rundeck_project_execution_status{status=~"(succeeded|failed)", project_name="hosted_enterprise_siemens_prod"}[1h])) by (project_name, instance)`
to get the figures only.
I compared the results with the API call GET {{url}}/api/17/project/
1) I noticed that the newest execution_id is sometimes missing. I guess the reason is that Prometheus hasn't fetched this execution yet.
2) I noticed that some executions (circa 5) were older than the expected time range of one hour. I think the reason is that these samples have not yet aged out of the exporter or Prometheus.
3) Some executions are missing from the metrics entirely. Their start times vary greatly, so the start time is probably irrelevant to the cause. I suspect the duration is the relevant factor, as all the executions I examined ran for only a few seconds. My suspicion is that these executions started and ended between two exporter API calls and were therefore never recorded by the exporter.
```
"date-started": {
  "unixtime": 1658258806011,
  "date": "2022-07-19T19:26:46Z"
},
"date-ended": {
  "unixtime": 1658258810777,
  "date": "2022-07-19T19:26:50Z"
},
---
"date-started": {
  "unixtime": 1658258386010,
  "date": "2022-07-19T19:19:46Z"
},
"date-ended": {
  "unixtime": 1658258390714,
  "date": "2022-07-19T19:19:50Z"
},
---
"date-started": {
  "unixtime": 1658257546011,
  "date": "2022-07-19T19:05:46Z"
},
"date-ended": {
  "unixtime": 1658257550816,
  "date": "2022-07-19T19:05:50Z"
},
```
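The hypothesis in point 3 can be sketched in a few lines of Python. The scrape timestamps and the one-minute interval are assumptions for illustration; the start/end times are taken from the first snippet above.

```python
# Hypothesis: an execution that starts and ends between two exporter
# scrapes is never observed. Scrape times here are assumed, not real.

def observed(start_ms: int, end_ms: int, scrape_times_ms: list[int]) -> bool:
    """True only if some scrape falls inside the execution's lifetime."""
    return any(start_ms <= t <= end_ms for t in scrape_times_ms)

# Assumed one-minute scrape interval around the first ~5 s execution above:
scrapes = [1658258760000, 1658258820000]
print(observed(1658258806011, 1658258810777, scrapes))  # → False
```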
Issues 1) and 2) are not relevant to me, because a fixed offset seems OK. But issue 3) is a show stopper, because the deviation would grow with a higher number of short-running executions.
For me, this means I cannot rely on these figures.
@rdoering, thanks for that investigation.
I presume the problem is related to this call: https://github.com/phsmith/rundeck_exporter/blob/9929bcce08207dc435aa75b106d0c5e732e9d5f6/rundeck_exporter.py#L223 This endpoint returns a maximum of 20 results by default, so if there are too many executions in a short time it won't show all the data.
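One way around a fixed page size would be to paginate. The sketch below is an assumption for illustration, not the exporter's actual code; the `max`/`offset` names follow Rundeck's usual paging parameters, and `fake_page` stands in for an HTTP call to the executions endpoint.

```python
# Hedged sketch: collect executions page by page instead of relying on
# the endpoint's default page size of 20.

def fetch_all_executions(fetch_page, page_size=20):
    """Keep requesting pages until a short page signals the end."""
    executions, offset = [], 0
    while True:
        page = fetch_page(offset=offset, max=page_size)
        executions.extend(page)
        if len(page) < page_size:
            return executions
        offset += page_size

# Stand-in for GET /project/{name}/executions with max/offset params:
DATA = list(range(47))
def fake_page(offset, max):
    return DATA[offset:offset + max]

print(len(fetch_all_executions(fake_page)))  # → 47
```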
I changed the max results limit, so it should retrieve more data and probably solve the problem.
When you have a chance, please try the latest exporter version.
I will deploy the latest image, wait 2h and look again.
If this helps, we could add an option for "max" executions or, more sophisticated, a pagination algorithm.
I pulled the latest exporter, and here are two ordered lists of execution IDs: prom.txt and scratch_83.txt.
1) All execution IDs in prom.txt but not in scratch (marked yellow):
2) All execution IDs in scratch but not in prom.txt (marked yellow). (!) There is one mismatch at the bottom, too.
Most executions aren't that long, but the executions the exporter missed are very short. Here are all executions from the scratch file with start and end times: scratch_83_timed.txt
Since these executions are included in the API responses I fetched myself, there seems to be another issue.
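The comparison between the two files boils down to two set differences. A minimal sketch with made-up execution IDs (the parsing of the actual files is not shown):

```python
# Sketch of the diff between two ordered ID lists, e.g. one parsed from
# prom.txt and one from scratch_83.txt (IDs below are illustrative).

def diff_ids(a, b):
    """IDs present in a but not in b, preserving a's order."""
    b_set = set(b)
    return [x for x in a if x not in b_set]

prom_ids    = ["83001", "83002", "83004"]
scratch_ids = ["83001", "83002", "83003"]

print(diff_ids(prom_ids, scratch_ids))  # → ['83004']
print(diff_ids(scratch_ids, prom_ids))  # → ['83003']
```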
Oh man, I see... There's something else that needs investigation. As soon as possible I'll try to debug it.
Hey @rdoering,
I found the problem: https://github.com/phsmith/rundeck_exporter/blob/d48ea0c480eeaffdbb12e206532ceda2dccc90ea/rundeck_exporter.py#L246-L249
This old block of code was responsible for not showing all the project execution results as expected.
Also, I've added a new parameter `rundeck.projects.executions.limit`
so you can control how many results to retrieve at once; the default stays 20.
Please test the latest version, v2.4.14, when you have a chance.
I updated the image, restarted the container, and will verify it.
Much better: 95 (API) vs 117 (exporter) for 1h.
In a project with many more executions: 2301 (API) vs 493 (exporter) for 1h. I set RUNDECK_PROJECTS_EXECUTIONS_LIMIT to 300 and will verify it again later.
There is no execution missing in between, but there's a huge delay. I guess 300 is way too large :-/.
Oh, great to know!
Yeah, if you have a lot of projects, then higher limits are going to be a problem.
Maybe you could try RUNDECK_PROJECTS_EXECUTIONS_CACHE=True
or increase RUNDECK_PROJECTS_EXECUTIONS_LIMIT
gradually.
I think the right way to get the metrics is using execution-query-metrics API.
I've introduced a new metric, `rundeck_project_executions_total`.
I was able to get the total project executions from the endpoint /project/{project_name}/executions
that's already used in the exporter.
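Assuming the endpoint's JSON response carries Rundeck's usual paging object, the total can be read without walking all pages. The response shape and numbers below are illustrative, not captured from a real instance:

```python
# Assumed shape of a /project/{name}/executions JSON response: a
# "paging" object alongside the (possibly truncated) executions list.
sample_response = {
    "paging": {"count": 20, "total": 2301, "offset": 0, "max": 20},
    "executions": [],  # elided
}

def executions_total(response):
    """Project-wide execution count, independent of the page size."""
    return response["paging"]["total"]

print(executions_total(sample_response))  # → 2301
```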
Closing this issue. If the problem was not solved, feel free to re-open it.
I would like to count the number of executions grouped by project_name for further calculations.
I thought about
`count(rundeck_project_execution_status) by (project_name)`
but I am not able to put this into a graph.