phsmith / rundeck_exporter

Rundeck Metrics Exporter
GNU General Public License v3.0

I try to count recent executions per project #56

Closed rdoering closed 2 years ago

rdoering commented 2 years ago

I would like to count the number of executions grouped by project_name for further calculations.

I thought about count(rundeck_project_execution_status) by (project_name) but I am not able to put this into a graph.

phsmith commented 2 years ago

@rdoering, I guess that you can try something like:

count(count(last_over_time(rundeck_project_execution_status[$__range])) by (project_name, execution_id)) by (project_name)
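
One way to read the nested count above: the inner count collapses all series that share the same (project_name, execution_id) pair into one group, and the outer count then counts those groups per project. A minimal Python sketch of that reading, assuming made-up label sets (the sample data and the helper name executions_per_project are illustrative and not part of the exporter):

```python
from collections import defaultdict

# Hypothetical label sets, one per time series returned for the query range.
# Assumption: the same execution may show up in more than one series
# (for example, once per status label value).
series = [
    {"project_name": "alpha", "execution_id": "101", "status": "succeeded"},
    {"project_name": "alpha", "execution_id": "101", "status": "failed"},
    {"project_name": "alpha", "execution_id": "102", "status": "succeeded"},
    {"project_name": "beta", "execution_id": "201", "status": "succeeded"},
]


def executions_per_project(series):
    # Inner "count ... by (project_name, execution_id)":
    # one group per distinct execution.
    groups = {(s["project_name"], s["execution_id"]) for s in series}
    # Outer "count ... by (project_name)": number of groups per project.
    per_project = defaultdict(int)
    for project_name, _execution_id in groups:
        per_project[project_name] += 1
    return dict(per_project)


print(executions_per_project(series))
```

With this sample data the result is 2 for alpha and 1 for beta, even though alpha has three series.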

rdoering commented 2 years ago

I'm not sure why exactly, but my query returns the error: invalid parameter "query": 1:13: parse error: unknown function with name "last_over_time".

According to the documentation, though, the function should exist, so I will try updating Prometheus.

phsmith commented 2 years ago

Oh, I see. In that case you can try any other *_over_time function, for example max_over_time.

rdoering commented 2 years ago

Wow, that was fast. I updated Prometheus to v2.36.2 and now it is working.

Currently the results do not match the Rundeck figures on the overview page for the last day, but I will verify that.

phsmith commented 2 years ago

Yeah, it's not that accurate, but I guess it's close to the way to get the expected result.

rdoering commented 2 years ago

1471 vs 369

rdoering commented 2 years ago

I will look into the reason later, because I have to leave :-)

Thanks for your support @phsmith

phsmith commented 2 years ago

Oh man, that's too much! No worries, @rdoering, I'll keep you updated if I figure out a better query.

rdoering commented 2 years ago

Here is the result of my investigation.

I am using the query

sum(last_over_time(rundeck_project_execution_status{status=~"(succeeded|failed)", project_name="hosted_enterprise_siemens_prod"}[1h])) by (project_name, instance, execution_id)

to see the executions, and

sum(last_over_time(rundeck_project_execution_status{status=~"(succeeded|failed)", project_name="hosted_enterprise_siemens_prod"}[1h])) by (project_name, instance)

to get the figures only.

I compared the results with the API call GET {{url}}/api/17/project//executions?recentFilter=1h&max=400

1) The newest execution_id is sometimes missing. I guess the reason is that Prometheus has not scraped this execution yet.

2) Some (about 5) executions were older than the expected time range of one hour. I think the reason is that these entries have not yet been dropped by the exporter or Prometheus.

3) Some executions are missing from the metrics. Their start times vary widely, so the start time is probably not relevant to the cause. I suspect the duration is relevant here, as all the executions I examined ran for only a few seconds. I therefore suspect that these executions started and ended between two exporter API calls and were never recorded by the exporter.

"date-started": {
    "unixtime": 1658258806011,
    "date": "2022-07-19T19:26:46Z"
},
"date-ended": {
    "unixtime": 1658258810777,
    "date": "2022-07-19T19:26:50Z"
},

---

"date-started": {
    "unixtime": 1658258386010,
    "date": "2022-07-19T19:19:46Z"
},
"date-ended": {
    "unixtime": 1658258390714,
    "date": "2022-07-19T19:19:50Z"
},

---

"date-started": {
    "unixtime": 1658257546011,
    "date": "2022-07-19T19:05:46Z"
},
"date-ended": {
    "unixtime": 1658257550816,
    "date": "2022-07-19T19:05:50Z"
},

Issues 1) and 2) are not relevant to me, because a fixed offset seems to be OK. But issue 3) is a showstopper, because the deviation would grow with the number of short-running executions.

For me, this means that I cannot rely on these figures.

phsmith commented 2 years ago

@rdoering thanks for that investigation.

I presume the problem is related to this call: https://github.com/phsmith/rundeck_exporter/blob/9929bcce08207dc435aa75b106d0c5e732e9d5f6/rundeck_exporter.py#L223 This endpoint returns a maximum of 20 results by default, so if there are too many executions in a short time, it will not show all the data.
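
The effect of a fixed result limit can be illustrated with a toy simulation. This is only a sketch under stated assumptions: it assumes each poll returns the most recently started executions up to the limit, and the workload, poll schedule, and the helper names poll_recent and simulate are all hypothetical. It does not model the actual Rundeck API or the exporter code.

```python
def poll_recent(executions, now, limit):
    """Return ids of the `limit` most recently started executions at time `now`."""
    started = [e for e in executions if e["start"] <= now]
    started.sort(key=lambda e: e["start"], reverse=True)
    return {e["id"] for e in started[:limit]}


def simulate(executions, poll_times, limit):
    """Union of all execution ids observed across the given polls."""
    seen = set()
    for now in poll_times:
        seen |= poll_recent(executions, now, limit)
    return seen


# Hypothetical workload: one execution starts every second for 120 seconds.
executions = [{"id": i, "start": i} for i in range(120)]
# Hypothetical schedule: the exporter is polled every 30 seconds.
poll_times = [29, 59, 89, 119]

seen = simulate(executions, poll_times, limit=20)
missed = len(executions) - len(seen)
print(f"seen={len(seen)} missed={missed}")
```

In this toy setup, 30 executions start between two polls but each poll only returns 20, so 10 executions per interval are never observed. Raising the limit to at least the number of executions per poll interval closes the gap.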

I changed the max results limit so it should retrieve more data and probably solve the problem.

When you have a chance, please, try the latest exporter version.

rdoering commented 2 years ago

I will deploy the latest image, wait 2h and look again.

If this helps, we could add an option for "max" executions or, more sophisticated, a pagination algorithm.
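
A pagination loop along those lines might look like the sketch below. It assumes the executions endpoint accepts an offset next to max and reports a total in a paging object; that should be checked against the Rundeck API documentation for the version in use. The fetch_page callable, the response shape, and fake_fetch_page are hypothetical stand-ins so the sketch runs without a Rundeck server.

```python
def fetch_all_executions(fetch_page, page_size=20):
    """Collect all executions by requesting consecutive pages.

    `fetch_page(offset, max_results)` is a hypothetical callable returning a
    dict shaped like {"paging": {"total": int}, "executions": [...]}.
    """
    executions = []
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        items = page["executions"]
        executions.extend(items)
        offset += len(items)
        # Stop when the server has nothing more or we have reached the total.
        if not items or offset >= page["paging"]["total"]:
            return executions


# Stand-in for the HTTP call (assumption: 45 executions exist in total).
def fake_fetch_page(offset, max_results, _data=list(range(45))):
    return {
        "paging": {"total": len(_data), "offset": offset, "max": max_results},
        "executions": _data[offset:offset + max_results],
    }


all_ids = fetch_all_executions(fake_fetch_page, page_size=20)
print(len(all_ids))
```

One caveat with offset-based paging: if new executions arrive while the pages are being fetched, offsets shift and entries can be duplicated or skipped, so deduplicating by execution id would probably be needed in practice.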

rdoering commented 2 years ago

I pulled the latest exporter, and here are two ordered lists of execution IDs: prom.txt and scratch_83.txt

1) Looking for all execution IDs in prom.txt but not in scratch_83.txt (marked yellow): image

2) Looking for all execution IDs in scratch_83.txt but not in prom.txt (marked yellow): image (!) There is one mismatch at the bottom, too.
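
The same two-way comparison can be done with set differences instead of marking rows by hand. A minimal sketch, assuming each list holds one execution ID per entry; the IDs below are made up and compare_ids is a hypothetical helper:

```python
def compare_ids(prometheus_ids, api_ids):
    """Return the execution ids that appear on only one side."""
    prom = set(prometheus_ids)
    api = set(api_ids)
    return {
        "only_in_prometheus": sorted(prom - api),
        "only_in_api": sorted(api - prom),
    }


# Hypothetical ids; in practice they could be read from the two text files.
prom_ids = [1001, 1002, 1004, 1005]
api_ids = [1002, 1003, 1004, 1005, 1006]

diff = compare_ids(prom_ids, api_ids)
print(diff)
```

For this sample, 1001 shows up only on the Prometheus side, and 1003 and 1006 only on the API side.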

Most executions aren't that long, but the executions the exporter missed are very short. Here are all executions from the scratch file with start and end times: scratch_83_timed.txt

As the executions are included in the API responses I fetched, there seems to be another issue.

phsmith commented 2 years ago

Oh man, I see... There's something else that needs investigation. As soon as possible I'll try to debug it.

phsmith commented 2 years ago

Hey @rdoering,

I found the problem: https://github.com/phsmith/rundeck_exporter/blob/d48ea0c480eeaffdbb12e206532ceda2dccc90ea/rundeck_exporter.py#L246-L249

This old block of code was responsible for not showing all of the projects' execution results as expected.

Also, I've added a new parameter, rundeck.projects.executions.limit, so you can control how many results to retrieve at once; the default remains 20.

Please, test the latest version, v2.4.14, when you have a chance.

rdoering commented 2 years ago

I updated the images, restarted the container, and will verify it.

rdoering commented 2 years ago

Much better: 95 (API) vs 117 (exporter) for 1h.

In a project with many more executions: 2301 (API) vs 493 (exporter) for 1h. I set RUNDECK_PROJECTS_EXECUTIONS_LIMIT to 300 and will verify it again later.

rdoering commented 2 years ago

There is no execution missing in between, but there is a huge delay. I guess 300 is way too large :-/.

phsmith commented 2 years ago

Oh, great to know! Yeah, if you have a lot of projects, then higher limits are going to be a problem. Maybe you could try RUNDECK_PROJECTS_EXECUTIONS_CACHE=True or increase RUNDECK_PROJECTS_EXECUTIONS_LIMIT gradually.

buptzhoutian commented 2 years ago

I think the right way to get the metrics is using execution-query-metrics API.

phsmith commented 2 years ago

> I think the right way to get the metrics is using execution-query-metrics API.

I've introduced a new metric, rundeck_project_executions_total. I was able to get the total project executions from the endpoint /project/{project_name}/executions, which is already used in the exporter.

phsmith commented 2 years ago

Closing this issue. If the problem was not solved, feel free to re-open it.