Improve audit metrics for longer and parallel runs

stek29 commented 2 years ago

Describe the solution you'd like

We're running OPA Gatekeeper on a huge cluster and, judging from the logs, an audit run takes about two minutes to complete.

However, the largest bucket in gatekeeper_audit_duration_seconds_bucket is five seconds, so this metric becomes useless for monitoring if audit takes more than 5 seconds to complete (which is the case even for a small dev cluster we have).
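To illustrate why a 5-second ceiling hides longer audits, here is a minimal Python sketch of cumulative (Prometheus-style) histogram bucketing. The bucket bounds are assumptions chosen for illustration; only the 5-second maximum comes from the report above, and `bucket_counts` is a hypothetical helper, not Gatekeeper code.

```python
import math

# Hypothetical upper bounds in seconds; only the 5 s maximum is from the issue.
BOUNDS = [0.5, 1.0, 2.0, 5.0, math.inf]

def bucket_counts(durations, bounds=BOUNDS):
    """Return cumulative counts: how many durations are <= each upper bound."""
    return {le: sum(1 for d in durations if d <= le) for le in bounds}

# Audits that take about two minutes, as on the large cluster described above.
counts = bucket_counts([118.0, 121.5, 125.0])
print(counts)
```

Every finite bucket stays at zero and only the `+Inf` bucket grows, so the histogram cannot tell a two-minute audit from a two-hour one.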

Maybe buckets for the metric should be adjusted, probably in logarithmic scale even?
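As a rough sketch of what logarithmically spaced buckets could look like, the hypothetical helper below generates bounds that grow by a constant factor. The start value, factor, and count are assumptions picked for illustration, not values from Gatekeeper. The Prometheus Go client offers a similar helper, `prometheus.ExponentialBuckets`.

```python
def exponential_buckets(start, factor, count):
    """Return `count` upper bounds: start, start*factor, start*factor**2, ..."""
    if start <= 0 or factor <= 1 or count < 1:
        raise ValueError("need start > 0, factor > 1, count >= 1")
    return [start * factor ** i for i in range(count)]

# Assumed parameters: 1 s up to roughly 4.5 hours in 8 buckets.
bounds = exponential_buckets(1.0, 4.0, 8)
print(bounds)
```

With a constant growth factor, a handful of buckets spans both a small dev cluster and a very large one, at the cost of coarser resolution at the high end.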

Another viable option would be a metric like gatekeeper_audit_last_run_time, recording the last time an audit run actually finished.

Environment:

Gatekeeper version: 3.8.0, but afaik nothing's changed in this regard in 3.9.0 – will test later this week
Kubernetes version: (use kubectl version): doesn't seem to be relevant

stek29 commented 2 years ago

Considering that multiple audits probably should not run at the same time, and that the audit interval is most likely in the range of minutes, I'd suggest ditching the histogram metric and replacing it with a gauge that holds the last audit duration.

maxsmythe commented 2 years ago

We should definitely increase the scope of the histogram, which was never widened after the prototype. I think buckets may still be useful just so users can see whether audit duration is relatively stable or has a wide distribution.

I like the idea of also having a last-finished time. Currently we export the start time for audits, but that's not super useful if you want to make sure audits have been completing.

ritazh commented 2 years ago

Thanks for raising this, @stek29!

> Maybe buckets for the metric should be adjusted, probably in logarithmic scale even?

+1 This seems more dynamic.

stek29 commented 2 years ago

@maxsmythe I'm not sure that last-finished time is needed – I'm currently using gatekeeper_audit_duration_seconds_count to see when audits are completed.

However, there's a problem with a last-finished time metric (whether a dedicated metric or one derived from gatekeeper_audit_duration_seconds_count): Gatekeeper starts a new audit on schedule even if the previous one hasn't finished, so there's no way to know which audit finished last.

For example, if the audit interval is 5m and an audit run takes 1m to finish, then the "last-finished time" would be one minute after the "last-starting time", but the metrics would look the same if the audit took 6m, 11m, and so on.
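The ambiguity can be checked with a small simulation. This is a sketch under stated assumptions: audits start exactly on schedule at multiples of the interval, every run takes the same fixed duration, and `last_start_and_finish` is a hypothetical helper.

```python
def last_start_and_finish(interval, duration, now):
    """Audits start at 0, interval, 2*interval, ... and each runs for `duration`.

    Returns the most recent start time and most recent finish time at `now`.
    """
    last_start = (now // interval) * interval
    # Largest k such that the run started at k*interval has finished by `now`.
    k = (now - duration) // interval
    last_finish = k * interval + duration if k >= 0 else None
    return last_start, last_finish

INTERVAL = 300   # 5m audit interval, in seconds
NOW = 3090       # arbitrary observation time, in seconds

# Audit durations of 1m, 6m and 11m.
observations = {d: last_start_and_finish(INTERVAL, d, NOW) for d in (60, 360, 660)}
print(observations)
```

All three durations produce the same pair of timestamps, with the last finish 60 seconds after the last start, so the two timestamps alone cannot reveal how long an audit really took.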

How about this?

1) Expand histogram scope
2) Add two more metrics – gatekeeper_audit_last_run_duration, gatekeeper_audit_last_run_ending_time
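A minimal sketch of how the two proposed gauges could be maintained, written in Python for brevity. The class `AuditRunMetrics`, its attributes, and the injectable `clock` are hypothetical stand-ins and do not reflect Gatekeeper's actual metrics code; only the two metric names come from the proposal.

```python
import time

class AuditRunMetrics:
    """Holds the values the two proposed gauges would export."""

    def __init__(self, clock=time.time):
        self._clock = clock                 # injectable so the example is deterministic
        self.last_run_duration = None       # gatekeeper_audit_last_run_duration (seconds)
        self.last_run_ending_time = None    # gatekeeper_audit_last_run_ending_time (unix time)

    def record_run(self, started_at):
        """Call when an audit run finishes; `started_at` is its start timestamp."""
        ended_at = self._clock()
        self.last_run_duration = ended_at - started_at
        self.last_run_ending_time = ended_at

# A run that started at t=1000 and finished at t=1360 (a 6-minute audit).
metrics = AuditRunMetrics(clock=lambda: 1360.0)
metrics.record_run(started_at=1000.0)
print(metrics.last_run_duration, metrics.last_run_ending_time)
```

Because the duration is recorded together with the ending time, a 6-minute run is no longer indistinguishable from a 1-minute one.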

Also, I'm not sure on what's the usual audit duration for other gatekeeper installations – mine take about 5-6 minutes to finish. Maybe histogram scope should be configurable, if usual audit durations differ significantly for different users?

maxsmythe commented 2 years ago

> how about this?
>
> 1) Expand histogram scope
> 2) Add two more metrics – gatekeeper_audit_last_run_duration, gatekeeper_audit_last_run_ending_time

I think that makes a lot of sense. It might also be worth storing a gauge that lists the number of concurrently running audits, since that might help with memory issues.

If possible, a metric that shows the age of the oldest running audit could also be useful
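Both ideas (a count of concurrently running audits and the age of the oldest one) can be derived from a single map of in-flight runs. The sketch below is illustrative only: `RunningAudits` and its method names are hypothetical, and timestamps are passed in explicitly to keep the example deterministic.

```python
class RunningAudits:
    """Tracks in-flight audit runs by id."""

    def __init__(self):
        self._started = {}  # run id -> start timestamp (seconds)

    def start(self, run_id, now):
        self._started[run_id] = now

    def finish(self, run_id):
        self._started.pop(run_id, None)

    def running_count(self):
        """Value for a 'concurrently running audits' gauge."""
        return len(self._started)

    def oldest_age(self, now):
        """Value for an 'age of oldest running audit' gauge; 0 when idle."""
        if not self._started:
            return 0.0
        return now - min(self._started.values())

tracker = RunningAudits()
tracker.start("run-a", now=0.0)
tracker.start("run-b", now=300.0)   # next scheduled run starts before run-a finishes
count_while_overlapping = tracker.running_count()
age_while_overlapping = tracker.oldest_age(now=400.0)
tracker.finish("run-a")
age_after_finish = tracker.oldest_age(now=400.0)
```

A steadily growing oldest-age value, or a running count above one, would flag audits that overlap or never complete.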

> Also, I'm not sure on what's the usual audit duration for other gatekeeper installations – mine take about 5-6 minutes to finish. Maybe histogram scope should be configurable, if usual audit durations differ significantly for different users?

Configurable can make sense, but we probably want a default that at least gives some info.

Setting sub-1-second durations was definitely optimistic. I'd probably start with 1 minute:

1, 5, 10, 20, 40, 80, 160, 320, 640, 1280 minutes for buckets?
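To see how the durations mentioned in this thread would fall into these bounds, here is a small sketch. The bounds are the ones proposed above, converted to seconds; `bucket_for` is a hypothetical helper that returns the smallest bucket containing a duration.

```python
import bisect

PROPOSED_MINUTES = [1, 5, 10, 20, 40, 80, 160, 320, 640, 1280]
BOUNDS_SECONDS = [m * 60 for m in PROPOSED_MINUTES]

def bucket_for(duration_seconds):
    """Return the smallest upper bound (seconds) with duration <= bound, or None for +Inf."""
    i = bisect.bisect_left(BOUNDS_SECONDS, duration_seconds)
    return BOUNDS_SECONDS[i] if i < len(BOUNDS_SECONDS) else None

two_minute_audit = bucket_for(120)    # the ~2 minute audit from the original report
six_minute_audit = bucket_for(330)    # a 5-6 minute audit
overflow = bucket_for(100_000)        # longer than 1280 minutes
```

A two-minute audit lands in the 5-minute bucket and a 5.5-minute audit in the 10-minute bucket, so both are distinguishable, unlike with a 5-second maximum.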

acpana commented 2 years ago

@ritazh @maxsmythe @stek29 do y'all consider this issue ~fixed~ done now? Also, should we advertise the buckets in the docs?

maxsmythe commented 2 years ago

I defer to @stek29 as to whether he's satisfied with the issue. WRT advertising buckets, no need.

open-policy-agent / gatekeeper

Improve audit metrics for longer and parallel runs #2197