vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

Add and expose metrics #84

Closed ncdc closed 5 years ago

ncdc commented 7 years ago

Add various metrics using Prometheus.

timoreimann commented 6 years ago

As announced on Slack, we (i.e., me and/or some of my team colleagues) would be keen on adding some initial support for Prometheus-based metrics collection.

Is there anything in particular that should be included from the project maintainer's / Ark community's view? The baseline for us would probably be somewhere along the lines of collecting info about

That would allow us to set up alerts like "notify me when the last successful backup is older than x time units" and "notify me when less than x objects were considered".

Implementation-wise, we usually just include the official Go Prometheus client library. So far, it has served us well.

Happy to discuss design considerations and implementation details up front.

ncdc commented 6 years ago

@timoreimann I think all of those are good ideas. I'm also interested in things like

+1 to the official client lib. We're using dep now so you can run hack/dep-save.sh, where any arguments you pass are handed directly to dep ensure.

I think it would be a good idea to see a rough sketch of how you envision integrating metrics gathering into the code base.

Thanks!

timoreimann commented 6 years ago

Appreciate the prompt feedback, @ncdc. I'll try to dedicate some time next week to dig into the relevant code sections and come up with a metrics master plan to post here. :-)

timoreimann commented 6 years ago

I have finally found some time to look into the code and give some thought to how we could integrate Prometheus support into Ark.

The first thing I believe we need to decide on is how we should scope the Prometheus library usage. Basically, there are two general approaches you can observe in the wild:

  1. Use Prometheus' global default registry to register the metrics. The advantage is that no special registry configuration is required for the metrics to be exported on the /metrics endpoint. The downside is that it makes testing harder, since you need to take care of cleaning up global state, and it limits parallel testing.

  2. Instantiate our own Prometheus registry and hand it through as an explicit dependency to the structs / functions, basically the same way the logger is handled in Ark. This solves the testing issue, but requires us to pass the registry to every function we want to track metrics from (or to have another wrapping object that holds the registry for all consumers). Depending on how widely across Ark we want to track metrics, this may need fewer or more adaptations at the API surface level.

So I'd say that option 1 makes usage more easily accessible, while option 2 allows for a more scoped (and arguably better) design. I see that various functions in Ark already take a fairly large number of dependencies, so it looks like the project would be willing to add another one. Let me know your thoughts on this front.
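To make option 2 a bit more concrete, here is a minimal sketch of what passing a dedicated registry around could look like. This is purely illustrative on my part; the type and metric names are made up and not taken from the Ark code base.

package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// serverMetrics is a hypothetical holder for Ark's metrics.
type serverMetrics struct {
	backupAttempts *prometheus.CounterVec
}

func newServerMetrics(reg prometheus.Registerer) *serverMetrics {
	m := &serverMetrics{
		backupAttempts: prometheus.NewCounterVec(
			prometheus.CounterOpts{
				Name: "ark_backup_attempt_total",
				Help: "Number of backup attempts",
			},
			[]string{"schedule"},
		),
	}
	reg.MustRegister(m.backupAttempts)
	return m
}

func main() {
	// Option 2: our own registry, handed through as an explicit dependency.
	reg := prometheus.NewRegistry()
	metrics := newServerMetrics(reg)
	metrics.backupAttempts.WithLabelValues("daily").Inc()

	// Only metrics registered on this registry are exposed.
	http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
	http.ListenAndServe(":8085", nil)
}

Option 1 would instead register on the default registry (prometheus.MustRegister) and expose promhttp.Handler(), at the cost of shared global state in tests.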

The other question is how we'd design and structure the concrete metrics. My personal experience is that trying to design every detail up front only has so much value; more often than not, I had to make adjustments as I integrated the metrics into a graph or dashboard, and only then realized that an earlier design decision no longer made much sense in a real-world scenario. Therefore, I'd suggest that once we decide on the scoping question, I go ahead and implement a basic prototype to see if it fits our needs. That would form the basis for a PR, which we could then use to discuss the metrics and other implementation details.

What do you think? Does that make sense to you?

ncdc commented 6 years ago

@timoreimann this makes sense, thanks!

FYI I did stumble across this branch where someone was already doing a POC: https://github.com/heptio/ark/compare/master...chadcatlett:stab_at_metrics. Probably worth getting @chadcatlett involved in the discussion 😄

If we do our own registry, would it be a singleton, or would we ever want or need to have multiple?

Another thing to consider is that we're using gRPC based plugins. We'll need to figure out how to get any metrics generated by plugin instances - either by somehow getting their data into the ark server process, or by sending them to a metrics gateway or to a prometheus server directly. Something to think about...

timoreimann commented 6 years ago

@ncdc I don't think we'll need more than one registry, so it would basically be a singleton. The design options I outlined above should only affect how (easily) we can test things, IMHO.

gRPC-based plugins are an interesting point. I'm not too familiar with the plugins architecture in Ark. If the binaries are long-running just like the Ark server, then I'd think that having the server component pull in plugin metrics via a dedicated gRPC function on each Prometheus scrape interval could be an option. OTOH, if the binaries are invoked at certain times only and terminate again while the Ark server keeps on running, letting them submit their metrics through the Prometheus pushgateway makes more sense.

AFAIK, the pushgateway doesn't come with a retention period that cleans up metrics automatically. So when in doubt, the regular pull mechanism seems preferable.
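For completeness, if we did go the push route for short-lived plugin binaries, the client library's push package could be used right before the plugin exits. This is just a sketch of that alternative (the metric, job name, and pushgateway address are made up), not a recommendation:

package main

import (
	"log"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/push"
)

func main() {
	// Hypothetical plugin-side counter.
	itemsBackedUp := prometheus.NewCounter(prometheus.CounterOpts{
		Name: "ark_plugin_items_backed_up_total",
		Help: "Items processed by this plugin invocation",
	})
	itemsBackedUp.Add(42)

	// Push once before the plugin process terminates. The pushgateway keeps
	// the sample until it is deleted or overwritten (no automatic retention).
	err := push.New("http://pushgateway:9091", "ark_plugin").
		Collector(itemsBackedUp).
		Grouping("plugin", "example-object-store").
		Push()
	if err != nil {
		log.Printf("could not push metrics: %v", err)
	}
}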

@chadcatlett: I see you have already produced some code. Is there already a PR somewhere? Happy to collaborate / coordinate to get the code into Ark. :-)

ncdc commented 6 years ago

Currently, some plugins are long-lived, while others are very short-lived. Our plan, however, is to make all plugins short-lived, as that should simplify how we manage them.

My ideal setup would be to have the Ark server pull metrics from a plugin just before it terminates the plugin. I'm not sure if there's an easy way to do this with Prometheus. I stumbled across

https://github.com/prometheus/client_golang/blob/e11c6ff8170beca9d5fd8b938e71165eeec53ac6/prometheus/examples_test.go#L648-L705

which uses a custom Gatherer, a model we could perhaps consider (or something like it).
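As a rough illustration of that model (a sketch only; the types and the plugin pull mechanism are hypothetical): the server could cache the metric families it pulls from a plugin before terminating it, expose the cache through a type implementing prometheus.Gatherer, and merge it with the main registry using prometheus.Gatherers:

package metrics

import (
	"net/http"
	"sync"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	dto "github.com/prometheus/client_model/go"
)

// pluginGatherer caches the last metric families pulled from plugins.
// How the server actually pulls them (e.g. over gRPC) is not shown here.
type pluginGatherer struct {
	mu       sync.Mutex
	families []*dto.MetricFamily
}

// Gather implements prometheus.Gatherer by returning the cached families.
func (g *pluginGatherer) Gather() ([]*dto.MetricFamily, error) {
	g.mu.Lock()
	defer g.mu.Unlock()
	return g.families, nil
}

// Store would be called by the server just before it terminates a plugin.
func (g *pluginGatherer) Store(fams []*dto.MetricFamily) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.families = fams
}

// Handler merges the server's own registry with the cached plugin metrics.
func Handler(reg *prometheus.Registry, plugins *pluginGatherer) http.Handler {
	return promhttp.HandlerFor(prometheus.Gatherers{reg, plugins}, promhttp.HandlerOpts{})
}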

I'd prefer to avoid requiring setting up an external component such as the pushgateway as the sole means of gathering plugin metrics. It just doesn't feel right to me.

timoreimann commented 6 years ago

Pulling metrics just before the plugin terminates sounds like a viable option if we are able to control the plugin lifecycle.

ncdc commented 6 years ago

Yes, we're 100% in control (assuming the plugin doesn't crash).

jrnt30 commented 6 years ago

We just came back to this today too and would be interested in its progress, and potentially some hands-on-keyboard implementation in my ever-abundant free time :)

For what it's worth, I would lean towards simplicity first here to determine what and how we want to use it, much in line with a few of your observations, @timoreimann. Would it be viable to start with the global registry, without plugin considerations? I think that would give us the biggest bang for our buck. To your point, I could see a more generalized TracingContext argument being valuable, perhaps expanding beyond just a Prometheus exporter.

I haven't researched the plugin model closely enough, but I could see a fairly simple first pass at those metrics being plugin-level RED metrics, giving us visibility into plugin execution duration broken out by plugin name. If we later determine we need more granular, trace-like information about plugin internals, we could create a proper interface for the plugins that allows exposing that data.
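A minimal sketch of that first pass might look like the following; the metric name, label, and buckets are illustrative only, and runPlugin stands in for wherever Ark actually invokes a plugin:

package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var pluginDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name: "ark_plugin_execution_seconds",
		Help: "Duration of plugin executions",
		// Rough buckets from 100ms up to several minutes; would need tuning.
		Buckets: prometheus.ExponentialBuckets(0.1, 2, 12),
	},
	[]string{"plugin_name"},
)

func init() {
	// Using the global registry, per the "simplicity first" suggestion above.
	prometheus.MustRegister(pluginDuration)
}

// runPlugin is a hypothetical wrapper around a plugin invocation.
func runPlugin(name string, fn func() error) error {
	timer := prometheus.NewTimer(pluginDuration.WithLabelValues(name))
	defer timer.ObserveDuration()
	return fn()
}

func main() {
	_ = runPlugin("example-object-store", func() error {
		time.Sleep(50 * time.Millisecond) // simulated plugin work
		return nil
	})
}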

I have similar concerns about the pushgateway integration, given the integration and implementation complexity it may involve.

Would be happy to chat about this with anyone who is passionate about seeing it happen!

ncdc commented 6 years ago

@jrnt30 as long as we have a high-level vision of how we plan to integrate plugin process metrics into the overall metrics registry, I would be fine starting with just the Ark server. But I would hesitate to use the global registry if we can identify up front that it's impossible to integrate metrics from external sources into it easily.

SleepyBrett commented 6 years ago

As a user, I'd love a top-line metric for each schedule that just counts backup successes and failures. I only discovered by chance that our backups have been failing for weeks.

rosskukulinski commented 6 years ago

I'm not familiar enough with the Ark plugin model nor the prometheus client to weigh in on those topics.

That said, I think it's important to differentiate between the consumers of Ark metrics.

@ncdc - as an Ark developer, you likely care about visibility and performance metrics of Ark internals based on the environment it's running in (e.g. how many objects to backup, and how long does it take)

A plugin developer likely will want to have their own metrics as well that are focused on their development process. These metrics need to be namespaced so that two plugins don't collide on metrics. (namespacing could be via different metric name, or via Labeling)

Ark users, however, probably care more about RED metrics (Rate: how many backups have been taken; Errors: how many backups failed; Duration: how long backups took to complete). @SleepyBrett also makes an excellent point that we may want metrics labeled by Backup name and/or Schedule name.

Quick pass at metrics & types, feedback welcome

# Metric labels for all include:
# * {backup="foo"} -- The name of the backup
# * {schedule="bar"} -- The name of the schedule that triggered this backup

# ark_info used to provide version and other meta data about this Ark instance
# * {version:"v0.9.0"}
# * {branch:"master"}
ark_info - gauge # always set to 1

# ark_backup totals for overall success/failure tracking
ark_backup_attempts - counter
ark_backup_completions - counter
ark_backup_failures - counter

# ark_backup_seconds describes the duration of a backup from start to finish
ark_backup_seconds -- histogram
ark_backup_seconds_count -- counter
ark_backup_seconds_sum -- counter

# ark_backup_size describes the size of the resulting tarball
ark_backup_tarball_size_bytes - histogram
ark_backup_tarball_size_count - counter
ark_backup_tarball_size_sum - counter

# ark_backup_tarball_write_seconds describes the time to write the tarball to disk
ark_backup_tarball_write_seconds
ark_backup_tarball_write_seconds_count
ark_backup_tarball_write_seconds_sum

# ark_backup_tarball_upload_seconds describes the time to upload the tarball
ark_backup_tarball_upload_seconds
ark_backup_tarball_upload_seconds_count
ark_backup_tarball_upload_seconds_sum

# ark_backup_actions_seconds describes the time to run custom backup actions
# Extra labels: action_name
ark_backup_actions_seconds -- histograms
ark_backup_actions_seconds_count -- counter
ark_backup_actions_seconds_sum -- counter

### Now restores

# ark_restore totals for overall success/failure tracking
ark_restore_attempts - counter
ark_restore_completions - counter
ark_restore_failures - counter

# ark_restore_seconds describes the duration of a restore from start to finish
ark_restore_seconds -- histogram
ark_restore_seconds_count -- counter
ark_restore_seconds_sum -- counter

# ark_restore_size describes the size of the tarball
ark_restore_tarball_size_bytes - histogram
ark_restore_tarball_size_count - counter
ark_restore_tarball_size_sum - counter

# ark_restore_tarball_download_seconds describes the time to download the tarball
ark_restore_tarball_download_seconds
ark_restore_tarball_download_seconds_count
ark_restore_tarball_download_seconds_sum

# ark_restore_restorers_seconds describes the time to run custom restorers
# Extra labels: restorer_name
ark_restore_restorers_seconds -- histograms
ark_restore_restorers_seconds_count -- counter
ark_restore_restorers_seconds_sum -- counter
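If it helps, here is a rough sketch of declaring a couple of these with the Go client. The label set and buckets are guesses on my part, and note that the _count and _sum series are produced automatically by the histogram type, so only the histograms themselves need to be declared:

package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	backupAttempts = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "ark_backup_attempts",
			Help: "Total number of attempted backups",
		},
		[]string{"backup", "schedule"},
	)
	backupSeconds = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "ark_backup_seconds",
			Help:    "Duration of a backup from start to finish",
			Buckets: prometheus.ExponentialBuckets(1, 2, 14), // ~1s up to ~2h, a guess
		},
		[]string{"backup", "schedule"},
	)
)

// Register adds the metrics to the given registry (or prometheus.DefaultRegisterer).
func Register(reg prometheus.Registerer) {
	reg.MustRegister(backupAttempts, backupSeconds)
}

// Usage at the end of a backup, roughly:
//   backupAttempts.WithLabelValues(backupName, scheduleName).Inc()
//   backupSeconds.WithLabelValues(backupName, scheduleName).Observe(elapsed.Seconds())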

tomplus commented 6 years ago

Thumbs up for the RED metrics.

xamox commented 6 years ago

RED metrics are something I'm looking for as well, I want to know if backups are failing most importantly, but all the other things mentioned would be a plus.

corlettb commented 6 years ago

+1

ashish-amarnath commented 6 years ago

Hoping that PR 531 will channel the discussion here, get us the metrics we want, and let us add more metrics in the future. I'm keeping the approach simple and as non-intrusive as possible. Feel free to leave feedback on the PR. I will continue adding the RED metrics that seem most useful for cluster operators.

ashish-amarnath commented 6 years ago

@ncdc You can assign this issue to me as I am working on the PR anyway. I am unable to do it myself.

ncdc commented 6 years ago

Assigned!

rosskukulinski commented 6 years ago

Added another metric to my list:

# ark_info used to provide version and other meta data about this Ark instance
# * {version:"v0.9.0"}
# * {branch:"master"}
ark_info - gauge # always set to 1

RochesterinNYC commented 6 years ago

Tried out the v0.9.0-alpha.2 Ark with the Prom metrics endpoint support added in https://github.com/heptio/ark/pull/531 (getting the metrics at :8085/metrics works like a charm).

However, with the above metrics in place, since ark_backup_failure_total is a cumulative counter, we're not sure how to actually alert on failed backups. After a single failed backup, ark_backup_failure_total would always be >= 1, so an alert condition like "Alert me when ark_backup_failure_total > 0" would stay active forever.

jrnt30 commented 6 years ago

You could use the delta or rate as a way of evaluating how that number has changed over time.

RochesterinNYC commented 6 years ago

@jrnt30 are you referring to the rate and delta functions in the context of Prometheus? We're unfortunately not using Prometheus. We have an in-house metrics system and we're writing integration scripts/daemons that poll Prometheus metrics endpoints for various components in our clusters and sends/writes those metrics to that in-house metrics system.

It sounds like with the Prometheus metrics paradigm, we wouldn't ever expect the Ark /metrics endpoint to return any kind of rate information and the layer that is processing/polling the Prometheus metrics would need to handle that?

jrnt30 commented 6 years ago

That is true, yes. Counters in Prometheus typically only go up, and the query is then responsible for analyzing the "impact" of that. rate and delta are two ways of assessing "has this thing changed in my time window" or "at what rate has it changed".

Something similar could be done with your in-house tooling, but I can understand how, without that context, it would be a bit confusing.

Does that clarify things at all?

RochesterinNYC commented 6 years ago

Yup, that makes sense. With our lack of Prometheus and unwillingness to add all this rate/delta calculation logic into the metrics middle layer (between Prom endpoints and our in-house system), we're probably going to try to emit metrics for "time-since-last-backup" by looking at the GCS bucket backing the Ark backups.

jrnt30 commented 6 years ago

That is one way of doing it, for sure. Another approach, which I took while this was being ironed out, was writing a small utility that uses the Backup CRDs in the cluster itself to determine what the state should be.

A word of caution from what we ran into: if you don't keep an indicator of what the value was the last time you checked, you won't have a good handle on what the new number means. In the end, I suspect you may end up putting in a similar amount of effort to checkpoint that as you would by simply using the Prom metrics. A hacky way of doing it would be to write your last known state back into the cluster in a ConfigMap each time your process "scrapes" the default metrics.
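For anyone who wants to try the Backup CRD route, here is a rough sketch using the dynamic client. The group/version (ark.heptio.com/v1), namespace, and phase value are what I believe the Ark-era Backup CRD used; double-check them against your cluster (newer releases use velero.io/v1):

package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	gvr := schema.GroupVersionResource{Group: "ark.heptio.com", Version: "v1", Resource: "backups"}
	list, err := client.Resource(gvr).Namespace("heptio-ark").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	// Find the creation time of the most recent completed backup.
	var newest time.Time
	for _, item := range list.Items {
		phase, _, _ := unstructured.NestedString(item.Object, "status", "phase")
		if phase != "Completed" {
			continue
		}
		if ts := item.GetCreationTimestamp().Time; ts.After(newest) {
			newest = ts
		}
	}
	if newest.IsZero() {
		fmt.Println("no completed backups found")
		return
	}
	fmt.Printf("time since last completed backup: %s\n", time.Since(newest))
}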

ashish-amarnath commented 6 years ago

@RochesterinNYC: The ideal way to set this up is to do aggregation (think rate in Prometheus land) with a look-back window. Here is an example of how to set up your alerts:

- alert: FailedBackup
  expr: rate(ark_backup_failure_total[62m]) > 0
  for: 30s
  labels:
    severity: "warning"

munai-das commented 6 years ago

@ashish-amarnath Hi, let me know if you need help with this implementation. We need this feature too.

ncdc commented 6 years ago

@munai-das thanks! Please feel free to open a PR to add any metrics you'd like. I'd recommend starting small - pick a single metric or a group of related metrics and do a PR for that.

tomplus commented 6 years ago

I've just installed 0.9.0 and it works very well. I use Prometheus, and these metrics are enough for me to create alerts and diagnose problems. Thank you. :+1:

ncdc commented 5 years ago

Closing now that we have an initial set of metrics implemented. For requests for new metrics, please open new issues. Thanks everyone!

vitobotta commented 5 years ago

Can anyone please clarify how to use these metrics? I have installed Prometheus Operator (which includes Grafana) and then Velero from Helm but I don't know how to access/use the metrics. Thanks

StianOvrevage commented 5 years ago

@vitobotta It's because Velero adds annotations to inform Prometheus about its metrics. However, the Prometheus Operator now discourages the use of annotations in favor of ServiceMonitors and disables scraping of annotated targets by default.

Edit: I followed https://github.com/coreos/prometheus-operator/issues/1547#issuecomment-446711073 on how to add the required config and it seems to successfully scrape Velero.

I made two changes to the config: I changed regex: 'node-exporter' to regex: 'prometheus-operator-prometheus-node-exporter', since that was the name in our cluster, and I added another exception to avoid double-scraping kube-state-metrics:

      - source_labels: [__meta_kubernetes_service_name]
        action: drop
        regex: 'prometheus-operator-kube-state-metrics'

archmangler commented 4 years ago

Is there any complete documentation on monitoring Velero (preferably via Prometheus)?