lrstanley commented 4 years ago

Is your feature request related to a problem? Please describe.

Prometheus metrics are output with the following format today (snippet pulled from live cluster):

vault_route_read_<engine>_{quantile="0.5"} 0.031690001487731934
vault_route_read_<engine>_{quantile="0.9"} 0.05686499923467636
vault_route_read_<engine>_{quantile="0.99"} 0.15240700542926788
vault_route_read_<engine>__sum 568405.0729739927
vault_route_read_<engine>__count 1.2668209e+07

Couple of issues:

Unneeded trailing underscores in metric names.
Unneeded double underscores in metrics names.
Values defaulting to NaN (not shown above) makes querying difficult. You have to do > 0 in all queries that default to NaN. These should default to 0, as it's natively supported in Prometheus to support counter resets (i.e. cluster restarts).
The main issue: secrets engines should not be in the metric names.
- This makes it very difficult to do any kind of querying/dashboarding when you're not using static secrets engines (we have a few hundred secrets engines per cluster, all of which are dynamically created, and have dynamically created policies).
- It means that to query things like "total reads", you have to use Prometheus __name__=~"vault_route_read_.+" label queries. However, this does not work in many situations (e.g. increase() queries). Using __name__ is also heavily discouraged:
- https://github.com/prometheus/prometheus/issues/4562#issuecomment-417447460
  
  Using regexes on metric names is very strongly discouraged, outside of debugging Prometheus performance issues.
- https://www.robustperception.io/whats-in-a-__name__
  
  Finally I'd like to talk about the most common reason users run into this. Users are trying to perform math across multiple metrics using an expression like rate({name=~"pool_.*_total"}[5m]), where there's a regex on the metric name. This is an anti-pattern, as you're meant to know your metric names a-priori. The only time you should use a regex against metric names is when doing performance debugging or metric exploration.
  
  The correct way to handle this is to use a label to distinguish the different pools (or whatever it is you're trying to distinguish), rather than encoding them inside the metric name. If it's not possible to do this in the code, you can workaround using metric_relabel_configs with replace actions to a) extract out the label and then b) set the metric name to a single value.

Describe the solution you'd like

Remove uses of unneeded underscore characters.
Default metrics to 0 rather than NaN.
Move secrets engines and similar types of info currently put into metric names, into labels that allow proper querying. e.g.:

vault_route_read{engine="some-engine", namespace="some-namespace", quantile="0.5"} 0.031690001487731934
vault_route_read{engine="some-engine", namespace="some-namespace", quantile="0.9"} 0.05686499923467636
vault_route_read{engine="some-engine", namespace="some-namespace", quantile="0.99"} 0.15240700542926788
vault_route_read_sum{engine="some-engine", namespace="some-namespace"} 568405.0729739927
vault_route_read_count{engine="some-engine", namespace="some-namespace"} 1.2668209e+07

With the Go libraries for Prometheus, this should actually be easier to implement than what is currently implemented, I suspect.

Describe alternatives you've considered

The only way of getting around this, is to implement metric_relabel_configs definitions on the Prometheus cluster. This would have to be done for each "type" of metric being output by the Vault cluster. This is an anti-pattern though, and shouldn't be necessary.

I'd submit a PR for these things but I simply don't have the time at the moment.

wernerb commented 3 years ago

Any progress on the merge request? or a workaround with metric_relabel_configs?

aphorise commented 3 years ago

Ditto - experiencing as well. A solution is still sought for and any response indicating that it shall be forthcoming would be well appreciated... current matrices naming are not very practical nor do they comply to common standards with most other systems.

azelezni commented 3 years ago

This does the job:

metric_relabel_configs:
  - source_labels: [ __name__ ]
    regex: 'vault_route_([^_]+)_(.+_)(_sum|_count)?'
    replacement: '${2}'
    target_label: mountpoint
  - source_labels: [ __name__ ]
    regex: 'vault_route_([^_]+)_(.+_)(_sum|_count)?'
    replacement: '${1}'
    target_label: action
  - source_labels: [ __name__ ]
    regex: 'vault_route_([^_]+)_(.+_)(_sum|_count)?'
    replacement: 'vault_route${3}'
    target_label: __name__

aphorise commented 2 years ago

@lrstanley - does the solution from @azelezni suffice - and if so perhaps we can get it documented on a related PR and thereafter close this request. WDYT?

lrstanley commented 2 years ago

As described in the initial issue body, I don't think it's sufficient.

Describe alternatives you've considered

The only way of getting around this, is to implement metric_relabel_configs definitions on the Prometheus cluster. This would have to be done for each "type" of metric being output by the Vault cluster. This is an anti-pattern though, and shouldn't be necessary.

Requiring users to utilize relabel configs would also only solve one of the issues, the other being:

Values defaulting to NaN (not shown above) makes querying difficult. You have to do > 0 in all queries that default to NaN. These should default to 0, as it's natively supported in Prometheus to support counter resets (i.e. cluster restarts).

In short, the current metrics implementation with Vault (for Prometheus metrics at least), doesn't follow any standards/best practices and should be revisited.

aphorise commented 2 years ago

@lrstanley I'm curious if you have a suggestion that you may be able to share via a linked PR on how it could be better?

Any correction here would naturally be a disruptive change to existing Prom consumers but I suspect if a proposed solution is already in place and tested (maybe with some later feature flag to enable) then the community would get ample time to benefit from standard / best practises.

lrstanley commented 2 years ago

@lrstanley I'm curious if you have a suggestion that you may be able to share via a linked PR on how it could be better?

Any correction here would naturally be a disruptive change to existing Prom consumers but I suspect if a proposed solution is already in place and tested (maybe with some later feature flag to enable) then the community would get ample time to benefit from standard / best practises.

Most of this is covered in the issue description. I don't have bandwidth at the moment to look at creating a PR.

Thor77 commented 1 year ago

12741 fixes this for audit device metrics and I just created #20190 to fix this for request and rollback metrics.

heatherezell commented 6 months ago

Hi folks! Is this still an issue in newer versions of Vault? Please let me know so I can bubble it up accordingly. Thanks!

Thor77 commented 6 months ago

@hsimon-hashicorp yes, I mentioned two PRs above that would fix this

hashicorp / vault

metrics: secrets engines should be in labels, not metric names (prometheus) #9068

12741 fixes this for audit device metrics and I just created #20190 to fix this for request and rollback metrics.