hashicorp / vault

A tool for secrets management, encryption as a service, and privileged access management
https://www.vaultproject.io/
Other
30.96k stars 4.18k forks source link

metrics: secrets engines should be in labels, not metric names (prometheus) #9068

Open lrstanley opened 4 years ago

lrstanley commented 4 years ago

Is your feature request related to a problem? Please describe.

Prometheus metrics are output with the following format today (snippet pulled from live cluster):

vault_route_read_<engine>_{quantile="0.5"} 0.031690001487731934
vault_route_read_<engine>_{quantile="0.9"} 0.05686499923467636
vault_route_read_<engine>_{quantile="0.99"} 0.15240700542926788
vault_route_read_<engine>__sum 568405.0729739927
vault_route_read_<engine>__count 1.2668209e+07

Couple of issues:

Describe the solution you'd like

vault_route_read{engine="some-engine", namespace="some-namespace", quantile="0.5"} 0.031690001487731934
vault_route_read{engine="some-engine", namespace="some-namespace", quantile="0.9"} 0.05686499923467636
vault_route_read{engine="some-engine", namespace="some-namespace", quantile="0.99"} 0.15240700542926788
vault_route_read_sum{engine="some-engine", namespace="some-namespace"} 568405.0729739927
vault_route_read_count{engine="some-engine", namespace="some-namespace"} 1.2668209e+07

With the Go libraries for Prometheus, this should actually be easier to implement than what is currently implemented, I suspect.

Describe alternatives you've considered

The only way of getting around this, is to implement metric_relabel_configs definitions on the Prometheus cluster. This would have to be done for each "type" of metric being output by the Vault cluster. This is an anti-pattern though, and shouldn't be necessary.

I'd submit a PR for these things but I simply don't have the time at the moment.

wernerb commented 3 years ago

Any progress on the merge request? or a workaround with metric_relabel_configs?

aphorise commented 3 years ago

Ditto - experiencing as well. A solution is still sought for and any response indicating that it shall be forthcoming would be well appreciated... current matrices naming are not very practical nor do they comply to common standards with most other systems.

azelezni commented 3 years ago

This does the job:

metric_relabel_configs:
  - source_labels: [ __name__ ]
    regex: 'vault_route_([^_]+)_(.+_)(_sum|_count)?'
    replacement: '${2}'
    target_label: mountpoint
  - source_labels: [ __name__ ]
    regex: 'vault_route_([^_]+)_(.+_)(_sum|_count)?'
    replacement: '${1}'
    target_label: action
  - source_labels: [ __name__ ]
    regex: 'vault_route_([^_]+)_(.+_)(_sum|_count)?'
    replacement: 'vault_route${3}'
    target_label: __name__
aphorise commented 2 years ago

@lrstanley - does the solution from @azelezni suffice - and if so perhaps we can get it documented on a related PR and thereafter close this request. WDYT?

lrstanley commented 2 years ago

As described in the initial issue body, I don't think it's sufficient.

Describe alternatives you've considered

The only way of getting around this, is to implement metric_relabel_configs definitions on the Prometheus cluster. This would have to be done for each "type" of metric being output by the Vault cluster. This is an anti-pattern though, and shouldn't be necessary.

Requiring users to utilize relabel configs would also only solve one of the issues, the other being:

Values defaulting to NaN (not shown above) makes querying difficult. You have to do > 0 in all queries that default to NaN. These should default to 0, as it's natively supported in Prometheus to support counter resets (i.e. cluster restarts).

In short, the current metrics implementation with Vault (for Prometheus metrics at least), doesn't follow any standards/best practices and should be revisited.

aphorise commented 2 years ago

@lrstanley I'm curious if you have a suggestion that you may be able to share via a linked PR on how it could be better?

Any correction here would naturally be a disruptive change to existing Prom consumers but I suspect if a proposed solution is already in place and tested (maybe with some later feature flag to enable) then the community would get ample time to benefit from standard / best practises.

lrstanley commented 2 years ago

@lrstanley I'm curious if you have a suggestion that you may be able to share via a linked PR on how it could be better?

Any correction here would naturally be a disruptive change to existing Prom consumers but I suspect if a proposed solution is already in place and tested (maybe with some later feature flag to enable) then the community would get ample time to benefit from standard / best practises.

Most of this is covered in the issue description. I don't have bandwidth at the moment to look at creating a PR.

Thor77 commented 1 year ago

12741 fixes this for audit device metrics and I just created #20190 to fix this for request and rollback metrics.

heatherezell commented 6 months ago

Hi folks! Is this still an issue in newer versions of Vault? Please let me know so I can bubble it up accordingly. Thanks!

Thor77 commented 6 months ago

@hsimon-hashicorp yes, I mentioned two PRs above that would fix this