Emit consul-k8s metrics to Prometheus

pnovotnak commented 4 years ago

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request. Searching for pre-existing feature requests helps us consolidate datapoints for identical requirements into a single place, thank you!
Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

Description

consul-k8s components should publish metrics so that people can monitor and alert on bad conditions.

david-yu commented 4 years ago

Hi @pnovotnak, thanks for your feedback! Could you tell us more about which bad conditions you are hoping to catch with metrics? Also, which monitoring tools are you using (e.g. Prometheus or DataDog)? Similar requests have come in before, such as https://github.com/hashicorp/consul-helm/issues/339, and we've been keeping an eye on them.

pnovotnak commented 4 years ago

We use Prometheus, which is the standard in k8s environments. An endpoint that publishes Prometheus metrics would let us collect them just by adding an annotation to the pod.

At a bare minimum, I would like error and success counters, with each increment representing a fetch & update of a single service. With those I can calculate the error rate as a percentage.
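
To make the counter idea concrete, here is a rough sketch rather than anything from consul-k8s itself. It is written in plain Python for brevity, the metric names `consul_k8s_sync_success_total` and `consul_k8s_sync_errors_total` are hypothetical, and the output is hand-rendered in the Prometheus text format instead of going through a client library.

```python
from dataclasses import dataclass


@dataclass
class SyncCounters:
    """Hypothetical success/error counters, one increment per service sync."""

    success: int = 0
    errors: int = 0

    def record(self, ok: bool) -> None:
        # One call per fetch-and-update of a single service.
        if ok:
            self.success += 1
        else:
            self.errors += 1

    def error_rate_percent(self) -> float:
        total = self.success + self.errors
        # Avoid dividing by zero before the first sync has run.
        return 100.0 * self.errors / total if total else 0.0

    def render(self) -> str:
        # Prometheus text exposition format; the metric names are made up.
        return (
            "# TYPE consul_k8s_sync_success_total counter\n"
            f"consul_k8s_sync_success_total {self.success}\n"
            "# TYPE consul_k8s_sync_errors_total counter\n"
            f"consul_k8s_sync_errors_total {self.errors}\n"
        )


counters = SyncCounters()
for ok in (True, True, True, False):
    counters.record(ok)
print(counters.render(), end="")
print(f"error rate: {counters.error_rate_percent():.1f}%")
```

In practice the percentage would more likely be computed on the Prometheus side from the two counters, so the component only needs to expose the raw totals.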

It would be nice to also have request metrics for both k8s and consul requests, maybe as histogram(s). The error rate should capture the important information here (did the request succeed), but this would give me more granular detail that might be useful in post mortems.
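
As an illustration of what such a histogram could look like, the sketch below assumes the measured quantity is request duration in seconds and that a hypothetical `target` label separates k8s from consul requests. The metric name `request_duration_seconds` and the bucket bounds are also assumptions, and the code is plain Python rather than a real client library.

```python
import bisect


class Histogram:
    """Minimal cumulative histogram in the Prometheus style."""

    def __init__(self, name, target, bounds):
        self.name = name
        self.target = target  # hypothetical label: "k8s" or "consul"
        self.bounds = sorted(bounds)
        # One slot per upper bound, plus a final slot for +Inf.
        self.counts = [0] * (len(self.bounds) + 1)
        self.total_seconds = 0.0

    def observe(self, seconds):
        # bisect_left returns the index of the first bound >= seconds,
        # which matches the "le" (less than or equal) bucket semantics.
        self.counts[bisect.bisect_left(self.bounds, seconds)] += 1
        self.total_seconds += seconds

    def cumulative(self):
        # Prometheus buckets are cumulative: each includes all smaller ones.
        running, out = 0, []
        for n in self.counts:
            running += n
            out.append(running)
        return out

    def render(self):
        labels = [str(b) for b in self.bounds] + ["+Inf"]
        lines = [f"# TYPE {self.name} histogram"]
        for le, n in zip(labels, self.cumulative()):
            lines.append(f'{self.name}_bucket{{target="{self.target}",le="{le}"}} {n}')
        lines.append(f'{self.name}_sum{{target="{self.target}"}} {self.total_seconds}')
        lines.append(f'{self.name}_count{{target="{self.target}"}} {sum(self.counts)}')
        return "\n".join(lines) + "\n"


k8s = Histogram("request_duration_seconds", "k8s", [0.05, 0.5, 1.0])
for seconds in (0.03, 0.2, 1.7):
    k8s.observe(seconds)
print(k8s.render(), end="")
```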

filintod commented 4 years ago

We also recently hit an issue where the agent could not connect to the kube API server for some reason (it worked after a restart) and continuously logged:

ERROR: logging before flag.Parse: E0713 05:59:29.636613 6 reflector.go:205] pkg/mod/k8s.io/client-go@v8.0.0+incompatible/tools/cache/reflector.go:99: Failed to list *v1.Service: Get https://x.y.z.a:443/api/v1/namespaces/tetris/services?limit=500&resourceVersion=0: dial tcp x.y.z.a:443: connect: connection refused

There should be metrics for succeeded and failed connections to the kube API server.
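
A small sketch of that idea, under stated assumptions: the metric name `kube_api_connections_total` and its `result` label are hypothetical, and the dial is passed in as a callable so the example runs without a real API server. This is plain Python for illustration, not consul-k8s code.

```python
from collections import Counter


def counted_dial(results, dial):
    """Run one connection attempt and count it as succeeded or failed."""
    try:
        dial()
    except OSError:
        # ConnectionRefusedError and socket timeouts are OSError subclasses.
        results["failed"] += 1
        return False
    results["succeeded"] += 1
    return True


def refuse():
    # Stand-in for a dial that hits "connection refused".
    raise ConnectionRefusedError("connect: connection refused")


results = Counter()
counted_dial(results, lambda: None)  # stand-in for a successful dial
counted_dial(results, refuse)
counted_dial(results, refuse)
for result in ("succeeded", "failed"):
    print(f'kube_api_connections_total{{result="{result}"}} {results[result]}')
```

A real dial could be something like `lambda: socket.create_connection((host, 443), timeout=5).close()`. A steadily growing `failed` count with a flat `succeeded` count would have made the situation in the log above alertable.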

hashicorp / consul-k8s

Emit consul-k8s metrics to Prometheus #249
