influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

Version 1.13+ Prometheus module does not export all Consul plug-in metrics. #7387

Open ezombie opened 4 years ago

ezombie commented 4 years ago

After upgrading from 1.12.6 to 1.13+ (1.14.1 is affected too), I see this picture:

[Screenshot: Screenshot_2020-04-22_21-26-19]

Grafana query: sum by (service_name)(consul_health_checks_passing{service_name!=""})

telegraf --config telegraf.conf --config ./telegraf.d/basic_inputs.conf --config ./telegraf.d/consul.conf --test | grep a-B

> consul_health_checks,24fcfa57-97a5-48b4-870a-f3eac365c00d=24fcfa57-97a5-48b4-870a-f3eac365c00d,check_id=service:24fcfa57-97a5-48b4-870a-f3eac365c00d,host=db,xxx=xxx,node=wrk,service_name=a-B check_name="Service 'a-B' check",critical=0i,passing=1i,service_id="24fcfa57-97a5-48b4-870a-f3eac365c00d",status="passing",warning=0i 1587404134000000000
> consul_health_checks,3ac69a89-d337-41c1-8576-fbfae965ce5d=3ac69a89-d337-41c1-8576-fbfae965ce5d,check_id=service:3ac69a89-d337-41c1-8576-fbfae965ce5d,host=db,xxx=xxx,node=wrk,service_name=a-B check_name="Service 'a-B' check",critical=0i,passing=1i,service_id="3ac69a89-d337-41c1-8576-fbfae965ce5d",status="passing",warning=0i 1587404134000000000
> consul_health_checks,5aa02f28-abc4-4e4f-b9cd-5671407b0fb4=5aa02f28-abc4-4e4f-b9cd-5671407b0fb4,check_id=service:5aa02f28-abc4-4e4f-b9cd-5671407b0fb4,host=db,xxx=xxx,node=wrk,service_name=a-B check_name="Service 'a-B' check",critical=0i,passing=1i,service_id="5aa02f28-abc4-4e4f-b9cd-5671407b0fb4",status="passing",warning=0i 1587404134000000000
> consul_health_checks,92ba66ab-686a-433b-9e68-dcd9bd4beec9=92ba66ab-686a-433b-9e68-dcd9bd4beec9,check_id=service:92ba66ab-686a-433b-9e68-dcd9bd4beec9,host=db,xxx=xxx,node=wrk,service_name=a-B check_name="Service 'a-B' check",critical=0i,passing=1i,service_id="92ba66ab-686a-433b-9e68-dcd9bd4beec9",status="passing",warning=0i 1587404134000000000
> consul_health_checks,b2181ba8-6dd2-4b10-a3a2-41cd72f42379=b2181ba8-6dd2-4b10-a3a2-41cd72f42379,check_id=service:b2181ba8-6dd2-4b10-a3a2-41cd72f42379,host=db,xxx=xxx,node=wrk,service_name=a-B check_name="Service 'a-B' check",critical=0i,passing=1i,service_id="b2181ba8-6dd2-4b10-a3a2-41cd72f42379",status="passing",warning=0i 1587404134000000000
> consul_health_checks,be6e626f-bbd2-475b-b3fe-ad1550590eba=be6e626f-bbd2-475b-b3fe-ad1550590eba,check_id=service:be6e626f-bbd2-475b-b3fe-ad1550590eba,host=db,xxx=xxx,node=wrk,service_name=a-B check_name="Service 'a-B' check",critical=0i,passing=1i,service_id="be6e626f-bbd2-475b-b3fe-ad1550590eba",status="passing",warning=0i 1587404134000000000
> consul_health_checks,check_id=service:deeac742-113d-41f8-b899-97608b9550a4,deeac742-113d-41f8-b899-97608b9550a4=deeac742-113d-41f8-b899-97608b9550a4,host=db,xxx=xxx,node=wrk,service_name=a-B check_name="Service 'a-B' check",critical=0i,passing=1i,service_id="deeac742-113d-41f8-b899-97608b9550a4",status="passing",warning=0i 1587404134000000000
> consul_health_checks,check_id=service:ff26db26-a4ae-49d2-90cb-ce961aa3adc0,ff26db26-a4ae-49d2-90cb-ce961aa3adc0=ff26db26-a4ae-49d2-90cb-ce961aa3adc0,host=db,xxx=xxx,node=wrk,service_name=a-B check_name="Service 'a-B' check",critical=0i,passing=1i,service_id="ff26db26-a4ae-49d2-90cb-ce961aa3adc0",status="passing",warning=0i 1587404134000000000

curl http://127.0.0.1:9273/metrics | grep -i a-B

consul_health_checks_critical{ac69a89_d337_41c1_8576_fbfae965ce5d="3ac69a89-d337-41c1-8576-fbfae965ce5d",check_id="service:3ac69a89-d337-41c1-8576-fbfae965ce5d",dc="DC",env="prod",host="db",xxx="xxx",node="wrk",service_name="a-B"} 0
consul_health_checks_critical{aa02f28_abc4_4e4f_b9cd_5671407b0fb4="5aa02f28-abc4-4e4f-b9cd-5671407b0fb4",check_id="service:5aa02f28-abc4-4e4f-b9cd-5671407b0fb4",dc="DC",env="prod",host="db",xxx="xxx",node="wrk",service_name="a-B"} 0
consul_health_checks_critical{b2181ba8_6dd2_4b10_a3a2_41cd72f42379="b2181ba8-6dd2-4b10-a3a2-41cd72f42379",check_id="service:b2181ba8-6dd2-4b10-a3a2-41cd72f42379",dc="DC",env="prod",host="db",xxx="xxx",node="wrk",service_name="a-B"} 0
consul_health_checks_critical{be6e626f_bbd2_475b_b3fe_ad1550590eba="be6e626f-bbd2-475b-b3fe-ad1550590eba",check_id="service:be6e626f-bbd2-475b-b3fe-ad1550590eba",dc="DC",env="prod",host="db",xxx="xxx",node="wrk",service_name="a-B"} 0
consul_health_checks_critical{check_id="service:deeac742-113d-41f8-b899-97608b9550a4",dc="DC",deeac742_113d_41f8_b899_97608b9550a4="deeac742-113d-41f8-b899-97608b9550a4",env="prod",host="db",xxx="xxx",node="wrk",service_name="a-B"} 0
consul_health_checks_critical{check_id="service:ff26db26-a4ae-49d2-90cb-ce961aa3adc0",dc="DC",env="prod",ff26db26_a4ae_49d2_90cb_ce961aa3adc0="ff26db26-a4ae-49d2-90cb-ce961aa3adc0",host="db",xxx="xxx",node="wrk",service_name="a-B"} 0

consul_health_checks_passing{ac69a89_d337_41c1_8576_fbfae965ce5d="3ac69a89-d337-41c1-8576-fbfae965ce5d",check_id="service:3ac69a89-d337-41c1-8576-fbfae965ce5d",dc="DC",env="prod",host="db",xxx="xxx",node="wrk",service_name="a-B"} 1
consul_health_checks_passing{aa02f28_abc4_4e4f_b9cd_5671407b0fb4="5aa02f28-abc4-4e4f-b9cd-5671407b0fb4",check_id="service:5aa02f28-abc4-4e4f-b9cd-5671407b0fb4",dc="DC",env="prod",host="db",xxx="xxx",node="wrk",service_name="a-B"} 1
consul_health_checks_passing{b2181ba8_6dd2_4b10_a3a2_41cd72f42379="b2181ba8-6dd2-4b10-a3a2-41cd72f42379",check_id="service:b2181ba8-6dd2-4b10-a3a2-41cd72f42379",dc="DC",env="prod",host="db",xxx="xxx",node="wrk",service_name="a-B"} 1
consul_health_checks_passing{be6e626f_bbd2_475b_b3fe_ad1550590eba="be6e626f-bbd2-475b-b3fe-ad1550590eba",check_id="service:be6e626f-bbd2-475b-b3fe-ad1550590eba",dc="DC",env="prod",host="db",xxx="xxx",node="wrk",service_name="a-B"} 1
consul_health_checks_passing{check_id="service:deeac742-113d-41f8-b899-97608b9550a4",dc="DC",deeac742_113d_41f8_b899_97608b9550a4="deeac742-113d-41f8-b899-97608b9550a4",env="prod",host="db",xxx="xxx",node="wrk",service_name="a-B"} 1
consul_health_checks_passing{check_id="service:ff26db26-a4ae-49d2-90cb-ce961aa3adc0",dc="DC",env="prod",ff26db26_a4ae_49d2_90cb_ce961aa3adc0="ff26db26-a4ae-49d2-90cb-ce961aa3adc0",host="db",xxx="xxx",node="wrk",service_name="a-B"} 1

consul_health_checks_warning{ac69a89_d337_41c1_8576_fbfae965ce5d="3ac69a89-d337-41c1-8576-fbfae965ce5d",check_id="service:3ac69a89-d337-41c1-8576-fbfae965ce5d",dc="DC",env="prod",host="db",xxx="xxx",node="wrk",service_name="a-B"} 0
consul_health_checks_warning{aa02f28_abc4_4e4f_b9cd_5671407b0fb4="5aa02f28-abc4-4e4f-b9cd-5671407b0fb4",check_id="service:5aa02f28-abc4-4e4f-b9cd-5671407b0fb4",dc="DC",env="prod",host="db",xxx="xxx",node="wrk",service_name="a-B"} 0
consul_health_checks_warning{b2181ba8_6dd2_4b10_a3a2_41cd72f42379="b2181ba8-6dd2-4b10-a3a2-41cd72f42379",check_id="service:b2181ba8-6dd2-4b10-a3a2-41cd72f42379",dc="DC",env="prod",host="db",xxx="xxx",node="wrk",service_name="a-B"} 0
consul_health_checks_warning{be6e626f_bbd2_475b_b3fe_ad1550590eba="be6e626f-bbd2-475b-b3fe-ad1550590eba",check_id="service:be6e626f-bbd2-475b-b3fe-ad1550590eba",dc="DC",env="prod",host="db",xxx="xxx",node="wrk",service_name="a-B"} 0
consul_health_checks_warning{check_id="service:deeac742-113d-41f8-b899-97608b9550a4",dc="DC",deeac742_113d_41f8_b899_97608b9550a4="deeac742-113d-41f8-b899-97608b9550a4",env="prod",host="db",xxx="xxx",node="wrk",service_name="a-B"} 0
consul_health_checks_warning{check_id="service:ff26db26-a4ae-49d2-90cb-ce961aa3adc0",dc="DC",env="prod",ff26db26_a4ae_49d2_90cb_ce961aa3adc0="ff26db26-a4ae-49d2-90cb-ce961aa3adc0",host="db",xxx="xxx",node="wrk",service_name="a-B"} 0

System info:

Telegraf 1.12.6 and 1.13.0, Consul v1.6.2, Prometheus 2.15.2, CentOS 7.8

Steps to reproduce:

Upgrade from 1.12.6 to 1.13+.

Expected behavior:

Actual behavior:

Additional info:

danielnelson commented 4 years ago

Can you add your prometheus_client output plugin configuration?

ezombie commented 4 years ago

cat /etc/telegraf/telegraf.d/prometheus.conf
# Configuration for the Prometheus client to spawn
[[outputs.prometheus_client]]
  ## Address to listen on
  listen = "0.0.0.0:9273"
  expiration_interval = "10s"
  string_as_label = false
#  metric_version = 2

danielnelson commented 4 years ago

Looking into this a bit closer, the issue appears to be that labels starting with 0-9 are illegal in the Prometheus format and are rejected by the official library. In Telegraf 1.13 we updated the library, and it has become stricter about rejecting these.

If you switch to metric_version = 2, it should output the metrics that don't have any labels starting with a number, but it will still drop those that do.

I think the best way forward is to adjust the consul input to avoid these types of tags. What if you disable the tag_delimiter option in the consul input?

ezombie commented 4 years ago

On version 1.14.2, with the consul.conf shown below, the problem persists.

A possible solution would be to add an option to the consul module that renames the metrics according to a template that the Prometheus library accepts.

[[inputs.consul]]
    interval = "10s"
    datacentre = "dc"
    address = "consul:8500"

danielnelson commented 4 years ago

Having UUIDs as the tag key is not an ideal setup for any output, so I think we can come up with a better strategy for creating metrics. Can you show the output of telegraf --input-filter consul --test | grep a-B using the configuration without tag_delimiter?

ezombie commented 4 years ago

2020-04-30T18:18:01Z I! Starting Telegraf 1.14.2
> consul_health_checks,2947540c-a0bb-4549-b76d-0b6188036b8a=2947540c-a0bb-4549-b76d-0b6188036b8a,check_id=service:2947540c-a0bb-4549-b76d-0b6188036b8a,host=XXX,n=n,node=XXX,service_name=YYY check_name="Service 'YYY' check",critical=0i,passing=1i,service_id="2947540c-a0bb-4549-b76d-0b6188036b8a",status="passing",warning=0i 1588270681000000000
> consul_health_checks,89427f45-2034-4c08-a12a-bb17baf0fb8d=89427f45-2034-4c08-a12a-bb17baf0fb8d,check_id=service:89427f45-2034-4c08-a12a-bb17baf0fb8d,host=XXX,n=n,node=XXX,service_name=YYY check_name="Service 'YYY' check",critical=0i,passing=1i,service_id="89427f45-2034-4c08-a12a-bb17baf0fb8d",status="passing",warning=0i 1588270681000000000
> consul_health_checks,9ea180bd-9bc1-4739-b3b0-7c9d479124b6=9ea180bd-9bc1-4739-b3b0-7c9d479124b6,check_id=service:9ea180bd-9bc1-4739-b3b0-7c9d479124b6,host=XXX,n=n,node=XXX,service_name=YYY check_name="Service 'YYY' check",critical=0i,passing=1i,service_id="9ea180bd-9bc1-4739-b3b0-7c9d479124b6",status="passing",warning=0i 1588270681000000000
> consul_health_checks,af96bbf7-eb1f-4282-8acb-dd3890e40d20=af96bbf7-eb1f-4282-8acb-dd3890e40d20,check_id=service:af96bbf7-eb1f-4282-8acb-dd3890e40d20,host=XXX,n=n,node=XXX,service_name=YYY check_name="Service 'YYY' check",critical=0i,passing=1i,service_id="af96bbf7-eb1f-4282-8acb-dd3890e40d20",status="passing",warning=0i 1588270681000000000
> consul_health_checks,bf0d639f-b667-43ba-8d55-11c444229b80=bf0d639f-b667-43ba-8d55-11c444229b80,check_id=service:bf0d639f-b667-43ba-8d55-11c444229b80,host=XXX,n=n,node=XXX,service_name=YYY check_name="Service 'YYY' check",critical=0i,passing=1i,service_id="bf0d639f-b667-43ba-8d55-11c444229b80",status="passing",warning=0i 1588270681000000000
> consul_health_checks,check_id=service:d1359ddb-462d-4efe-969f-4bd0032b0d31,d1359ddb-462d-4efe-969f-4bd0032b0d31=d1359ddb-462d-4efe-969f-4bd0032b0d31,host=XXX,n=n,node=XXX,service_name=YYY check_name="Service 'YYY' check",critical=0i,passing=1i,service_id="d1359ddb-462d-4efe-969f-4bd0032b0d31",status="passing",warning=0i 1588270681000000000
> consul_health_checks,check_id=service:e630c30b-2b28-49e5-895c-dcc1d3ac971e,e630c30b-2b28-49e5-895c-dcc1d3ac971e=e630c30b-2b28-49e5-895c-dcc1d3ac971e,host=XXX,n=n,node=XXX,service_name=YYY check_name="Service 'YYY' check",critical=0i,passing=1i,service_id="e630c30b-2b28-49e5-895c-dcc1d3ac971e",status="passing",warning=0i 1588270681000000000
> consul_health_checks,check_id=service:f32802d2-9c2b-4b7e-b3c3-067f3efc4dc6,f32802d2-9c2b-4b7e-b3c3-067f3efc4dc6=f32802d2-9c2b-4b7e-b3c3-067f3efc4dc6,host=XXX,n=n,node=XXX,service_name=YYY check_name="Service 'YYY' check",critical=0i,passing=1i,service_id="f32802d2-9c2b-4b7e-b3c3-067f3efc4dc6",status="passing",warning=0i 1588270681000000000
> consul_health_checks,check_id=service:f7e94e9c-2054-45c4-a362-46b78fafedd1,f7e94e9c-2054-45c4-a362-46b78fafedd1=f7e94e9c-2054-45c4-a362-46b78fafedd1,host=XXX,n=n,node=XXX,service_name=YYY check_name="Service 'YYY' check",critical=0i,passing=1i,service_id="f7e94e9c-2054-45c4-a362-46b78fafedd1",status="passing",warning=0i 1588270681000000000
danielnelson commented 4 years ago

Can you run this query against the Consul HTTP API to get the raw JSON for one of the check_ids that produce a UUID tag key:

curl -G http://consul:8500/v1/health/state/any --data-urlencode 'filter=CheckID == "service:2947540c-a0bb-4549-b76d-0b6188036b8a"'

danielnelson commented 4 years ago

Quick follow-up, what I'm expecting to see is that you have ServiceTags like:

"ServiceTags": [
    "2947540c-a0bb-4549-b76d-0b6188036b8a"
],

I'm far from a Consul expert, so to me tags like this seem a bit odd. Can you tell me a bit about how you use this type of tag?

ezombie commented 4 years ago

curl -G http://consul:8500/v1/health/state/any --data-urlencode 'filter=CheckID == "service:2947540c-a0bb-4549-b76d-0b6188036b8a"'
[{"Node":"XXX","CheckID":"service:2947540c-a0bb-4549-b76d-0b6188036b8a","Name":"Service 'YYY' check","Status":"passing","Notes":"","Output":"HTTP GET http://127.0.0.1:41615/health/?service=2947540c-a0bb-4549-b76d-0b6188036b8a: 200 OK Output: ","ServiceID":"2947540c-a0bb-4549-b76d-0b6188036b8a","ServiceName":"YYY","ServiceTags":["n","2947540c-a0bb-4549-b76d-0b6188036b8a"],"Type":"http","Definition":{},"CreateIndex":452885815,"ModifyIndex":452885837}]
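The response confirms that the UUID arrives as an entry in ServiceTags. As a rough way to spot such entries in a response like this one, here is a minimal Python sketch. The helper name digit_leading_service_tags is hypothetical, and RESPONSE is a trimmed copy of the JSON above rather than a live query.

```python
import json

# Trimmed copy of the health-check response shown above (not a live query).
RESPONSE = '''[{"Node":"XXX",
"CheckID":"service:2947540c-a0bb-4549-b76d-0b6188036b8a",
"ServiceID":"2947540c-a0bb-4549-b76d-0b6188036b8a",
"ServiceName":"YYY",
"ServiceTags":["n","2947540c-a0bb-4549-b76d-0b6188036b8a"]}]'''


def digit_leading_service_tags(raw):
    """Return every ServiceTags entry that starts with a digit (hypothetical helper)."""
    checks = json.loads(raw)
    return [
        tag
        for check in checks
        for tag in (check.get("ServiceTags") or [])
        if tag[:1].isdigit()
    ]


print(digit_leading_service_tags(RESPONSE))
```

Only the UUID entry is reported here; the short tag "n" starts with a letter and is left alone.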
danielnelson commented 4 years ago

I think what will be best in your case is to exclude these tags. The information is contained in the check_id tag so adding the UUID is superfluous:

[[inputs.consul]]
  tagexclude = ["[0-9]*"]

As a more general fix, perhaps we should add a new option that matches only ServiceTags, similar to how the docker plugin is structured:

[[inputs.consul]]
  service_tag_include = []
  service_tag_exclude = ["[0-9]*"]
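To get a feel for which tag keys a glob such as [0-9]* would select, the following Python sketch uses the standard fnmatch module. It assumes Telegraf's glob matching behaves the same way for these simple bracket patterns, which is not verified here; the tag keys are copied from the telegraf --test output earlier in this thread.

```python
from fnmatch import fnmatchcase

# Tag keys copied from the telegraf --test output earlier in the thread.
TAG_KEYS = [
    "2947540c-a0bb-4549-b76d-0b6188036b8a",
    "af96bbf7-eb1f-4282-8acb-dd3890e40d20",
    "check_id",
    "service_name",
]


def matching(pattern, keys):
    """Return the keys that the glob pattern matches (case-sensitive)."""
    return [key for key in keys if fnmatchcase(key, pattern)]


# Keys that start with a digit.
print(matching("[0-9]*", TAG_KEYS))

# A leading '!' inside the brackets negates the class:
# this selects keys that do NOT start with a digit.
print(matching("[!0-9]*", TAG_KEYS))
```

Under these assumptions, [0-9]* selects only the UUID key that begins with a digit. UUID keys that begin with a letter, such as af96bbf7-..., are not selected, and in the curl output above keys of that kind do appear as labels with hyphens turned into underscores.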
ekbfh commented 4 years ago

Hello! I also have this problem. My setup: I have Consul, and I put a unique UUID in the tags meta for each service. Consul allows this, with limits: a key can contain only ASCII characters and no special characters (A-Z a-z 0-9 _ and -). https://www.consul.io/docs/agent/services.html

But Prometheus can't accept labels that start with a digit: label names may contain ASCII letters, numbers, as well as underscores. They must match the regex [a-zA-Z_][a-zA-Z0-9_]* https://prometheus.io/docs/concepts/data_model/#metric-names-and-labels
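The mismatch between the two naming rules can be checked directly against the regex quoted above. This is a small Python sketch, not Prometheus or Telegraf code; the helper name is_valid_label is hypothetical, and the sample keys are tag keys from this thread with '-' already replaced by '_', as seen in the /metrics output.

```python
import re

# Label-name pattern quoted above from the Prometheus data model docs.
LABEL_NAME = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def is_valid_label(name):
    """True if the whole name matches the Prometheus label-name pattern."""
    return LABEL_NAME.fullmatch(name) is not None


# Tag keys from this thread, with '-' already replaced by '_'.
SAMPLES = [
    "service_name",
    "be6e626f_bbd2_475b_b3fe_ad1550590eba",
    "24fcfa57_97a5_48b4_870a_f3eac365c00d",
]

for key in SAMPLES:
    print(key, is_valid_label(key))
```

The first two keys pass because they start with a letter; the last one fails only because of its leading digit.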

Maybe there should be an option that controls whether tags are exported as labels or not? Or a regex for including only some of these tags, not all of them.

At the moment I see that a valid Consul meta configuration can make whole metrics disappear (!!), not just labels.

The same topic has come up in several places: my issue with tags: https://github.com/influxdata/telegraf/issues/5522 and the PR where tags as labels were introduced: https://github.com/influxdata/telegraf/pull/4155

danielnelson commented 4 years ago

@ekbfh What do you think about adding the service_tag_include and service_tag_exclude options above?

ekbfh commented 4 years ago

@danielnelson It might work, if you plan to enable them by default. As I said, I may have this naming in Consul but cannot have it in Prometheus.

Could you also add an option to choose which tags I want to gather? For example: gather_all_tags = true/false, because without this I get higher cardinality.

danielnelson commented 4 years ago

You would be able to exclude all service tags with service_tag_exclude = ["*"].

We should also make sure that the prometheus output just removes tags that it cannot encode as labels, without dropping the whole metric.
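As an illustration of that behaviour (and not the actual Telegraf implementation), a Python sketch could look like the following. The function names sanitize and encodable_labels are hypothetical, and the sanitizing rule (replace anything outside [a-zA-Z0-9_] with '_') is an assumption based on the label names seen in the /metrics output above.

```python
import re

LABEL_NAME = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def sanitize(key):
    """Assumed rule: replace characters outside [a-zA-Z0-9_] with '_'."""
    return re.sub(r"[^a-zA-Z0-9_]", "_", key)


def encodable_labels(tags):
    """Keep tags whose sanitized key is a valid label name; drop only the rest."""
    labels = {}
    for key, value in tags.items():
        name = sanitize(key)
        if LABEL_NAME.fullmatch(name):
            labels[name] = value
    return labels


tags = {
    "service_name": "a-B",
    "check_id": "service:24fcfa57-97a5-48b4-870a-f3eac365c00d",
    "24fcfa57-97a5-48b4-870a-f3eac365c00d": "24fcfa57-97a5-48b4-870a-f3eac365c00d",
}

print(encodable_labels(tags))
```

With this approach the series for a check whose UUID starts with a digit would still be exported, identified by its check_id label, and only the unencodable tag would be missing.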

ekbfh commented 4 years ago

Yes, just removing tags is a good idea

ekbfh commented 4 years ago

Any update?