hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Metric names contain duplicated words #9988

Open picatz opened 3 years ago

picatz commented 3 years ago

Nomad version

1.0.3

Issue

This may be related to https://github.com/hashicorp/consul/issues/9732. There seems to be a bug that causes duplicated words in metric names:

[Screenshot: Screen Shot 2021-02-08 at 1 20 38 PM]

Reproduction steps

Enable metrics on the Nomad agents:

```hcl
...
telemetry {
  collection_interval        = "5s"
  disable_hostname           = true
  prometheus_metrics         = true
  publish_allocation_metrics = true
  publish_node_metrics       = true
}
```

Observe that the metrics endpoint can return names with duplicated words for various metric types. For example, two of the counters contain them:

```shell
$ curl http://localhost:4646/v1/metrics | jq '.Counters[].Name'
"nomad.memberlist.tcp.accept"
"nomad.memberlist.udp.received"
"nomad.memberlist.udp.sent"
"nomad.nomad.rpc.accept_conn"
"nomad.nomad.rpc.request"
```

☝️ The first three are labeled as expected. The last two have the duplicated nomad words.
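To check a larger payload for the same pattern, a small script along these lines might help. It is only a sketch: the helper names `duplicated_segments` and `names_with_duplicates` are made up for illustration, and the sample payload assumes only the `Counters[].Name` shape visible in the `jq` filter above.

```python
import json


def duplicated_segments(name):
    """Return the dot-separated segments that are immediately repeated."""
    parts = name.split(".")
    return [a for a, b in zip(parts, parts[1:]) if a == b]


def names_with_duplicates(payload):
    """Return counter names containing a repeated adjacent segment.

    `payload` is the decoded JSON body of /v1/metrics. Only the
    Counters list and its Name field are assumed here.
    """
    return [
        counter["Name"]
        for counter in payload.get("Counters", [])
        if duplicated_segments(counter["Name"])
    ]


# Sample payload built from the counter names listed in this issue.
sample = json.loads("""
{
  "Counters": [
    {"Name": "nomad.memberlist.tcp.accept"},
    {"Name": "nomad.memberlist.udp.received"},
    {"Name": "nomad.memberlist.udp.sent"},
    {"Name": "nomad.nomad.rpc.accept_conn"},
    {"Name": "nomad.nomad.rpc.request"}
  ]
}
""")

flagged = names_with_duplicates(sample)
print(flagged)
```

Against a live agent, the same function could be fed the decoded response body instead of the hard-coded sample.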

Job file (if appropriate)

Example job file to deploy Prometheus + Grafana on Nomad, connected with Consul:

```hcl
variable "datacenters" {
  type    = list(string)
  default = ["dc1"]
}

variable "consul_acl_token" {
  type = string
}

variable "consul_lb_ip" {
  type = string
}

variable "nomad_ca" {
  type    = string
  default = "nomad-ca.pem"
}

variable "nomad_cli" {
  type    = string
  default = "nomad-cli-cert.pem"
}

variable "nomad_cli_key" {
  type    = string
  default = "nomad-cli-key.pem"
}

variable "consul_ca" {
  type    = string
  default = "consul-ca.pem"
}

variable "consul_cli" {
  type    = string
  default = "consul-cli-cert.pem"
}

variable "consul_cli_key" {
  type    = string
  default = "consul-cli-key.pem"
}

job "metrics" {
  datacenters = var.datacenters

  group "prometheus" {
    network {
      mode = "bridge"
    }

    service {
      name = "prometheus"
      port = "9090"

      connect {
        sidecar_service {}
      }
    }

    ephemeral_disk {
      size    = 10240 # 10 GB
      migrate = true
      sticky  = true
    }

    task "prometheus" {
      template {
        change_mode = "restart"
        destination = "local/prometheus.yml"
        data = <
```

tgross commented 3 years ago

I'm not sure I'd consider this a bug (although maybe bad UX or unfortunate name choices). The repeated nomad is referring to two different levels: the application as a whole and the subsystem. So those nomad.nomad metrics are coming from the nomad package within the Nomad application. Whereas nomad.memberlist is coming from the memberlist package, within the Nomad application.

```
nomad
├── memberlist
│   └── tcp
│       └── accept
└── nomad
    └── rpc
        └── accept_conn
```

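The two levels described above can be illustrated with a short sketch. It assumes, for illustration only, that a metric name is the application prefix joined to a package-level key with dots. `metric_name` and `build_tree` are hypothetical helpers, not functions from Nomad or its metrics library.

```python
def metric_name(app_prefix, key):
    """Join the application-wide prefix and a package-level key with dots.

    Assumption for illustration: the prefix is always prepended, even when
    the key itself starts with the same word.
    """
    return ".".join([app_prefix, *key])


def build_tree(names):
    """Nest dotted metric names into a dict-of-dicts hierarchy."""
    tree = {}
    for name in names:
        node = tree
        for part in name.split("."):
            node = node.setdefault(part, {})
    return tree


names = [
    # Key emitted by the memberlist package.
    metric_name("nomad", ["memberlist", "tcp", "accept"]),
    # Key emitted by the nomad package: its first segment repeats the prefix.
    metric_name("nomad", ["nomad", "rpc", "accept_conn"]),
]

tree = build_tree(names)
print(names)
print(tree)
```

Under that assumption, the repeated word is simply what results when a package shares its name with the application.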
picatz commented 3 years ago

Ah, thank you for that clarification! I'll remove the "bug" label. 👍