hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/
Other
14.86k stars 1.95k forks source link

Unsetting Nomad metadata leaks keys #18603

Closed stephaniesac closed 1 year ago

stephaniesac commented 1 year ago

Nomad version

We tested with Nomad v1.5.5, but we think this reproduces in tip.

Operating system and Environment details

Unix

Issue

Unsetting node metadata keys leave them as null on the client node until it restarts. This causes steady increase in memory, and there doesn’t seem to be another way of removing the keys otherwise.

Description

The Update Node Metadata HTTP API allows us to dynamically set key-value metadata. The documentation mentions that one can ‘unset’ keys by requesting updates with value=null.

After waiting 10s for the node data to be propagated to the scheduler, the response from Read Node Metadata HTTP API shows that the unset key is removed from the ‘Meta’ field, but remains in ‘Dynamic’. In our local reproduction, we inspected the RAFT state and confirmed that the scheduler does not track Meta keys with null values, so this issue is limited to client nodes.

The only way to clear unset keys from ‘Dynamic’ was to restart the node. In code, the client clones the ‘Dynamic’ meta map (code) and thus does not end up deleting the keys as we do in standard ‘Meta’.

We suspect that this behavior exists in order to be able to unset static Meta keys. However, keeping keys with null values which are not in ‘Static’ metadata seems meaningless and undesirable because the client accumulates memory indefinitely. Ideally we should only persist keys with null values if those same keys exist in ‘Static’ metadata.

Proposal

Persist ‘Dynamic’ metadata keys with null values only if those same keys exist in ‘Static’ metadata.

Reproduction steps

Start a local cluster with 1 client and 1 server, Nomad v1.5.5:

$ cat server_config.hcl
data_dir=”/tmp”
log_level = "TRACE"
advertise {
  http = "127.0.0.1"
  rpc = "127.0.0.1"
  serf = "127.0.0.1"
}
server {
  enabled = true
  bootstrap_expect = 1
  job_gc_interval = "1m"
  job_gc_threshold = "24h"
  eval_gc_threshold = "1m"
}

$ cat client_config.hcl
data_dir = "/tmp"
log_level = "debug"
advertise {
  http = "127.0.0.1"
  rpc = "127.0.0.1"
  serf = "127.0.0.1"
}
ports {
  http = "9876"
  rpc = "9875"
  serf = "9874"
}
client {
  enabled = true
  servers = ["127.0.0.1"]
  gc_max_allocs = 1
}
plugin "raw_exec" {
  config {
    enabled = true
  }
}

Add a “test-key” metadata with value null that doesn’t exist as static metadata: $ echo "{\"Meta\": { \"test-key\": null}}" | nomad operator api /v1/client/metadata?node_id=NODE_ID

Expected Result

After waiting 10s, $ nomad operator api /v1/client/metadata?node_id=NODE_ID will not include test-key: null in the 'Dynamic' section of metadata since test-key is not in 'Static'.

Actual Result

Regardless of waiting 10s or send SIGHUP to your nomad client, the unset key remains in Dynamic:

$ nomad operator api /v1/client/metadata?node_id=NODE_ID 
{
  "Meta": {
    "connect.gateway_image": "envoyproxy/envoy:v${NOMAD_envoy_version}",
    "connect.log_level": "info",
    "connect.proxy_concurrency": "1",
    "connect.sidecar_image": "envoyproxy/envoy:v${NOMAD_envoy_version}"
  },
  "Dynamic": {
    "test-key": null
  },
  "Static": {
    "connect.gateway_image": "envoyproxy/envoy:v${NOMAD_envoy_version}",
    "connect.log_level": "info",
    "connect.proxy_concurrency": "1",
    "connect.sidecar_image": "envoyproxy/envoy:v${NOMAD_envoy_version}"
  }
}
lgfa29 commented 1 year ago

Thanks for the detailed report @stephaniesac 😄

I've opened #18664 to address this problem and ensure dynamic node metadata keys are removed from client memory when unset.