hashicorp / consul-template

Template rendering, notifier, and supervisor for @HashiCorp Consul and Vault data.

deduplication data can exceed Consul value limit (512 KB) #1135

Open mblakele opened 6 years ago

mblakele commented 6 years ago

Consul Template version

$ consul-template -v
consul-template v0.19.5 (57b6c71)

Configuration

log_level = "warn"

max_stale = "10m"

consul {}

deduplicate {
  enabled = true
  prefix  = "consul-template/my-proxy/dedup/"
}

template {
  source      = "/etc/consul-template.d/haproxy.conf"
  destination = "/etc/haproxy/haproxy.cfg"

  command         = "/opt/consul/bin/consul lock -n=1 locks/fubar /usr/sbin/service haproxy reload"
  command_timeout = "30s"

  # limit restarts by waiting for new typeserver(s) to quiesce
  wait {
    min = "60s"
    max = "90s"
  }

  # This allows us to mix consul-template with ansible template
  left_delimiter  = "[["
  right_delimiter = "]]"
}

Template:

[[/*
  marshal the available service names
  The service names we want look like "baz-*"
  Ultimately we want to create a single backend
  for all nodes with a matching service,
  plus a backend for each matching service.
*/]]
[[- $mode := "fubar" -]]
[[- $pattern := "\\bfubar\\b" ]]
[[- range $i, $service := services -]]
  [[- $serviceName := .Name -]]
  [[- $matches := $serviceName | regexMatch "^baz-[^\\(]+$" -]]
  [[- if $matches -]]
    [[- range $j, $tag := .Tags -]]
      [[ if $tag | regexMatch $pattern -]]
        [[- $serviceRef := printf "%s.%s" $tag $serviceName -]]
        [[/* identity map of tags, for backend-all */]]
        [[- scratch.MapSet "all" $tag $tag -]]
        [[/* map of version-specific backend names and tags.name ids */]]
        [[- scratch.MapSet "services" $serviceName $serviceRef ]]
      [[- end -]]
    [[- end -]]
  [[- else ]]
[[- end -]]
[[- end]]

frontend all
    bind /var/run/my-proxy.sock
    default_backend all

    option http-buffer-request
    declare capture request len 4000
    http-request capture req.body id 0
    capture cookie fubarhost= len 32

    [[- range $name, $ref := scratch.Get "services" -]]
    [[- $version := $name | regexReplaceAll "^baz-" "" ]]
    acl is-[[$version]] hdr(X-Fubar-Version) -i [[$version]]
    use_backend be-[[$version]] if is-[[$version]]
    [[- end]]

    # fallthrough, to handle bad version headers: must be last!
    acl has-version hdr(X-Fubar-Version) -m found
    use_backend be-bad-version if has-version

[[/*
  write out all the backend servers, ignoring consul health status,
  using `|any`
  This way the server list should not change frequently,
  even if the health check is flapping.
*/]]
backend all
    [[- range $i, $tag := scratch.MapValues "all" -]]
      [[- $servicePat := printf "%s.baz-all|any" $tag -]]
      [[- range $j, $node := service $servicePat ]]
        [[- $label := printf "all-%s" $node.Address | replaceAll "." "-" -]]
        [[- $hostport := printf "%s:%d" $node.Address $node.Port -]]
        [[- if scratch.Key $node.Address -]]
          [[/* we have already seen this address */]]
        [[- else -]]
          [[- scratch.Set $node.Address $hostport]]
    server [[$label]] [[$hostport]] check port 8081 cookie [[$hostport]]
        [[- end -]]
      [[- end -]]
    [[- end]]

[[ range $name, $ref := scratch.Get "services" -]]
[[- $servicePat := printf "%s|any" $ref -]]
[[- $serviceNodes := service $servicePat -]]
[[- $version := $name | regexReplaceAll "^baz-" "" ]]
backend be-[[$version]]
    [[- if lt (len $serviceNodes) 1 ]]
    mode http
    errorfile 503 /etc/haproxy/errors/not-compatible.http
    [[- else -]]
    [[- range $j, $node := $serviceNodes -]]
    [[- $track := printf "all-%s" $node.Address | replaceAll "." "-" -]]
    [[- $label := printf "be-%s-%s" $version $node.Address | replaceAll "." "-" ]]
    server [[$label]] [[$node.Address]]:[[$node.Port]] cookie [[$node.Address]]:[[$node.Port]] track all/[[$track]]
[[- end ]][[end]]
[[end]]

Command

/usr/local/bin/consul-template -config /etc/consul-template.conf

Debug output

https://gist.github.com/mblakele/7efefca93afe7b4e77cbbc886abc7b9c

Expected behavior

What should have happened?

Ideally, consul-template should run without errors. If the data is too large, maybe it could be broken up across multiple keys.

Another possible solution would be to ignore dedup if the data is too large, turning this error into a warning.
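
To make both suggestions concrete, here is a rough sketch against the public Consul Go API (github.com/hashicorp/consul/api). This is not how consul-template works today; putDedupData, the limit constant, and the chunked key layout are all assumptions:

package dedup

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

// kvValueLimit mirrors Consul's default 512 KB value limit.
const kvValueLimit = 512 * 1024

// putDedupData is a hypothetical helper: payloads over the limit are
// either split across numbered sub-keys or skipped with a warning,
// instead of failing the whole dedup update with a 413.
func putDedupData(kv *api.KV, prefix string, data []byte, chunk bool) error {
	if len(data) <= kvValueLimit {
		_, err := kv.Put(&api.KVPair{Key: prefix + "/data", Value: data}, nil)
		return err
	}
	if !chunk {
		// Alternative: degrade to a warning and render without dedup.
		log.Printf("[WARN] dedup data is %d bytes, over the %d byte limit; skipping", len(data), kvValueLimit)
		return nil
	}
	// Break the payload up into multiple keys: <prefix>/data/0, /1, ...
	for i := 0; len(data) > 0; i++ {
		n := kvValueLimit
		if len(data) < n {
			n = len(data)
		}
		key := fmt.Sprintf("%s/data/%d", prefix, i)
		if _, err := kv.Put(&api.KVPair{Key: key, Value: data[:n]}, nil); err != nil {
			return err
		}
		data = data[n:]
	}
	return nil
}

The chunked variant would also need a reader that lists and reassembles the numbered keys on the follower side, which is part of why this is only a sketch.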

Failing that, it would be nice if consul-template gave me more details about what's wrong. I'd like to see what's in this chunk of data so I can try to reduce the size.

There's no old value, so I can't look at that. Looking at other values in the KV store shows me that they're binary data, and I don't know how to decode or parse them.

$ /opt/consul/bin/consul kv get -detailed -base64 'consul-template/my-proxy/dedup/1442ff1543aaf681cfa72e1ef3401af6/data' 
Error! No key exists at: consul-template/my-proxy/dedup/1442ff1543aaf681cfa72e1ef3401af6/data

Any ideas for resolving this problem?

Incidentally, is there any cleanup process? I seem to have about 250 of these dedup values in consul.
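
I'm not aware of a built-in cleanup either, but a blunt manual sweep of the prefix is possible with the Consul Go API (the CLI equivalent is "consul kv delete -recurse consul-template/my-proxy/dedup/"). This assumes it's safe to delete the prefix while runners re-render, which I haven't verified:

package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	// Delete every key under the dedup prefix; the elected leader
	// should repopulate its current data on the next template pass.
	if _, err := client.KV().DeleteTree("consul-template/my-proxy/dedup/", nil); err != nil {
		log.Fatal(err)
	}
}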

Actual behavior

2018/08/31 21:45:19.605555 [ERR] (runner) failed to update dependency data for de-duplication: failed to write 'consul-template/my-proxy/dedup/1442ff1543aaf681cfa72e1ef3401af6/data': Unexpected response code: 413 (Value exceeds 524288 byte limit)

The Consul docs explain that there's a 512 KB limit on values in the KV store: https://www.consul.io/docs/faq.html
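
The limit is easy to confirm in isolation, independent of consul-template. A minimal probe against a local agent (default limits assumed) gets the same 413 back through the Go client:

package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	// One byte over the default 512 KB limit should be rejected with
	// "Unexpected response code: 413", matching the error log above.
	big := make([]byte, 512*1024+1)
	_, err = client.KV().Put(&api.KVPair{Key: "limit-test", Value: big}, nil)
	log.Println(err)
}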

Steps to reproduce

Reproducing this probably requires a large cluster. We're seeing it in one environment that has 300 instances registered for various services.

References

N/A

sodabrew commented 6 years ago

This also occurs at my site. The issue is that the healthcheck responses from each service are included in the deduplication body. There are other tickets where services that respond with, for example, a timestamp on the healthcheck page trigger continuous reloads even though the rendered template contents haven't changed.

What we did was patch out the healthcheck contents before storing the deduplication body.
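
The core of that patch is small. Here is a minimal sketch of the idea; the field names come from the Consul Go API, but where exactly consul-template would hook this in is an assumption:

package dedup

import "github.com/hashicorp/consul/api"

// stripCheckOutput blanks the verbose health-check output before the
// service entries are serialized into the dedup body, so large or
// dynamic check pages no longer bloat (or churn) the stored value.
func stripCheckOutput(entries []*api.ServiceEntry) {
	for _, entry := range entries {
		for _, check := range entry.Checks {
			check.Output = ""
		}
	}
}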

eikenb commented 5 years ago

Thanks for filing this issue.

If anyone has an idea how to replicate this in an isolated/test setup, it would be of great help. Thanks.

sodabrew commented 5 years ago

This can be replicated in two ways:

Option 1: A very large healthcheck page, for example the / URL of the target service when it produces the site's default output. Yes, this is bad form and it's better to have a dedicated healthcheck page, but it will trigger the same failure.

Option 2: A dynamic healthcheck page, for example one showing the date/time or the number of connections served by the process, will cause multiple copies of the healthcheck output to be stored in the dedup body. This might need a large number of services under check, but a few hundred is sufficient, and I would argue that this is not an excessive number of services to be watched by a single template. A toy endpoint that exercises both options is sketched below.
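
Something like the following toy check target (port, path, and padding size are arbitrary) exercises both options at once: every poll returns a body that is large and different from the last one. Register a few hundred services against it and point a deduplicated template at them:

package main

import (
	"fmt"
	"log"
	"net/http"
	"strings"
	"time"
)

func main() {
	http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		// Option 2: a body that changes on every poll.
		fmt.Fprintf(w, "ok %s\n", time.Now().Format(time.RFC3339Nano))
		// Option 1: a deliberately large body.
		fmt.Fprint(w, strings.Repeat("x", 16*1024))
	})
	log.Fatal(http.ListenAndServe(":8081", nil))
}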