hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io
Other
28.41k stars 4.43k forks source link

error filling agent cache: Internal cache failure #8657

Open saez0pub opened 4 years ago

saez0pub commented 4 years ago

Overview of the Issue

Hello,

I have a nomad job with a connect enabled backend. sometimes, It produces an error log in consul agent :

error filling agent cache: Internal cache failure: service '_nomad-task-XXX' not in agent state

Reproduction Steps

Here is an easy reproduction method

$ nomad -version
Nomad v0.12.4 (8efaee4ba5e9727ab323aaba2ac91c2d7b572d84)
$ consul -version  
Consul v1.8.3
Revision a9322b9c7
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)

Create the example nomad jobs

cat <<EOF > backend.nomad
job "countdash" {
  datacenters = ["dc1"]

  group "api" {
    network {
      mode = "bridge"
    }

    service {
      name = "count-api"
      port = "9001"

      connect {
        sidecar_service {}
      }
    }

    task "web" {
      driver = "docker"

      config {
        image = "hashicorpnomad/counter-api:v1"
      }
    }
  }
}
EOF

cat <<EOF > batch_with_connect.nomad
job "batch_with_connect" {
  datacenters = ["dc1"]
  type = "batch"
  periodic {
    cron = "*/15 * * * * * *"
    prohibit_overlap = true
  }
  group "batch_with_connect" {
    network {
      mode = "bridge"
    }
    service {
      name = "count-dashboard"
      port = "9002"

      connect {
        sidecar_service {
          proxy {
            upstreams {
              destination_name = "count-api"
              local_bind_port  = 8080
            }
          }
        }
      }
    }
    task "echo" {
      driver = "docker"
      config {
        image = "debian:10"
        args = ["echo", "done"]
      }
    }
  }
}
EOF

Start consul

consul agent -dev -log-level=TRACE 2>&1 | tee consul.log

Start nomad

sudo nomad agent -dev-connect -log-level=TRACE 2>&1 | tee nomad.log

Start backend job on nomad (wait it is fully started)

$ nomad job run backend.nomad

$ [ -z "$(curl -s http://127.0.0.1:8500/v1/agent/checks | grep Status  | grep -v passing)" ] && echo OK
OK

Start batch job on nomad

$ nomad job run batch_with_connect.nomad

Wait an error in consul logs (job launches every 15 seconds)

grep "ERROR.*cache" consul.log

Consul info for both Client and Server

Consul info ``` agent: check_monitors = 0 check_ttls = 0 checks = 6 services = 6 build: prerelease = revision = a9322b9c version = 1.8.3 consul: acl = disabled bootstrap = false known_datacenters = 1 leader = true leader_addr = 127.0.0.1:8300 server = true raft: applied_index = 44 commit_index = 44 fsm_pending = 0 last_contact = 0 last_log_index = 44 last_log_term = 2 last_snapshot_index = 0 last_snapshot_term = 0 latest_configuration = [{Suffrage:Voter ID:506cf337-31df-4815-01b0-90606fc638f6 Address:127.0.0.1:8300}] latest_configuration_index = 0 num_peers = 0 protocol_version = 3 protocol_version_max = 3 protocol_version_min = 0 snapshot_version_max = 1 snapshot_version_min = 0 state = Leader term = 2 runtime: arch = amd64 cpu_count = 8 goroutines = 123 max_procs = 8 os = linux version = go1.14.7 serf_lan: coordinate_resets = 0 encrypted = false event_queue = 1 event_time = 2 failed = 0 health_score = 0 intent_queue = 0 left = 0 member_time = 1 members = 1 query_queue = 0 query_time = 1 serf_wan: coordinate_resets = 0 encrypted = false event_queue = 0 event_time = 1 failed = 0 health_score = 0 intent_queue = 0 left = 0 member_time = 1 members = 1 query_queue = 0 query_time = 1 ```

Operating system and Environment details

Linux

Log Fragments

Logs extract

    2020-09-11T09:59:31.292+0200 [TRACE] agent.proxycfg: resetting watches for discovery chain: service_id=_nomad-task-4f11511e-782f-27e0-1480-c64812a66525-group-batch_with_connect-count-dashboard-9002-sidecar-proxy id=count-api
    2020-09-11T09:59:31.292+0200 [TRACE] agent.proxycfg: initializing watch of target: service_id=_nomad-task-4f11511e-782f-27e0-1480-c64812a66525-group-batch_with_connect-count-dashboard-9002-sidecar-proxy upstream=count-api chain=count-api target=count-api.default.dc1 mesh-gateway-mode=
    2020-09-11T09:59:31.341+0200 [DEBUG] agent.http: Request finished: method=GET url=/v1/agent/service/_nomad-task-4f11511e-782f-27e0-1480-c64812a66525-group-batch_with_connect-count-dashboard-9002-sidecar-proxy from=127.0.0.1:36134 latency=203.509µs
    2020-09-11T09:59:32.223+0200 [DEBUG] agent: removed check: check=service:_nomad-task-4f11511e-782f-27e0-1480-c64812a66525-group-batch_with_connect-count-dashboard-9002-sidecar-proxy:1
    2020-09-11T09:59:32.223+0200 [ERROR] agent.proxycfg: watch error: service_id=_nomad-task-4f11511e-782f-27e0-1480-c64812a66525-group-batch_with_connect-count-dashboard-9002-sidecar-proxy id=service-http-checks:_nomad-task-4f11511e-782f-27e0-1480-c64812a66525-group-batch_with_connect-count-dashboard-9002 error="error filling agent cache: Internal cache failure: service '_nomad-task-4f11511e-782f-27e0-1480-c64812a66525-group-batch_with_connect-count-dashboard-9002' not in agent state"
    2020-09-11T09:59:32.223+0200 [DEBUG] agent: removed check: check=service:_nomad-task-4f11511e-782f-27e0-1480-c64812a66525-group-batch_with_connect-count-dashboard-9002-sidecar-proxy:2
    2020-09-11T09:59:32.223+0200 [DEBUG] agent: removed service: service=_nomad-task-4f11511e-782f-27e0-1480-c64812a66525-group-batch_with_connect-count-dashboard-9002-sidecar-proxy
    2020-09-11T09:59:32.223+0200 [ERROR] agent.proxycfg: watch error: service_id=_nomad-task-4f11511e-782f-27e0-1480-c64812a66525-group-batch_with_connect-count-dashboard-9002-sidecar-proxy id=service-http-checks:_nomad-task-4f11511e-782f-27e0-1480-c64812a66525-group-batch_with_connect-count-dashboard-9002 error="error filling agent cache: Internal cache failure: service '_nomad-task-4f11511e-782f-27e0-1480-c64812a66525-group-batch_with_connect-count-dashboard-9002' not in agent state"
    2020-09-11T09:59:32.223+0200 [INFO]  agent: Deregistered service: service=_nomad-task-4f11511e-782f-27e0-1480-c64812a66525-group-batch_with_connect-count-dashboard-9002-sidecar-proxy

Full logs

shoenig commented 4 years ago

Looks like another user is reporting similar symptoms over at https://github.com/hashicorp/nomad/issues/9307