hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Deregistering sidecar does not deregister the destination service #9798

Open tjhiggins opened 3 years ago

tjhiggins commented 3 years ago

Overview of the Issue

Running Consul in a k8s cluster. Sometimes the sidecar's deregister call fails to run when a pod gets deleted. The sidecar is removed after the deregister_critical_service_after timeout, but the original service remains registered.

Reproduction Steps

Steps to reproduce this issue:

  1. Create a cluster with n client nodes and n server nodes
  2. Register a service with a sidecar (remove preStop deregister logic to easily reproduce)
  3. Delete pod
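
For context, the preStop deregister logic mentioned in step 2 is a pod lifecycle hook along these lines (a hypothetical sketch; the binary path and the two service IDs are illustrative assumptions, not taken from our actual manifests):

```yaml
# Hypothetical preStop hook that deregisters both service instances on pod
# shutdown. Removing this hook makes the stale-service problem easy to hit.
lifecycle:
  preStop:
    exec:
      command:
        - "/bin/sh"
        - "-ec"
        - |
          consul services deregister -id="${PROXY_SERVICE_ID}"
          consul services deregister -id="${SERVICE_ID}"
```

When the hook is skipped or fails, only the sidecar (which carries the deregister_critical_service_after check) is eventually cleaned up.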

Proposal

Option 1: Remove the service when its sidecar is deregistered

Option 2: Allow for an alias_service check to the service sidecar

I tried the following (a proxy alias check), but Consul reports that the sidecar service does not exist on the node:

services {
  id   = "${SERVICE_ID}"
  name = "test-service"
  address = "${POD_IP}"
  port = 8080
  meta = {
    pod-name = "${POD_NAME}"
  }
  enable_tag_override = true

  # Expected this to work
  checks {
    name = "Proxy Alias"
    alias_service = "${PROXY_SERVICE_ID}"
  }
}

services {
  id   = "${PROXY_SERVICE_ID}"
  name = "test-service-sidecar"
  kind = "connect-proxy"
  address = "${POD_IP}"
  port = 20000
  tags = []
  meta = {
    pod-name = "${POD_NAME}"
  }
  enable_tag_override = true

  proxy {
    destination_service_name = "test-service"
    destination_service_id = "${SERVICE_ID}"
    local_service_address = "127.0.0.1"
    local_service_port = 12345
    ${init_container.value.upstreams}
  }

  checks {
    name = "Proxy Public Listener"
    tcp = "${POD_IP}:20000"
    interval = "10s"
    # Set deregister_critical_service_after to be super low to reproduce
    deregister_critical_service_after = "10s"
  }

  checks {
    name = "Destination Alias"
    alias_service = "${SERVICE_ID}"
  }
}
blake commented 3 years ago

Hi @tjhiggins, what version of Consul and Consul-k8s are you using?

The latest version of consul-k8s (0.24.0) contains a new cleanup controller which I believe may help address this issue.

Connect: add new cleanup controller that runs in the connect-inject deployment. This controller cleans up Consul service instances that remain registered despite their pods being deleted. This could happen if the pod's preStop hook failed to execute for some reason. [GH-433]

tjhiggins commented 3 years ago

@blake Thanks for the quick response. I saw that - which is awesome, but I feel like this should be core functionality for non-k8s use-cases.

Unfortunately, we cannot use the connect-inject controller because we need to support exposing multiple ports, so we have custom Terraform that creates multiple Envoy proxies. We are also planning on doing something similar for ECS, where we wouldn't have access to a cleanup controller.

Edit: My workaround at the moment is to attach the "Proxy Public Listener" check to both services, but that isn't ideal.
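
For anyone hitting the same issue, the workaround amounts to duplicating the sidecar's TCP check on the destination service so that it also gets deregistered when the pod disappears. A minimal sketch, reusing the registration above (the IDs, port, and timeouts are the same illustrative placeholders as before):

```hcl
# Workaround sketch: give the destination service its own copy of the
# sidecar's TCP check, so deregister_critical_service_after applies to it too.
services {
  id      = "${SERVICE_ID}"
  name    = "test-service"
  address = "${POD_IP}"
  port    = 8080

  checks {
    name     = "Proxy Public Listener"
    tcp      = "${POD_IP}:20000"
    interval = "10s"
    # Same low threshold as on the sidecar, for reproduction purposes
    deregister_critical_service_after = "10s"
  }
}
```

The downside is that the destination service's health is now tied to the proxy's public listener rather than to the application itself.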