Consul Connect service health checks not accessible? #9907

Open evandam opened 3 years ago

evandam commented 3 years ago

Nomad version

Nomad v1.0.2 (4c1d4fc6a5823ebc8c3e748daec7b4fda3f11037)

Operating system and Environment details

Ubuntu 18.04

Issue

When a service binds a port locally (e.g. 127.0.0.1:8080), it seems that Consul health checks cannot reach it, and I'm unable to get options like expose or address_mode working.

I would expect this to be a pretty common approach, if I understand correctly, since binding locally avoids leaking ports that could be reached outside of Consul Connect. Could the guides/docs add steps for health checks to https://www.nomadproject.io/docs/integrations/consul-connect?

Reproduction steps

Using the following job, try adding expose = true or address_mode = "driver" to the check and note the errors.

With expose = true:

❯ nomad job run debug/python_http.hcl
Error submitting job: Unexpected response code: 500 (error in job mutator expose-check: unable to determine local service port for service check app->python-http->python-http-health)

This happens even if I pass port = "8080" in the check configuration.
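
For clarity, the failing combination looks like this. The error message suggests the expose mutator resolves the service-level port, so setting a port on the check doesn't help:

service {
  name = "python-http"
  port = "http" # named port label: the mutator cannot resolve this to a value

  check {
    type     = "http"
    name     = "python-http-health"
    path     = "/"
    expose   = true
    port     = "8080" # the error above persists even with this set
    interval = "10s"
    timeout  = "3s"
  }
}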

With address_mode = "driver":

The job is deployed, but the task fails with the following log:

failed to setup alloc: pre-run hook "group_services" failed: error getting address for check "python-http-health": cannot use address_mode="driver": no driver network exists

Job file (if appropriate)

job "python-http" {
  datacenters = ["kitchen"]

  group "app" {
    network {
      mode = "bridge"
      port "http" {}
    }

    task "python-http" {
      driver = "docker"

      config {
        image = "python:3"
        command = "python3"
        args = [
          "-m",
          "http.server",
          "-b",
          # bind to loopback so only the Connect sidecar can reach the app
          "127.0.0.1",
          "${NOMAD_PORT_http}",
        ]
      }

      env {
        PYTHONUNBUFFERED = "1"
      }

      resources {
        cpu = 20
        memory = 100
      }
    }

    service {
      name = "python-http"
      port = "http"

      check {
        type     = "http"
        name     = "python-http-health"
        path     = "/"
        interval = "10s"
        timeout  = "3s"
        # variants tried, each failing as described above:
        # address_mode = "driver"
        # expose       = true
      }

      connect {
        sidecar_service {}
      }
    }
  }
}

evandam commented 3 years ago

After a decent amount of trial and error, it looks like the issue is with named ports rather than hard-coded ports.

I'm not sure if this is a bug or expected behavior, but it's certainly confusing. Any chance the docs could capture this either way?
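
For anyone hitting the same error, a minimal sketch of the difference, assuming the task is changed to listen on a fixed 127.0.0.1:8080 instead of the dynamic ${NOMAD_PORT_http}:

service {
  name = "python-http"
  # port = "http"  # named port label: fails the expose-check job mutator
  port = "8080"    # hard-coded port value: expose = true works

  check {
    type     = "http"
    name     = "python-http-health"
    path     = "/"
    expose   = true
    interval = "10s"
    timeout  = "3s"
  }

  connect {
    sidecar_service {}
  }
}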

idrennanvmware commented 3 years ago

@evandam given you're running in the mesh, is there a reason you aren't using hard-coded ports? Since it's all internal, there's no chance of conflict. Here's an example of how we're doing it:

 group "<redacted>-group" {
    count = [[ .api.count ]]

    constraint {
      attribute = "${meta.general_compute_linux}"
      value     = "true"
    }

    network {
      mode = "bridge"
      # dynamic label reserved for the Envoy-exposed check listener
      port "exposed" {}
    }

    service {
      name         = "<redacted>"
      tags         = [ "http" ]
      # hard-coded port the app listens on; not a named label
      port         = "9090"

      check {
        expose   = true
        type     = "http"
        # the check targets the dynamically allocated "exposed" listener
        port     = "exposed"
        path     = "/hc"
        interval = "10s"
        timeout  = "5s"
      }

      connect {
        sidecar_service {
          proxy {}
        }
      }
    }

and our task (snipped)

   task "<redacted>" {
      driver = "docker"

      config {
        image        = "<redacted>"
        volumes      = [
          "local/overrides:/app/overrides"
        ]
        cpu_hard_limit = true
      }

      env {
        # the app binds 9090, matching the hard-coded service port above
        ASPNETCORE_URLS = "http://+:9090"
      }

      resources {
        cpu    = [[ .api.resources.cpu ]] # Mhz
        memory = [[ .api.resources.memory ]] # MB
      }
    }
  }

evandam commented 3 years ago

Hey @idrennanvmware, now that I know this is the issue, there's no hard requirement to use named ports, but I generally like them for readability. I also wouldn't have expected the behavior to differ between named and hard-coded ports, so it just seems like a point of confusion.

krishicks commented 3 years ago

Hey @evandam! Thanks for raising the issue.

What do you think about the following update?

The port in the service stanza is the port the API service listens on. The
Envoy proxy will automatically route traffic to that port inside the network
-namespace.
+namespace. Note that this cannot be a named port; it must be a hard-coded port
+value.

evandam commented 3 years ago

Sounds good to me, thanks!

xeroc commented 3 years ago

This explains my issue here.

Thanks for making it clear.

tgross commented 3 years ago

https://github.com/hashicorp/nomad/pull/10225 will fix the docs, and I'm going to keep this issue open as a feature request to fix the underlying behavior.

mircea-c commented 3 years ago

Any timeline on this fix at the moment? It's a real pain not being able to use dynamic ports in service definitions.

Oloremo commented 2 years ago

Any updates on this?

bradydean commented 1 year ago

I've noticed that dynamic port labels can now be used without causing any errors (granted, I still have errors, but I think they're unrelated). Is this expected now?

ElectroTiger commented 1 year ago

As of October 2023, the workaround documented here seems to enable usage of dynamic ports: https://discuss.hashicorp.com/t/port-mapping-with-nomad-and-consul-connect/16738/5
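
For reference, a sketch of what that workaround looks like, based on the linked post and Nomad's documented proxy expose block; the "healthcheck" label and port 8080 are illustrative and assume the app listens on a fixed port inside the bridge namespace:

group "app" {
  network {
    mode = "bridge"
    # dynamic label reserved for the Envoy-exposed check listener
    port "healthcheck" {}
  }

  service {
    name = "python-http"
    port = "8080" # the app's hard-coded port inside the namespace

    check {
      type     = "http"
      name     = "python-http-health"
      port     = "healthcheck" # the check hits the exposed listener
      path     = "/"
      interval = "10s"
      timeout  = "3s"
    }

    connect {
      sidecar_service {
        proxy {
          # manual equivalent of `expose = true` on the check
          expose {
            path {
              path            = "/"
              protocol        = "http"
              local_path_port = 8080
              listener_port   = "healthcheck"
            }
          }
        }
      }
    }
  }
}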