
Nomad can't create consul token for consul-connect service #23728

Closed · daniel-at-matt3r closed this issue 3 months ago

daniel-at-matt3r commented 3 months ago

Nomad version

1.8.0

Operating system and Environment details

All nodes are Debian 12. Two federated Consul clusters run in different cloud providers, with ACLs enabled and replication enabled and working fine. Two federated Nomad clusters also run in different cloud providers.

Issue

When submitting a Nomad job with a consul-connect service using the command "nomad job run -region=gcp-us-west1 clearml-redis.hcl" on the Nomad cluster connected to the non-primary Consul cluster, the job fails with the error: "Task hook failed: consul_si_token: Unexpected response code: 500 (rpc error making call: Local tokens are disabled)"

Reproduction steps

The issue can be reproduced using the Consul CLI:

root@nomad-generic-clients-5lsz:~# consul acl token create -local -policy-name "consul-server-policy"
Failed to create new token: Unexpected response code: 500 (rpc error making call: Local tokens are disabled)
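
For comparison, creating the token without the -local flag should succeed from the secondary datacenter, since non-local ACL writes are forwarded to the primary. This is a quick check (not from the original report) to confirm the failure is specific to local tokens:

# Same host, same policy, but no -local flag: the request is
# forwarded to the primary datacenter instead of failing locally.
root@nomad-generic-clients-5lsz:~# consul acl token create -policy-name "consul-server-policy"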

Expected Result

Nomad job is running

Actual Result

The Nomad job fails to start because Nomad tries to obtain a local token on the non-primary federated Consul cluster.

Job file (if appropriate)

variable "nomad-namespace" {
  type        = string
  description = "Nomad namespace to deploy envoy proxy to"
  default     = "default"
}

job "clearml-redis" {
  datacenters = ["*"]
  namespace   = "${var.nomad-namespace}"

  constraint {
    attribute = "${node.class}"
    operator  = "regexp"
    value     = "generic"
  }

  group "clearml-redis" {

    network {
      mode = "bridge"
    }

    volume "redis-vol" {
      type            = "csi"
      attachment_mode = "file-system"
      access_mode     = "single-node-writer"
      read_only       = false
      source          = "redis-vol"
    }

    service {
      name = "clearml-redis"
      port = "6379"

      connect {
        sidecar_service {}
      }
    }

    task "clearml-redis" {
      driver = "docker"

      volume_mount {
        volume      = "redis-vol"
        destination = "/var/lib/redis/data"
        read_only   = false
      }

      config {
        image = "redis:7.2.4-bookworm"
      }

      template {
        data = <<EOH
REDIS_ROOT_PASSWORD = "{{with secret "kv_hyades/data/redis/credentials"}}{{.Data.data.redis_root_password}}{{end}}"
EOH

        destination = "secrets/redis.env"
        env         = true
      }

      resources {
        memory = 1024
        cpu    = 1000
      }
    }
  }
}

Based on my research, Nomad has been requesting local tokens since version 1.3.0 - https://support.hashicorp.com/hc/en-us/articles/17767098217235-How-Nomad-Manages-ACL-Tokens-Polices-for-Consul-Service-Mesh

But how should it obtain a token that is local on a non-primary cluster?
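
For context, in this legacy (pre-workload-identity) flow the Nomad servers mint the service identity (SI) tokens themselves, using the Consul token configured in the agent's consul block, per the article above. A minimal sketch; the address and token values are placeholders:

# Nomad server agent configuration (sketch); values are placeholders.
consul {
  address = "127.0.0.1:8500"

  # Nomad servers use this token to derive SI tokens for Connect
  # workloads; it needs acl:write privileges in Consul.
  token = "<consul-token-with-acl-write>"
}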

daniel-at-matt3r commented 3 months ago

Issue resolved. I missed enable_token_replication = true in the Consul ACL configuration.
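
For anyone hitting the same error, a minimal sketch of the ACL block on the secondary datacenter's Consul servers; the policy settings and the replication token are placeholders, not from this thread:

acl {
  enabled        = true
  default_policy = "deny"

  # Replicate tokens from the primary so local tokens can be
  # created and resolved in this datacenter.
  enable_token_replication = true

  tokens {
    # Token used to replicate ACL data from the primary datacenter.
    replication = "<replication-token>"
  }
}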

daniel-at-matt3r commented 3 months ago

Solved, see above.

tgross commented 3 months ago

Glad to hear you got it figured out @daniel-at-matt3r! And thank you for reporting how you fixed your issue -- that's very helpful when other people search issues.

neuroserve commented 3 months ago

Hi @daniel-at-matt3r, I have a similar setup. Did you set enable_token_replication = true in your primary as well as your secondary Consul cluster? And which version of Consul are you running, anyway?

daniel-at-matt3r commented 3 months ago

Hi @neuroserve, I added it to both the primary and the secondary DCs. But I think there is no need to have it in the primary; logically, the primary designation should already allow local token creation. I could be wrong, though. @tgross, correct me if I'm heading in the wrong direction. A side note: we're getting a very robust Nomad cluster, and I'm looking forward to Consul mesh gateways - very interesting tech.

neuroserve commented 3 months ago

In the tutorial https://developer.hashicorp.com/consul/docs/security/acl/acl-federated-datacenters?productSlug=consul&tutorialSlug=security-operations&tutorialSlug=access-control-replication-multiple-datacenters, enable_token_replication = true is set only in the primary Consul cluster.

I cannot set it in the secondaries after the datacenters have been joined. And I think I wasn't able to join as long as it was set in the secondaries. Can you tell me what Consul version you are using?

I have four Consul datacenters federated with two mesh gateways each (with four Nomad clusters "on top"). I detected that change in behaviour when I tried to configure auto_config in one of the secondaries (see https://discuss.hashicorp.com/t/auto-config-vs-acl-token-replication/67612).

neuroserve commented 2 months ago

@daniel-at-matt3r Could you please still tell us the Consul version you are using?

daniel-at-matt3r commented 2 months ago

@neuroserve, sorry, I missed your message. We are on Consul 1.19.1 and Nomad 1.8.0.

neuroserve commented 2 months ago

I'll try to set that up too, with "cross replication" enabled.

neuroserve commented 2 months ago

I cannot set up federated Consul clusters on version 1.19.1 with enable_token_replication = true enabled in both the primary and the secondary cluster. Neither setting it from the start nor setting one after the other works. Only setting it in the primary (and setting the replication token in the secondary) gives me stable replication. Tokens with global scope are replicated and can be created on the primary as well as on the secondary.

neuroserve commented 2 months ago

I get no token error when I try to run the above job. The tokens are created in Consul, but with local scope. Now I need a test job file that deploys services into both datacenters of the region and has them communicate via Consul Connect.

neuroserve commented 2 months ago

I can run the countdash job

job "countdash" {
  datacenters = ["prod1"]
  node_pool = "static-clients"

  group "api" {
    network {
      mode = "bridge"
    }

    service {
      name = "count-api"
      port = "9001"

      connect {
        sidecar_service {}
      }
    }

    task "web" {
      driver = "docker"

      config {
        image = "hashicorpdev/counter-api:v3"
      }
    }
  }

  group "dashboard" {
    network {
      mode = "bridge"

      port "http" {
        static = 9002
        to     = 9002
      }
    }

    service {
      name = "count-dashboard"
      port = "http"

      connect {
        sidecar_service {
          proxy {
            upstreams {
              destination_name = "count-api"
              local_bind_port  = 8080
            }
          }
        }
      }
    }

    task "dashboard" {
      driver = "docker"

      env {
        COUNTING_SERVICE_URL = "http://${NOMAD_UPSTREAM_ADDR_count_api}"
      }

      config {
        image = "hashicorpdev/counter-dashboard:v3"
      }
    }
  }
}

in the Nomad cluster "above" the primary Consul datacenter without problems. As soon as I change the datacenter to one of the secondaries, I can no longer start it (the Envoy proxy cannot be started). The error message is:

Task hook failed: envoy_bootstrap: error creating bootstrap configuration for Connect proxy sidecar: exit status 1; see: <https://developer.hashicorp.com/nomad/s/envoy-bootstrap-error>

Envoy logs this:

failed fetch proxy config from local agent: Unexpected response code: 403 (Permission denied: token with AccessorID 'primary-dc-down' lacks permission 'service:read' on "count-api-sidecar-proxy")

I guess that is due to the generated SI tokens being "local" rather than global, and therefore not being replicated by Consul's ACL replication.

Am I missing something, or is it no longer possible to run service-mesh-protected services with Nomad across separate cloud providers? Can someone explain how https://github.com/hashicorp/consul/issues/7381 was resolved? Do I have to change my config somewhere?
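
For the cross-datacenter test, a sketch of what the dashboard's upstream could look like when count-api runs in another Consul datacenter and traffic goes through the mesh gateways; the datacenter name "prod2" and the gateway mode are assumptions, not from this thread:

connect {
  sidecar_service {
    proxy {
      upstreams {
        destination_name = "count-api"
        local_bind_port  = 8080

        # Hypothetical secondary datacenter where count-api runs.
        datacenter = "prod2"

        # Route this upstream through the local mesh gateway.
        mesh_gateway {
          mode = "local"
        }
      }
    }
  }
}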