hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Add support for transparent authentication to the Task API #18125

Open aofei opened 1 year ago

aofei commented 1 year ago

Proposal

Given that #16872 is on the way and #16258 is likely planned, I'm thinking it might be a good idea to add support for transparent authentication to the Task API (aka ${NOMAD_SECRETS_DIR}/api.sock).

I'm not sure if this proposal is a good security practice. I just think it makes things easier.
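For context, this is roughly what a task has to do today to talk to the Task API: dial the Unix socket and attach its workload identity token to every request. The sketch below is only an illustration using Python's standard library. `UnixHTTPConnection` and `task_api_get` are hypothetical helper names, not part of any Nomad SDK, and it assumes the task has `identity { env = true }` so that NOMAD_TOKEN is set.

```python
import http.client
import os
import socket


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTP connection that dials a Unix domain socket instead of TCP."""

    def __init__(self, socket_path, timeout=10.0):
        # The host name is only used for the Host header; no TCP dial happens.
        super().__init__("localhost", timeout=timeout)
        self.socket_path = socket_path

    def connect(self):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.settimeout(self.timeout)
        sock.connect(self.socket_path)
        self.sock = sock


def task_api_get(path, socket_path=None, token=None):
    """GET `path` from the Task API and return (status, body bytes).

    Defaults assume the task runs under Nomad with NOMAD_SECRETS_DIR set
    and NOMAD_TOKEN exposed via `identity { env = true }`.
    """
    if socket_path is None:
        socket_path = os.path.join(os.environ["NOMAD_SECRETS_DIR"], "api.sock")
    if token is None:
        token = os.environ["NOMAD_TOKEN"]
    conn = UnixHTTPConnection(socket_path)
    try:
        # Today this header is mandatory; with transparent authentication
        # it could be dropped and the socket alone would identify the task.
        conn.request("GET", path, headers={"X-Nomad-Token": token})
        resp = conn.getresponse()
        return resp.status, resp.read()
    finally:
        conn.close()
```

The token handling is the part a program like Traefik cannot refresh without a restart. With transparent authentication only the socket path would be needed.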

Use-cases

For a couple of my current use cases for Workload Identity, #16258 is going to cause some trouble. For example, with #16258, every Workload Identity rotation will cause the Traefik job to restart. Unfortunately, that's the only way Traefik can use the rotated NOMAD_TOKEN (since Traefik has no reload mechanism).

I was thinking that with #16872, we could make Traefik take advantage of the Task API (https://github.com/traefik/traefik/issues/10044). By adding transparent authentication support to the Task API, we might be able to solve all such problems at once.

Think about this (assuming both #16872 and #16258 are fixed, along with https://github.com/traefik/traefik/issues/10044):

job "traefik" {
    group "traefik" {
        network {
            port "traefik_http" {}
            port "traefik_https" {}
        }

        task "traefik" {
            driver = "docker"
            config {
                image = "traefik:v3.0.0-beta3"
                args = ["traefik", "--configFile", "/local/traefik/config.yaml"]
                ports = ["traefik_http", "traefik_https"]
            }

            identity {
                api_sock    = true   # Add transparent authentication support to the Task API
                change_mode = "noop" # No need to restart the task
            }

            template {
                destination = "local/traefik/config.yaml"
                data = <<EOF
entryPoints:
  http:
    address: ":{{env "NOMAD_PORT_traefik_http"}}"
  https:
    address: ":{{env "NOMAD_PORT_traefik_https"}}"
    asDefault: true
    http:
      tls: {}
    http3: {}

providers:
  nomad:
    endpoint:
      # Just setting the address is enough, we don't need to set the token
      # because the Task API already has transparent authentication support.
      address: "unix://{{env "NOMAD_SECRETS_DIR"}}/api.sock"
    exposedByDefault: false
EOF
            }
        }
    }
}

Attempted Solutions

Perhaps add an api_sock option to the identity block:

identity {
    api_sock = true
}

tgross commented 1 year ago

Hi @aofei! As you might have guessed, the current behavior is intentional. Let me share the rationale from our internal design doc (here for HashiCorp folks):

The decision to require authentication via the Task API has both a functional and an aspirational purpose:

  • Functional: Requiring authentication allows the UDS [Unix Domain Socket] to always exist. Workloads may choose to use the Task API prior to unsetting their identity and executing an untrusted workload. Workloads such as proxies may choose to proxy the credentials supplied by users for some requests and use the workload identity for other requests.
  • Aspirational: Eventually I would like the Agent HTTP API to also require authentication on every request. This will take some future work to provide a migration path for existing clusters without ACLs enabled, but I think a way forward can be found with little friction (e.g., give the anonymous user management permissions in existing clusters, which is what having ACLs enabled implicitly does already). Requiring authentication for the Task API ensures users write code today for a future where all Nomad API access must be authenticated.

In fact a properly configured cluster, with a TLS-terminating proxy talking to the Task API, would not need any Agent HTTP endpoints available at all. The Task API would be the only entrypoint. See Future Work for details.

The Future Work section described there is:

Agent API UDS

Since the UI needs to proxy API requests from browsers to Nomad’s HTTP API, all Nomad API interactions could go through the standalone UI job listed above. Some users already send all HTTP requests to Nomad through a load balancer.

At that point there would be no reason for Nomad Client Agents to bind on any ports. UDS API access could be provided for system administrators, but all Nomad cluster operations could realistically be expected to go through a proxy using the Task API as its backend.

I think you're potentially onto an interesting problem with Identity Expiration and Rotation though. I'm going to tag @schmichael (who's actively working on that) and @mikenomitch as a heads up on this to see if they have thoughts on how we might approach this.

schmichael commented 1 year ago

Excellent writeup and use case @aofei. Thanks for pinging me Tim.

Workaround

socat, haproxy, or some other "simple" proxy could be used in a sidecar task to perform transparent authentication for other tasks in its network namespace. haproxy should allow reloading credentials via SIGHUP to work around Traefik's missing reload feature.

I mention this workaround not because it's nice -- it's not -- but because it should work until we get something official shipped. This workaround will continue to work in perpetuity for people who don't really mind a single purpose sidecar sitting around.
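To make the workaround concrete, here is a rough sketch of what such a single-purpose sidecar does, written with Python's standard library rather than socat or haproxy. It is an illustration, not a recommended implementation: `make_proxy_handler` and `UnixHTTPConnection` are hypothetical names, and the token file path is an assumption (for example a file written by the task's identity block). Instead of reloading on a signal, this version re-reads the token file on every request, so a rotated identity is picked up without restarting anything. It buffers whole responses, so streaming endpoints are not handled.

```python
import http.client
import socket
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Headers that describe a single hop and must not be forwarded as-is.
HOP_BY_HOP = {"connection", "keep-alive", "proxy-authenticate",
              "proxy-authorization", "te", "trailers",
              "transfer-encoding", "upgrade"}


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTP connection that dials a Unix domain socket instead of TCP."""

    def __init__(self, socket_path, timeout=10.0):
        super().__init__("localhost", timeout=timeout)
        self.socket_path = socket_path

    def connect(self):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.settimeout(self.timeout)
        sock.connect(self.socket_path)
        self.sock = sock


def make_proxy_handler(socket_path, token_path):
    """Build a handler class that forwards requests to the Task API socket."""

    class ProxyHandler(BaseHTTPRequestHandler):
        def _forward(self):
            try:
                # Re-read the token per request so rotation needs no restart.
                with open(token_path, encoding="utf-8") as f:
                    token = f.read().strip()
                length = int(self.headers.get("Content-Length") or 0)
                body = self.rfile.read(length) if length else None
                headers = {k: v for k, v in self.headers.items()
                           if k.lower() not in HOP_BY_HOP}
                # Only add the workload identity when the caller sent no
                # token, so user-supplied credentials pass through untouched.
                if self.headers.get("X-Nomad-Token") is None:
                    headers["X-Nomad-Token"] = token
                upstream = UnixHTTPConnection(socket_path)
                try:
                    upstream.request(self.command, self.path,
                                     body=body, headers=headers)
                    resp = upstream.getresponse()
                    status = resp.status
                    resp_headers = resp.getheaders()
                    payload = resp.read()
                finally:
                    upstream.close()
            except (OSError, http.client.HTTPException):
                self.send_error(502, "Task API unreachable")
                return
            self.send_response(status)
            for name, value in resp_headers:
                if name.lower() not in HOP_BY_HOP | {"content-length"}:
                    self.send_header(name, value)
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

        do_GET = do_POST = do_PUT = do_DELETE = _forward

        def log_message(self, fmt, *args):
            pass  # keep the sidecar quiet; real deployments may want logs

    return ProxyHandler
```

A sidecar would pass the handler to `ThreadingHTTPServer` on a port inside the shared network namespace and call `serve_forever()`. Other tasks then point at that port and never see a token.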

Secure by default

One quick note on the aspiration of making all Nomad requests authenticated: the JWKS endpoint added in #18035 will likely always need to be available unauthenticated (and potentially some other similar endpoints). However, I'd love to have to explicitly special case unauthenticated endpoints and have the default be auth'd.

Transparent Task API auth

Back to the subject at hand! Can we tie a Task API socket to an identity{}? I think so! +1 :shipit: and all that.

I like the opt-in design proposed so the default can continue to require explicit authentication. That's the only way to remain secure by default. However, by tying a unix socket to an identity, you're accomplishing the same goal from the user's perspective: they're explicitly authenticating, just via the jobspec instead of at runtime! Neat!

In some sense this is more secure as there's no sensitive material inside the container that could be exfiltrated! You can even statically detect this behavior since it's in a jobspec presumably on disk somewhere. Sentinel (in Nomad Enterprise) could be used to prevent (or enforce!) its use.

Not sure identity.api_sock=<bool> is the right HCL though. With #18123 we would have to validate that at most 1 identity was being used with the socket. I don't think allowing a single task to have multiple socket files makes sense. I'm sure someone could dream up a use case, but the workaround above could be used for the 1% of cases that need such a complex setup.

Right now I can't think of a reason you would want to use an alternate identity for your Task API, so maybe the validation is as simple as "only the default identity can have api_sock=true set."
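As a sketch of how small that validation could be, the following models a task's identities and rejects api_sock on anything but the default one. Everything here is hypothetical: the `Identity` type, the `validate_identities` function, and the assumption that the default identity is the one named "default" and that identity names are unique within a task. Under those assumptions the rule also guarantees that at most one identity is tied to the socket.

```python
from dataclasses import dataclass

DEFAULT_IDENTITY = "default"  # assumed name of a task's default identity


@dataclass(frozen=True)
class Identity:
    """Hypothetical model of one identity block in a task."""
    name: str = DEFAULT_IDENTITY
    api_sock: bool = False


def validate_identities(identities):
    """Return a list of error strings; an empty list means valid."""
    errors = []
    for ident in identities:
        if ident.api_sock and ident.name != DEFAULT_IDENTITY:
            errors.append(
                f'identity "{ident.name}": only the default identity '
                "can have api_sock = true"
            )
    return errors
```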

If we wanted to solve 2 issues with 1 HCL block, we could do something like this:

task_api {
  unix_socket      = true
  transparent_auth = false
}

where unix_socket allows disabling the unix socket altogether to fix #16436 and transparent_auth would allow transparently using the default identity (even if you don't have an identity{} block defined!) to fix this issue.

idk... just brainstorming. Those names are a bit awkward, and it's a shame it allows for the invalid unix_socket=false + transparent_auth=true state to be defined, although some validation could catch that.
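The validation mentioned above could be as small as the following sketch. The `TaskAPIConfig` type and `validate_task_api` function are hypothetical, and the defaults are an assumption taken from the values in the example block (socket on, transparent auth off), not a confirmed design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskAPIConfig:
    """Hypothetical model of the brainstormed task_api block."""
    unix_socket: bool = True        # assumed default: socket exists
    transparent_auth: bool = False  # assumed default: explicit auth required


def validate_task_api(cfg):
    """Return a list of error strings; an empty list means valid."""
    errors = []
    # Transparent auth is a property of the socket, so it is meaningless
    # when the socket itself has been disabled.
    if cfg.transparent_auth and not cfg.unix_socket:
        errors.append(
            "task_api: transparent_auth = true requires unix_socket = true"
        )
    return errors
```

Of the four combinations of the two booleans, only unix_socket=false with transparent_auth=true is rejected. The other three describe a usable configuration.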

schmichael commented 1 year ago

One more neat side effect of transparent auth with the default token: expiration, and therefore rotation, wouldn't be necessary! No secrets would leave Nomad, and the socket's identity would be valid as long as the allocation is non-terminal.

I feel like I might be overlooking some gotcha here because it almost seems too good to be true. :sweat_smile:

(Note that expiration, and therefore rotation, are absolutely necessary for identities used with Consul, Vault, and other 3rd parties as they're not able to perform the "is this for a valid alloc?" association that Nomad itself is. We must rely on all the OIDC-ish JWT and JWKS infrastructure there.)

aofei commented 1 year ago

Hi @tgross! Thanks for sharing the internal design docs, especially the Future Work section. I'm glad to know that Agent API UDS is already on your roadmap. I completely agree that client agents shouldn't bind to any port.

I'm actually in favor of the secure-by-default design. I haven't used the agent's HTTP endpoint since the Task API was introduced. And a while back, I made all requests go through the Task API. This was mainly because I encountered a bug that led to an access control bypass (which your team later identified as #16775). I thought I was the weird one, but it turns out you guys agree with this approach.

I must say, this use case could also benefit from the transparent authentication support.

Here's my current Nomad HTTP jobspec:

job "nomad-http" {
  group "nomad-http" {
    network {
      port "nomad_http" {}
    }
    task "nomad-http" {
      driver = "docker"
      config {
        image = "caddy"
        args = ["caddy", "run", "--config", "/local/caddy/Caddyfile"]
        ports = ["nomad_http"]
      }
      identity { env = true }
      template {
        destination = "local/caddy/Caddyfile"
        data = <<EOF
:{{env "NOMAD_PORT_nomad_http"}}
reverse_proxy unix//{{env "NOMAD_SECRETS_DIR"}}/api.sock {
  header_up +X-Nomad-Token "{{env "NOMAD_TOKEN"}}"
}
EOF
      }
    }
  }
}

The Task API rejects all requests without a token, which makes the Nomad UI unusable when served directly from it: the UI shows "Unauthorized" and there is no way to set the token in the browser. So, as you can see, I had to use a solution like Caddy to proxy requests for the Task API. This allowed me to set a default token for all requests without one.

However, with transparent authentication support, the implementation can be simplified to:

job "nomad-http" {
  group "nomad-http" {
    network {
      port "nomad_http" {}
    }
    task "nomad-http" {
      driver = "docker"
      config {
        image = "alpine/socat"
        args = [
          "TCP-LISTEN:${NOMAD_PORT_nomad_http},fork,reuseaddr",
          "UNIX-CONNECT:${NOMAD_SECRETS_DIR}/api.sock",
        ]
        ports = ["nomad_http"]
      }
    }
  }
}

Regarding this proposal, I initially forgot about #16436 and overlooked #18123. Given these two issues, it does seem that @schmichael's idea of introducing a new task_api block is indeed superior to introducing a new identity.api_sock option.