hashicorp / terraform-provider-consul

Terraform Consul provider
https://www.terraform.io/docs/providers/consul/
Mozilla Public License 2.0
125 stars 113 forks source link

[Feature] resource "consul_node_token" #197

Open matthiasng opened 4 years ago

matthiasng commented 4 years ago

It would be nice to have better ACL support for the bootstrap phase.

In #95, remilapeyre already pointed out the problem of consul acl boostrap. Apart from the problem of managing the acl bootstrap command, there is no way to set a terraform managed token, as agent default/agent/master token.

It is already possible to set agent tokens with v1/agent/token/[default|agent|agent_master|replication] but not to read them. So this feature depends on changes in consul itself.

@remilapeyre What do you think ? When you say the new resource would be useful, i will start to get the required changes into consul.

remilapeyre commented 4 years ago

Thanks, I see what it does now. I think it may be complicated to accept a resource that does this thought:

First the token is specific to each agent so I think the consul_node_token would need to be repeated for each agent in the cluster and each one of them would need to be able to be contacted by Terraform which is not always possible (all other resources are set globally in the cluster by making calls directly to a Consul server).

Also I think this would result in all secret tokens being saved in the Terraform state, which we are looking to avoid. For example the consul_acl_token resource does not return the secret id, and while you can read it using the consul_acl_token_secret_id datasource there is a huge disclaimer that you should not use this unless you are absolutely sure you can't avoid it.

Does this seem correct?

matthiasng commented 4 years ago

First the token is specific to each agent so I think the consul_node_token would need to be repeated for each agent in the cluster and each one of them would need to be able to be contacted by Terraform which is not always possible (all other resources are set globally in the cluster by making calls directly to a Consul server).

That is something I havent thought about. I work with a small cluster where I have access to all the nodes, so that is no problem for me. And you are right, you need a provider with alias for every node.

Also I think this would result in all secret tokens being saved in the Terraform state, which we are looking to avoid. For example the consul_acl_token resource does not return the secret id, and while you can read it using the consul_acl_token_secret_id datasource there is a huge disclaimer that you should not use this unless you are absolutely sure you can't avoid it.

This souldn't be a problem. In my draft the resource only accepts accessor_id's. Of course, your provider token requires access to resolve the accessor_id to a secert_id. This way i can set the secert_id, without writing it to the state. (Same as "consul_acl_token_policy_attachment")

remilapeyre commented 4 years ago

I work with a small cluster where I have access to all the nodes, so that is no problem for me.

While it won't be useful in many deployment, I understand that I can be very useful in some cases so I guess it's a legitimate resource.

This souldn't be a problem. In my draft the resource only accepts accessor_id's.

Oh right, that's good news.

You can set Update to nil since thee resource won't be updatable, doing nothing in Read and Delete could be acceptable. Some other resources like aws_db_instance do set some of there fields when reading, I don't know if some set none of there fields thought I will have a look at the other providers to see if they have a resource like this one.

you need a provider with alias for every node.

I think you should be able to avoid this by adding an optional address field:

provider "consul" {
  address = "consul-0.local"
}

resource "consul_acl_policy" "agent" {
  name  = "agent"
  rules = <<-RULE
    node_prefix "" {
      policy = "write"
    }
    RULE
}

resource "consul_acl_token" "agent_token" {
  description = "my test token"
  policies = [consul_acl_policy.agent.name]
}

# Set the agent token of consul-0.local
resource "consul_node_token" "agent_token_0" {
  type        = "agent"
  accessor_id = consul_acl_token.agent_token.accessor_id
}

# Set the agent token of consul-1.local
resource "consul_node_token" "agent_token_1" {
  address  = "consul-1.local"
  type        = "agent"
  accessor_id = consul_acl_token.agent_token.accessor_id
}

it would make it more flexible and a bit more useful.

Regarding the name it seems to me that consul_agent_token would be more in line with the HTTP API and this new resource would also need documentation.

matthiasng commented 4 years ago

I have done some futher testing based on your suggestions and i like the result. With an additional address field, it's possible to manage all agent tokens in less than 30 lines.

Example:

provider "consul" {
  address = "localhost:8500"
}

data "consul_nodes" "all" {}

resource "consul_acl_policy" "agent_policies" {
  count = length(data.consul_nodes.all.nodes)

  name  = "agent-token-${data.consul_nodes.all.nodes[count.index].name}"
  rules = <<-RULE
    node "${data.consul_nodes.all.nodes[count.index].name}" {
      policy = "write"
    }
  RULE
}

resource "consul_acl_token" "agent_tokens" {
  count = length(data.consul_nodes.all.nodes)

  description = "node policy for ${data.consul_nodes.all.nodes[count.index].name}"
  policies    = [consul_acl_policy.agent_policies[count.index].name]
}

resource "consul_agent_token" "agent_tokens" {
  count = length(data.consul_nodes.all.nodes)

  type        = "agent"
  address     = "${data.consul_nodes.all.nodes[count.index].address}:8500"
  accessor_id = consul_acl_token.agent_tokens[count.index].accessor_id
}

I also thought about the problem you metioned, that every consul agent must be contacted by Terraform. Im interested in considering this. It's no problem for my current cluster, but as soon as we move some services to an external cluster, this will become a problem. I also like to support as much use cases as possible.

After some research, i'm still not sure if consul can handle this with some changes. I will create a consul feature request to get some feedback, before continuing.

remilapeyre commented 4 years ago

as soon as we move some services to an external cluster, this will become a problem. I also like to support as much use cases as possible.

I don't think you can support every use-cases here. Consul support sparse connectivity (https://www.consul.io/docs/enterprise/federation) for weird network topologies that will be complicated to support.

I will create a consul feature request to get some feedback, before continuing.

What do you have in mind?

matthiasng commented 4 years ago

I also like to support as much use cases as possible.

Maybe a bit too much.

I thought about something like:

PUT v1/agent/token/agent
{
    "token": "..."
    "node": "<node-name>"
}

For Terraform, then the resource looks "global". Am i right ?

remilapeyre commented 4 years ago

Yes but how would you know who is <node-name>? You can't know who it is before some consul client connected to the server using a token with the appropriate permissions, so you will have a chicken and egg problem here.

matthiasng commented 4 years ago

As soon as a new agent starts (acl: enabled, no agent token configured), it will be register as new consul member (visible in server logs and ui).

==> Consul agent running!
    2020-05-28T20:12:15.523Z [INFO]  agent.client.serf.lan: serf: EventMemberJoin: 29c90c73c1ee 172.26.0.3
    ...
    2020-05-28T20:12:15.523Z [INFO]  agent.client: adding server: server="29c90c73c1ee (Addr: tcp/172.26.0.3:8300) (DC: dc1)"
    2020-05-28T20:12:15.523Z [INFO]  agent: (LAN) joined: number_of_nodes=1
    ....
    2020-05-28T20:12:16.742Z [ERROR] agent.client: RPC failed to server: method=Catalog.Register server=172.26.0.3:8300 error="rpc error making call: rpc error making call: Permission denied"

Joining nodes is handled by Serf. And i think Serf is ACL indepentent ? Also note the Permission denied sync error.

As soon as the node knows a server, we can call v1/agent/token/... with, for example, our global management token. The token sould have a policy with node_key "".

$ curl -XPUT -d '{"token": "<agent-token>"}' -H "X-Consul-Token: <global-token>" http://consul-1:8500/v1/agent/token/agent 

consul log:
    2020-05-28T20:33:34.354Z [INFO]  agent: Updated agent's ACL token: token=agent
    2020-05-28T20:33:36.356Z [INFO]  agent: Synced node info

Am i right or do i miss something ? It looks strange but thats my result from a short manual test.

remilapeyre commented 4 years ago

Am i right or do i miss something ?

I was wrong, I just tried on a test cluster and even with the ACL to deny and the client can register. That's a bit weird I think, doesn't it mean that anybody can try to masquerade as a client?

It looks like there is a lot of changes coming in this space in the next months, we should probably wait before merging anything. In the meantime you should probably be able to use the resource you wrote on your cluster without issue. If you want me to review it, please push it on the PR you already created and I will tell you if I see anything out of place :)

matthiasng commented 4 years ago

I have updated the PR (tests are missing).

I also added some docs just in case someone else want to use the provider until consul is ready to implement a better solution.

I was wrong, I just tried on a test cluster and even with the ACL to deny and the client can register. That's a bit weird I think, doesn't it mean that anybody can try to masquerade as a client?

I totally agree. It also looks like 1.8 beta behaves differently. Nodes are visible through the UI but the catalog API doesn't return them. Maybe I stumbled over a bug here or the behaviour just has changed.