mrlesmithjr / ansible-consul

Ansible role to install/configure Consul
MIT License

Proposal2: rework encryption setup, service and binary ownership #27

Closed jardleex closed 7 years ago

jardleex commented 7 years ago

Hello @mrlesmithjr

While working with consul and your role, some further ideas for changes came up. I'd like to have your opinion on them:

Rework encryption key setup

I've deployed two consul clusters in two different locations and joined them into a WAN gossip pool. While doing so I noticed that both clusters need to either (A) know the encryption key of the remote cluster or (B) use one and the same encryption key. Our maximum deployment would cover ~9 locations, which would lead to either (A) 9 encryption keys required in each location or (B) the requirement to be able to specify ONE key for all clusters. As (A) would not look very nice in configuration management, I would at least give the role user the option to define their own encryption key. While doing so I may rework cluster.yml to make it a bit shorter/simpler.

Run consul service as consul user

Currently the consul service is executed as root. I'd like to check whether it can run as the consul user. The role default will remain root, but the user may override that in their variables.

Allow consul command execution only to root

I'd also like to give the user the option to deploy the consul binary owned by the consul user and let only that user and root execute it. In large clusters where the consul members list is used as host inventory, not every system user or intruder needs to know who's alive around their system.
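To illustrate the permission scheme proposed here, a minimal Python sketch follows. It is not part of the role, which would do this with Ansible. It assumes the binary is owned by a `consul` user and uses mode `0700`, so only the owner can run it, plus root, who bypasses permission checks. The helper name `restrict_binary` is hypothetical.

```python
import os
import shutil
import stat
import tempfile


def restrict_binary(path, owner=None, mode=0o700):
    """Limit execution of `path` to its owner (root can still run it).

    `owner` is optional because changing ownership needs root privileges.
    Returns the permission bits that are set afterwards.
    """
    if owner is not None:
        # e.g. owner="consul" (assumed user name, must exist on the host)
        shutil.chown(path, user=owner)
    os.chmod(path, mode)
    return stat.S_IMODE(os.stat(path).st_mode)


if __name__ == "__main__":
    # Demo on a throwaway file instead of a real consul binary.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        demo_path = tmp.name
    try:
        print(oct(restrict_binary(demo_path)))
    finally:
        os.remove(demo_path)
```

An alternative with the same effect would be owner root, group consul, and mode `0750`; which one fits better depends on how the service user is set up.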

What do you think about these three topics?

Best regards

Jard

mrlesmithjr commented 7 years ago

@jardleex Off the top of my head, regarding encryption keys, I definitely see benefits in both. As you mentioned, having only one key would make things easier, but at the same time it would not be much work to define separate keys for each location based on group_vars and gather those keys during a role execution.

Definitely agree on the remaining two recommendations. Running consul as root is bad and I should have caught that before. Limiting who can execute the consul binary would also make a lot of sense.

jardleex commented 7 years ago

Hi,

I dug a bit deeper into the encryption key thing.

What I've learned so far is that passing an encryption key in config.json is required only once: when the agent has not joined a cluster yet. This is the case when bootstrapping a new cluster or joining an agent to an existing cluster.

Changing consul_encryption_key after the cluster has been bootstrapped has no effect on the running nodes at all. Each time you restart an agent you'll get these lines in its log:

...
May 10 10:59:08 consul1 consul[26841]: ==> WARNING: LAN keyring exists but -encrypt given, using keyring
May 10 10:59:08 consul1 consul[26841]: ==> WARNING: WAN keyring exists but -encrypt given, using keyring
...

This tells me that the agent does not care about any encryption key set in config.json. It only uses the keys in /var/consul/serf/local.keyring and /var/consul/serf/remote.keyring (the latter only on servers), which are managed by the consul keyring commands. So changing the encryption key is only possible via the consul keyring commands or by totally destroying and rebuilding the cluster with a new key.
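The precedence described above can be sketched in a few lines of Python. This is only an illustration, not how the role or consul implements it. It assumes that local.keyring is a JSON list of base64-encoded keys with the primary key first; the function name `effective_key` is hypothetical.

```python
import json


def effective_key(config_encrypt, keyring_text):
    """Return (source, key) describing which gossip key an agent would use.

    keyring_text is the content of local.keyring, or None if the file
    does not exist yet (the agent has never joined a cluster).
    Assumption: the file is a JSON list of base64 keys, primary key first.
    """
    if keyring_text:
        keys = json.loads(keyring_text)
        if keys:
            # A keyring exists, so the key from config.json is ignored.
            return ("keyring", keys[0])
    # No keyring yet: the key from config.json seeds it on first start.
    return ("config", config_encrypt)


if __name__ == "__main__":
    print(effective_key("new-key", '["old-key"]'))
    print(effective_key("new-key", None))
```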

My first idea was to give the user the option to provide an encryption key and to generate one if none was given. But the auto-generation, plus checking that the same key is deployed on new servers as well, became more and more complex in my mind.

Proposal

Provide a default consul_encryption_key and let the user overwrite it if desired. This provides a stable base for key handling and reduces role complexity, since automatic key generation and distribution to new servers is no longer needed. If the user wants to have other/new keys, they have to use the consul keyring commands while updating their consul_encryption_key.

In addition, the role will check whether the deployed key is present in /var/consul/serf/local.keyring. If not, it will show a short notice that the key in config.json is not used at all by the agent. This may lead to errors if new nodes want to join the cluster.
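As a rough sketch of that check, the logic could look like the following Python. It again assumes that local.keyring is a JSON list of base64-encoded keys. The function name `keyring_notice` and the wording of the notice are made up for illustration.

```python
import json


def keyring_notice(configured_key, keyring_text):
    """Return a notice if configured_key is missing from the keyring, else None.

    Assumption: keyring_text is the content of local.keyring, a JSON list
    of base64-encoded keys.
    """
    keys = json.loads(keyring_text)
    if configured_key in keys:
        return None
    return (
        "notice: the key in config.json is not in local.keyring and is "
        "ignored by the running agent; new nodes may fail to join"
    )


if __name__ == "__main__":
    print(keyring_notice("key-b", '["key-a"]'))
```

In the role itself this would more likely be a task that reads the file on the host and prints a debug message, but the comparison stays the same.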

So, what do you think? Go for the default key to keep it simple? Or try the fully automated way with a minor security improvement and a much higher role complexity?

Best

Jard

jardleex commented 7 years ago

Update on the encryption topic:

Today I intentionally crashed one server of my three-node setup and stumbled upon the fact that I could not rejoin it to the cluster.

Story

After the cluster was formed I set up another three-node cluster running with a different consul_datacenter to test the WAN gossip functions. For each cluster the role generated a dedicated encryption key. To join them together in a WAN pool I replaced the key in the second cluster with the one from the first and joined all 6 servers. Afterwards I deleted the initial key on the second cluster. Then I crashed the 3rd server of the second cluster, which could no longer join its local cluster.

Technical background

The role told the crashed/rebuilt node to use the encryption key stored in /etc/consul.key on the consul_master, which was still the initial key that is no longer in use. As the role could not know about the key change, it could not configure the node correctly. So that's a gap in the automatic keygen approach.

Solution thoughts

As long as a consul WAN federation uses one and the same key, it's relatively easy to find it on a functional member (just slurp /var/consul/serf/local.keyring). But I also have another test setup which uses different keys per location, where each location knows (and needs to know) all other locations' keys. The second test setup is currently broken and I can't investigate further. So far I haven't found a way to fetch the primary key of a location and pass it to upcoming nodes.

Until I fix the second test setup and can run more tests, I see only the 'statically' defined encryption key as a solution. But I have no idea yet how to tell the role user that the current cluster is running with a different primary key than the one provided in their vars.

While researching I found this HashiCorp issue, which explains why THE primary key is needed to join new members.

Best

Jard

jardleex commented 7 years ago

Okay, the second setup is fixed and I could do more tests. I found a way to make all our lives easier. The official documentation on consul keyring says that consul supports multiple keys at a time, BUT this is intended to be a transition state and not for normal operations.

So we'll only need to care about one single key at a time in all locations.
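For reference, the transition state mentioned above corresponds to a key rotation with the consul keyring subcommands (-install, -use, -remove, -list). The Python sketch below only builds the command lines in a sensible order and does not run them. The helper name `rotation_commands` and the placeholder keys are hypothetical.

```python
def rotation_commands(old_key, new_key):
    """Build the consul keyring calls for replacing old_key with new_key."""
    return [
        ["consul", "keyring", "-install", new_key],  # distribute the new key
        ["consul", "keyring", "-use", new_key],      # make it the primary key
        ["consul", "keyring", "-remove", old_key],   # end the transition state
        ["consul", "keyring", "-list"],              # check which keys remain
    ]


if __name__ == "__main__":
    for argv in rotation_commands("OLD_KEY_BASE64", "NEW_KEY_BASE64"):
        print(" ".join(argv))
```

Between the install and the remove step the cluster holds two keys, which is the only time more than one key should be present.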

jardleex commented 7 years ago

Closing this as #30 has been merged.