@iamlittle If you're concerned about this happening, then you need to pass in -node-id to set each instance of Consul to have a unique node ID. I'd argue (strongly) that this is in fact the correct behavior: if you want to allow duplicate Consul nodes on the same host, you need to explicitly disambiguate them. Consul doesn't detect anything about running inside of a Kube environment, so if there is a better ID to pull from when Consul is running under Kube, lmk.
@sean- Thanks! I was looking for something like that in the docs. Guess I missed it.
We will add a note to the docs and maybe even the error message to help people find the -node-id option.
Something like -node-id=$(uuidgen | awk '{print tolower($0)}') added to the command line should get you a unique node ID.
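For example (a sketch; the data dirs and other flags here are just placeholders):

# each invocation generates its own lowercase UUID, so two agents on one host no longer collide
consul agent -server -data-dir=/tmp/consul-a -node-id=$(uuidgen | awk '{print tolower($0)}')
consul agent -server -data-dir=/tmp/consul-b -node-id=$(uuidgen | awk '{print tolower($0)}')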
@slackpad That sounds good, but I believe an option in consul to force the generation of the node id from another source would be very useful. I'm facing the same issue as @iamlittle. Your solution sounds reasonable, but it requires uuidgen to be available in the container. This is not the case when using the official consul docker images, for instance.
@mgiaccone cat /proc/sys/kernel/random/uuid will give you a UUID and is available in the Docker container.
@iamlittle Thanks, I just solved it with the same command
@mgiaccone that's fair - depending on how many people bump into this we may need to add an option to generate a uuid internally - we've got the code in there, it's just a tradeoff on adding more config complexity.
Is it me, or does -node-id=$(cat /proc/sys/kernel/random/uuid) not work yet in version 0.8.0?
Tried the following commands:
docker run -d --name consul-01 -e 'CONSUL_LOCAL_CONFIG={"skip_leave_on_interrupt": true}' consul agent -server -bind=0.0.0.0 -client=0.0.0.0 -retry-join=172.17.0.2 agent ip -bootstrap-expect=3 -node-id=$(cat /proc/sys/kernel/random/uuid)
docker run -d --name consul-02 -e 'CONSUL_LOCAL_CONFIG={"skip_leave_on_interrupt": true}' consul agent -server -bind=0.0.0.0 -client=0.0.0.0 -retry-join=172.17.0.2 agent ip -bootstrap-expect=3 -node-id=$(cat /proc/sys/kernel/random/uuid)
docker run -d --name consul-03 -e 'CONSUL_LOCAL_CONFIG={"skip_leave_on_interrupt": true}' consul agent -server -bind=0.0.0.0 -client=0.0.0.0 -retry-join=172.17.0.2 agent ip -bootstrap-expect=3 -node-id=$(cat /proc/sys/kernel/random/uuid)
This doesn't work for me.
[update] Sorry, my bad. I still had the words "agent ip" in the command-line. [/update]
@slackpad After having upgraded from 0.7.5 to 0.8.0, I have hit a bug of identical node ids. My use case is that I have LXD (LXC) containers and the dmidecode output from inside all the LXD containers is the same as that of the physical host.
These are long running LXD containers which can stop and start over time.
If I were to pass the -node-id parameter to the Consul startup, the node-id would be different on each startup. In such a case, would it matter? Or would it use the saved (persisted) node-id from the previous run?
For now, I have reverted to v0.7.5
Thanks and Regards, Shantanu
@slackpad
... answering my own question ... 😄
As expected, the changing node-id (as specified on the command line) does matter, but only until the health checks pass and the nodes de-register and re-register successfully.
For testing, if I restart the nodes (lxc containers) within a short span of time, I do see the message: "consul.fsm: EnsureRegistration failed: failed inserting node: node ID..." and then ... "member XXXXX left, deregistering"
The node joins in successfully after the health checks, so for me, things are working fine with v0.8.0 for now.
Regards, Shantanu
A "better" IMO way to set the node-id is with something like this:
cat /proc/sys/kernel/random/uuid > "$CONSUL_DATA_DIR"/node-id
and then start your consul agent/server as per usual (pre-0.8) practice. Reading the code, you can see that Consul first checks for the existence of this file before trying to generate a new node-id. That way, if you restart your container, it will keep a stable node-id.
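A minimal sketch, assuming $CONSUL_DATA_DIR points at your agent's -data-dir:

# only generate a node-id if one doesn't exist yet, so restarts keep the same ID
if [ ! -f "$CONSUL_DATA_DIR"/node-id ]; then
  cat /proc/sys/kernel/random/uuid > "$CONSUL_DATA_DIR"/node-id
fi
consul agent -server -data-dir="$CONSUL_DATA_DIR"

The exec of the agent is just illustrative; the point is the guard around the file creation.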
@mterron thanks! I have Ubuntu 14.04/16.04 LXD (LXC) containers.
I will have to come up with startup logic of "execute only once, if the node-id file doesn't exist" in the init script and the systemd equivalent, so that the node-id file gets generated only once!
It's straightforward for the 14.04 upstart script; I will check how to achieve it easily with the systemd equivalent 😦
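Maybe something like this untested sketch (unit name, paths, and binary location are illustrative):

# /etc/systemd/system/consul.service (fragment)
[Service]
# create the node-id file only on first start; later starts see it and skip
ExecStartPre=/bin/sh -c 'test -f /var/lib/consul/node-id || cat /proc/sys/kernel/random/uuid > /var/lib/consul/node-id'
ExecStart=/usr/local/bin/consul agent -server -data-dir=/var/lib/consul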
Thanks and Regards, Shantanu
Changing this to enhancement - I think we should add a configuration option to disable the host-based ID; Consul will then generate a random one internally if needed and save it to the data dir for persistence. This will make life easier for people trying to do this in Docker.
Thanks @slackpad Eagerly awaiting the next release! 🙌
What's the scenario where you want consul to use the boot_id as node id? Generating a random node id by default seems more intuitive but I'm sure I'm missing something here.
I mean, instead of having the -disable-host-node-id flag, I'd just add a -enable-host-node-id for the people that specifically need that behaviour.
@mterron Nomad uses the same host-based IDs, so it's nice to have the two sync by default (you can see where a job is running and go to the corresponding Consul node).
I've never used Nomad, so boot_id seemed like an arbitrary choice for a random identifier, but it sort of makes sense from a HashiCorp ecosystem point of view.
Two lines in the documentation should be enough to explain the default behaviour so that users are not surprised. Something like: "By default, Consul will use the machine boot_id (/proc/sys/kernel/random/boot_id) as the node-id. You can override this behaviour with the -disable-host-node-id flag or pass your own node-id using the -node-id flag."
Thanks for replying to a closed issue!
Hi @mterron we ended up adding something like that to the docs - https://www.consul.io/docs/agent/options.html#_node_id:
-node-id - Available in Consul 0.7.3 and later, this is a unique identifier for this node across all time, even if the name of the node or address changes. This must be in the form of a hex string, 36 characters long, such as adf4238a-882b-9ddc-4a9d-5b6758e4159e. If this isn't supplied, which is the most common case, then the agent will generate an identifier at startup and persist it in the data directory so that it will remain the same across agent restarts. Information from the host will be used to generate a deterministic node ID if possible, unless -disable-host-node-id is set to true.
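With that flag (a sketch; the data dir and other flags are placeholders, and the flag shipped in a release after 0.8.0):

# skip the host-derived ID; the agent generates a random one on first start
# and persists it in the data directory across restarts
consul agent -server -data-dir=/consul/data -disable-host-node-id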
consul version for both Client and Server:
Consul v0.8.0
consul info for both Client and Server:
Operating system and Environment details
Ubuntu 16.04.1 LTS, Kubernetes 1.5
Description of the Issue (and unexpected/desired result)
Trying to join containerized Consul servers on the same machine will throw an error due to /proc/sys/kernel/random/boot_id being identical across all containers on a host.
Reproduction steps
Running Consul 0.8.0 in a 3-pod replica set on a single-node Kubernetes cluster (development machine). Deployment definition
consul agent join X.X.X.X
throws the error described above. I believe this to be a result of #2700. In any case, 0.8.0 could cause serious problems in Kubernetes clusters if two Consul pods were to be scheduled on the same machine. This may not occur immediately.
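To confirm the root cause, run any two containers on the same host and compare their boot_id; both print the same value (alpine here is just a convenient throwaway image):

docker run --rm alpine cat /proc/sys/kernel/random/boot_id
docker run --rm alpine cat /proc/sys/kernel/random/boot_id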