beninghton opened 1 year ago
@tgross btw I tested this case with a 5-server Nomad cluster, and the behaviour is exactly the same. I don't think that's acceptable at all: a 5-server cluster should stay alive even after 2 node failures. I suspect we would see the same result with any number of server nodes. It makes no sense to add more nodes for better high availability, because this is not correct HA cluster behaviour.
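(For reference, this expectation matches the standard Raft quorum arithmetic; this note is general Raft behaviour, not something taken from the logs in this issue:

quorum(N) = floor(N/2) + 1
quorum(3) = 2  -> tolerates 1 failed server
quorum(5) = 3  -> tolerates 2 failed servers

So a healthy 5-server cluster should indeed keep electing a leader with 2 servers down.)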
Hi @beninghton and thanks for raising this issue.
We expected that a new leader would be elected and the cluster would not fail.
Yes, I would agree, and that should be the behaviour you experience. The server logs you have included contain only the error messages. At a glance these look expected, since the server is unable to connect to the blocked server within the gossip pool. Do you have these logs available in debug mode, so we can see what other actions the server is performing? Without additional information like this, it's hard to understand what the problem might be.
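For reference, debug-level logs can be captured by setting the standard agent options (the same ones that appear in the configuration shared later in this thread):

log_level    = "DEBUG"
enable_debug = true

or, without a restart, by streaming them from a running agent with nomad monitor -log-level=DEBUG. (Standard Nomad options and CLI, mentioned here as a suggestion rather than part of the original exchange.)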
If possible, could you also share the configuration you are using for the servers and clients, redacted as required? I'll focus communication on this issue for now, as I expect the information provided will also be relevant to https://github.com/hashicorp/nomad/issues/17974
@jrasell Unfortunately we didn't have the "DEBUG" log level enabled, only the default "INFO". I've now enabled debug, and I'm going to share the server configuration. This is the server.hcl config we use:
server {
  enabled          = true
  bootstrap_expect = 3
  heartbeat_grace  = "30s"
}

acl {
  enabled = true
}

vault {
  enabled          = true
  address          = "http://vault:8200"
  create_from_role = "nomad-cluster"
  token            = "secret_token"
}

log_level    = "DEBUG"
enable_debug = true
And this is nomad.hcl:
datacenter = "dc1"
data_dir   = "/opt/nomad/data"

telemetry {
  collection_interval        = "1s"
  disable_hostname           = true
  prometheus_metrics         = true
  publish_allocation_metrics = true
  publish_node_metrics       = true
}
And this is the client configuration, client.hcl:
bind_addr = "{{ GetInterfaceIP \"ens160\" }}"

client {
  enabled    = true
  node_class = "common"
}

plugin "docker" {
  config {
    allow_privileged = true

    volumes {
      enabled = true
    }

    auth {
      config = "/opt/nomad/data/docker-auth.json"
    }

    allow_caps = [
      "CHOWN",
      "DAC_OVERRIDE",
      "FSETID",
      "FOWNER",
      "MKNOD",
      "SETGID",
      "SETUID",
      "SETFCAP",
      "SETPCAP",
      "NET_BIND_SERVICE",
      "SYS_CHROOT",
      "KILL",
      "AUDIT_WRITE",
      "NET_RAW"
    ]

    extra_labels = [
      "job_id",
      "job_name",
      "task_group_name",
      "task_name",
      "namespace",
      "node_id",
      "node_name"
    ]
  }
}

plugin "raw_exec" {
  config {
    enabled = false
  }
}

acl {
  enabled = true
}

vault {
  enabled = true
  address = "http://vault:8200"
}
This information is relevant to https://github.com/hashicorp/nomad/issues/17974 and https://github.com/hashicorp/nomad/issues/18063. By the way, I'm sure you will easily reproduce this problem in your environment using the iptables command I posted.
Same here. Tested on Nomad versions 1.8.4 and 1.9.1.
Super easy to reproduce: go to the leader node and run this one-liner:
export SLEEP=60 ; \
iptables -I INPUT 1 -p tcp --match multiport --dports 4646:4648 -j REJECT ; \
iptables -I INPUT 1 -p udp --match multiport --dports 4646:4648 -j REJECT ; \
sleep $SLEEP ; \
iptables -D INPUT -p tcp --match multiport --dports 4646:4648 -j REJECT ; \
iptables -D INPUT -p udp --match multiport --dports 4646:4648 -j REJECT
It blocks inbound ports 4646-4648 for 60 seconds and then unblocks them (deletes the rules). No election happens. And most critically, all allocations in the cluster are recreated once the above command completes.
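A quick way to watch for failover during the blocked window is to run the standard Nomad CLI from one of the other servers (a suggested check, not part of the original report):

nomad server members
nomad operator raft list-peers

If failover worked as expected, a different server should be reported as leader well before the 60 seconds are up.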
It is worth saying that if I block both INPUT and OUTPUT, the election does happen: a new leader is elected and all is good. If only INPUT is blocked for at least 60s, then it is a big problem.
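A sketch of the INPUT+OUTPUT variant, simply mirroring the rules above on the OUTPUT chain (the exact rules used are not shown in this comment, so treat this as an approximation):

iptables -I OUTPUT 1 -p tcp --match multiport --dports 4646:4648 -j REJECT
iptables -I OUTPUT 1 -p udp --match multiport --dports 4646:4648 -j REJECT

with the matching iptables -D OUTPUT ... lines after the sleep to clean up.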
You may think this is an unrealistic scenario, but no: it really happened on Azure Cloud by itself, when inbound traffic to the VM that was the leader somehow got blocked and no election happened for hours... (
Nomad version
Nomad v1.5.6
BuildDate 2023-05-19T18:26:13Z
Revision 8af70885c02ab921dedbdf6bc406a1e886866f80
Cluster structure
3 master nodes: 10.1.15.21 (leader), 10.1.15.22, 10.1.15.23
2 client nodes: 10.1.15.31, 10.1.15.32
3 Consul cluster nodes: 10.1.15.11, 10.1.15.12, 10.1.15.13
Operating system and Environment details
Fedora release 35 (Thirty Five)
Issue
We had an issue with our Nomad cluster: the master nodes had high RAM/CPU consumption, and in the logs we saw only these CSI-related errors:
[ERROR] nomad.csi_plugin: csi raft apply failed: error="plugin in use" method=delete
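For anyone hitting the same error, the claim state behind "plugin in use" can be inspected with the standard Nomad CLI (a diagnostic suggestion, not something we ran at the time):

nomad plugin status -type=csi
nomad volume status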
Then the leader node hung completely, we saw that the cluster failed, and the client nodes went down. We saw in the logs that no new leader was elected.
Reproduction steps
We had to re-create the failed Nomad master from scratch and remove CSI from our cluster. But we managed to reproduce this issue another way: we just closed port 4647 on the master leader node:
iptables -A INPUT -p tcp --destination-port 4647 -j DROP
We assume this imitates the issue we had with CSI, because it blocks not all, but only part of the master node's functionality, which we think is similar to what happened when CSI hung our leader node.
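(For anyone repeating this test, the rule can be removed afterwards with the matching delete; standard iptables usage, not part of the original report:

iptables -D INPUT -p tcp --destination-port 4647 -j DROP
)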
Expected Result
We expected that a new leader would be elected and the cluster would not fail.
Actual Result
A new leader was not elected, and the client nodes went down.
nomad server members output on the leader node (where port 4647 is blocked): (output attached)
nomad server members output on a non-leader node: (output attached)
Nomad logs
server1-leader.log server2.log server3.log client1.log client2.log