Problems with connection to Nomad master (leader) causes whole cluster to fail

beninghton commented 1 year ago

Nomad version

Nomad v1.5.6 BuildDate 2023-05-19T18:26:13Z Revision 8af70885c02ab921dedbdf6bc406a1e886866f80

Cluster structure

3 master nodes: 10.1.15.21 - leader 10.1.15.22 10.1.15.23 2 client nodes: 10.1.15.31 10.1.15.32 3 consul cluster nodes: 10.1.15.11 10.1.15.12 10.1.15.13

Operating system and Environment details

Fedora release 35 (Thirty Five)

Issue

We had an issue with nomad cluster, master nodes had high ram/cpu consuming and in logs we saw only these CSI related errors: [ERROR] nomad.csi_plugin: csi raft apply failed: error="plugin in use" method=delete Then leader node was hanged at all, and we saw that cluster failed, client nodes went down. We saw in logs that new leader was not elected.

Reproduction steps

We had to re-create the failed nomad master from scratch and remove CSI from our cluster. But we've managed to reproduce this issue another way - just closed 4647 port on master leader node: iptables -A INPUT -p tcp --destination-port 4647 -j DROP We assume that it imitates the issue we had with CSI because it blocks not all, but a part of functionality of master node, what likewise we think happened when CSI hanged our leader node.

Expected Result

We expected that a new leader would be elected and the cluster would not fail.

Actual Result

New leader was not elected, client nodes were down. nomad server members output on leader node (where 4647 is blocked):

nomad-server-1.global  10.1.15.21  4648  alive   true    3             1.5.6  dc1         global
nomad-server-2.global  10.1.15.22  4648  alive   false   3             1.5.6  dc1         global
nomad-server-3.global  10.1.15.23  4648  alive   false   3             1.5.6  dc1         global

nomad server members output on non-leader node:

nomad-server-1.global  10.1.15.21  4648  alive   false   3             1.5.6  dc1         global
nomad-server-2.global  10.1.15.22  4648  alive   false   3             1.5.6  dc1         global
nomad-server-3.global  10.1.15.23  4648  alive   false   3             1.5.6  dc1         global

Error determining leaders: 1 error occurred:
        * Region "global": Unexpected response code: 500 (rpc error: failed to get conn: rpc error: lead thread didn't get connection)

Nomad logs

server1-leader.log server2.log server3.log client1.log client2.log

beninghton commented 1 year ago

@tgross btw I tested this case with 5 nomad server cluster, and the behaviour is exactly the same. I think that's not acceptable at all. 5 server cluster should be alive even after 2 nodes failure. I guess we will see the same result on any number of server nodes. It does not make sense to add more nodes to provide better High Availability because it's not right HA cluster behaviour.

tgross commented 1 year ago

ref https://github.com/hashicorp/nomad/issues/18063#issuecomment-1653593278

jrasell commented 1 year ago

Hi @beninghton and thanks for raising this issue.

We expected that a new leader would be elected and the cluster would not fail.

Yes I would agree and this should be the behaviour you are experiencing. The server logs you have included only have the error messages. These logs at a glance seem to be expected, as the server is unable to connect with the blocked server within the gossip pool. Do you have these logs available in debug mode, so we can see what other actions the server is performing? Without additional information like this, it's hard to understand what the problem might be.

If possible, could you also share the configuration you are using for the server and clients, retracted as required? I'll currently focus on this issue for communication, as I expect the information provided will also be relevant to https://github.com/hashicorp/nomad/issues/17974

beninghton commented 1 year ago

@jrasell Unfortunately we didn't have "DEBUG" log level enabled, only default "INFO". But now I've enabled debug and I'm gonna share the servers configuration server.hcl config we use:

server {
  enabled = true
  bootstrap_expect = 3
  heartbeat_grace = "30s"
}
acl {
  enabled = true
}
vault {
  enabled = true
  address = "http://vault:8200"
  create_from_role = "nomad-cluster"
  token = "secret_token"
}
log_level = "DEBUG"
enable_debug = true

And this is nomad.hcl:

datacenter = "dc1"
data_dir = "/opt/nomad/data"
telemetry {
  collection_interval = "1s"
  disable_hostname = true
  prometheus_metrics = true
  publish_allocation_metrics = true
  publish_node_metrics = true
}

And this is client configuration client.hcl:

bind_addr = "{{ GetInterfaceIP \"ens160\" }}"
client {
  enabled = true
  node_class = "common"
}
plugin "docker" {
  config {
    allow_privileged = true
    volumes {
      enabled = true
    }
    auth {
      config = "/opt/nomad/data/docker-auth.json"
    }
    allow_caps = [
      "CHOWN",
      "DAC_OVERRIDE",
      "FSETID",
      "FOWNER",
      "MKNOD",
      "SETGID",
      "SETUID",
      "SETFCAP",
      "SETPCAP",
      "NET_BIND_SERVICE",
      "SYS_CHROOT",
      "KILL",
      "AUDIT_WRITE",
      "NET_RAW"
    ]
    extra_labels = [
      "job_id",
      "job_name",
      "task_group_name",
      "task_name",
      "namespace",
      "node_id",
      "node_name"
    ]
  }
}
plugin "raw_exec" {
  config {
    enabled = false
  }
}
acl {
  enabled = true
}
vault {
  enabled = true
  address = "http://vault:8200"
}

This information is relevant to https://github.com/hashicorp/nomad/issues/17974 and https://github.com/hashicorp/nomad/issues/18063. Btw I'm sure that you will easily reproduce this problem on your environment using Iptables command I posted.

roman-vynar commented 3 weeks ago

Same here. Tested on Nomad versions 1.8.4, 1.9.1.

Super easy to reproduce: Go to a leader node and run this one liner:

export SLEEP=60 ; \
iptables -I INPUT 1 -p tcp --match multiport --dports 4646:4648 -j REJECT ; \
iptables -I INPUT 1 -p udp --match multiport --dports 4646:4648 -j REJECT ; \
sleep $SLEEP ; \
iptables -D INPUT -p tcp --match multiport --dports 4646:4648 -j REJECT ; \
iptables -D INPUT -p udp --match multiport --dports 4646:4648 -j REJECT

It will block input ports 4646-4648 for 60s and then unblock (delete rules). No election happens. And the most critical - all allocations in the cluster will be recreated once the above command completes.

It is worth to say - if I block INPUT and OUTPUT the election will happen, a new leader is elected and all good. If only INPUT is blocked for at least 60s then it is a big problem.

You may think it is a non-realistic scenario but no - it is really happened on Azure Cloud by itself when a VM being a leader got input somehow blocked and no elections happened for hours... (

hashicorp / nomad