hashicorp / terraform-aws-nomad

A Terraform Module for how to run Nomad on AWS using Terraform and Packer
Apache License 2.0

Missing ports in nomad security group #104

Open gthieleb opened 3 years ago

gthieleb commented 3 years ago

I have an issue with the "security group" module when the incoming_cidr is adapted to a custom IP range (i.e. something other than 0.0.0.0/0).
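
For context, the rules are attached roughly like this (a sketch only; the inputs assume the module's nomad-security-group-rules submodule, and the ref and CIDR values are placeholders):

module "nomad_security_group_rules" {
  source = "github.com/hashicorp/terraform-aws-nomad//modules/nomad-security-group-rules?ref=v0.10.0"

  security_group_id           = aws_security_group.nomad.id
  allowed_inbound_cidr_blocks = ["10.10.0.0/16"]
}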

My ASG is created with the help of the terraform-aws-modules/terraform-aws-autoscaling module, using custom userdata on Ubuntu 20.04. The userdata adds the HashiCorp apt repository and performs a default installation of Nomad and Consul:

userdata script:

#!/bin/sh
# Runs as root via cloud-init, so sudo is not needed.

# Base tooling
apt update
apt install -y \
  software-properties-common \
  curl \
  vim-tiny \
  netcat \
  file \
  bash-completion

# Add the HashiCorp apt repository
curl -fsSL https://apt.releases.hashicorp.com/gpg | apt-key add -
apt-add-repository "deb [arch=amd64] https://apt.releases.hashicorp.com $(lsb_release -cs) main"
apt update

# Default installation of Consul and Nomad
apt install -y consul
apt install -y nomad
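
Depending on the package version, the consul and nomad systemd units may not be enabled automatically; if they are not, the userdata needs a final step along these lines:

systemctl enable --now consul nomad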

/etc/nomad.d/nomad.hcl:

datacenter = "us-east-1"
data_dir = "/opt/nomad/data"
bind_addr = "0.0.0.0"

# Enable the server
server {
  enabled = true
  bootstrap_expect = 3
}

consul {
  address = "127.0.0.1:8500"
  token   = "***************************"
}
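
Note there is no server_join stanza here, so the servers rely on the local Consul agent to discover each other. A direct cloud auto-join would look roughly like this (a sketch, reusing the EC2 tag from the Consul config below):

server {
  enabled          = true
  bootstrap_expect = 3

  server_join {
    retry_join = ["provider=aws tag_key=Nomad-Cluster tag_value=dev-nomad"]
  }
}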

/etc/consul.d/consul.hcl:

datacenter = "us-east-1"
server              = true
bootstrap_expect    = 3
data_dir            = "/opt/consul/data"
client_addr         = "0.0.0.0"
log_level           = "INFO"
ui                  = true

# AWS cloud join
retry_join          = ["provider=aws tag_key=Nomad-Cluster tag_value=dev-nomad"]

# Max connections for the HTTP API
limits {
  http_max_conns_per_client = 128
}
performance {
    raft_multiplier = 1
}

acl {
  enabled        = true
  default_policy = "allow"
  enable_token_persistence = true
  tokens {
    master = "***************************************"
  }
}

encrypt = "************************"
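
For completeness: the redacted encrypt value is a gossip key of the kind produced by consul keygen, and the ACL master token is an arbitrary secret (commonly a UUID):

consul keygen
# prints a base64-encoded random key suitable for the encrypt setting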

When opening the Nomad UI in a browser, I see the following message:

No Cluster Leader

The cluster has no leader. Read about Outage Recovery.

In the nomad logs it shows:

sudo journalctl -t nomad:

Oct 02 11:43:51 ip-10-10-10-93 nomad[3802]:     2021-10-02T11:43:51.616Z [ERROR] worker: failed to dequeue evaluation: error="No cluster leader"
Oct 02 11:43:57 ip-10-10-10-93 nomad[3802]:     2021-10-02T11:43:57.320Z [ERROR] http: request failed: method=GET path=/v1/agent/health?type=server error="{"server":{"ok":false,"message":"No cluster leader"}}" code=500

It seems that traffic on port 4647 is not allowed within the security group.

Trying to reach that port on one server node from another times out:

nc -zv -w 5 10.10.10.48 4647
nc: connect to 10.10.10.48 port 4647 (tcp) timed out: Operation now in progress
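
In plain Terraform, the fix amounts to one extra ingress rule on the cluster's security group (a sketch; the resource references and the CIDR are placeholders):

resource "aws_security_group_rule" "nomad_rpc" {
  type              = "ingress"
  from_port         = 4647
  to_port           = 4647
  protocol          = "tcp"
  cidr_blocks       = ["10.10.0.0/16"]
  security_group_id = aws_security_group.nomad.id
}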

After allowing port 4647 within the security group, the cluster's server nodes start replicating with each other:

Oct 02 11:44:06 ip-10-10-10-93 nomad[3802]:     2021-10-02T11:44:06.257Z [INFO]  nomad: serf: EventMemberJoin: ip-10-10-10-48.global 10.10.10.48
Oct 02 11:44:06 ip-10-10-10-93 nomad[3802]:     2021-10-02T11:44:06.257Z [INFO]  nomad: serf: EventMemberJoin: ip-10-10-10-12.global 10.10.10.12
Oct 02 11:44:06 ip-10-10-10-93 nomad[3802]:     2021-10-02T11:44:06.257Z [INFO]  nomad: adding server: server="ip-10-10-10-48.global (Addr: 10.10.10.48:4647) (DC: us-east-1)"
Oct 02 11:44:06 ip-10-10-10-93 nomad[3802]:     2021-10-02T11:44:06.265Z [INFO]  nomad: found expected number of peers, attempting to bootstrap cluster...: peers=10.10.10.93:4647,10.10.10.48:4647,10.10.10.12:4647
Oct 02 11:44:06 ip-10-10-10-93 nomad[3802]:     2021-10-02T11:44:06.270Z [INFO]  nomad: adding server: server="ip-10-10-10-12.global (Addr: 10.10.10.12:4647) (DC: us-east-1)"
Oct 02 11:44:06 ip-10-10-10-93 nomad[3802]:     2021-10-02T11:44:06.725Z [ERROR] worker: failed to dequeue evaluation: error="No cluster leader"
Oct 02 11:44:07 ip-10-10-10-93 nomad[3802]:     2021-10-02T11:44:07.151Z [WARN]  nomad.raft: heartbeat timeout reached, starting election: last-leader=
Oct 02 11:44:07 ip-10-10-10-93 nomad[3802]:     2021-10-02T11:44:07.151Z [INFO]  nomad.raft: entering candidate state: node="Node at 10.10.10.93:4647 [Candidate]" term=2
Oct 02 11:44:07 ip-10-10-10-93 nomad[3802]:     2021-10-02T11:44:07.162Z [INFO]  nomad.raft: election won: tally=2
Oct 02 11:44:07 ip-10-10-10-93 nomad[3802]:     2021-10-02T11:44:07.162Z [INFO]  nomad.raft: entering leader state: leader="Node at 10.10.10.93:4647 [Leader]"
Oct 02 11:44:07 ip-10-10-10-93 nomad[3802]:     2021-10-02T11:44:07.163Z [INFO]  nomad.raft: added peer, starting replication: peer=10.10.10.48:4647
Oct 02 11:44:07 ip-10-10-10-93 nomad[3802]:     2021-10-02T11:44:07.163Z [INFO]  nomad.raft: added peer, starting replication: peer=10.10.10.12:4647
Oct 02 11:44:07 ip-10-10-10-93 nomad[3802]:     2021-10-02T11:44:07.163Z [INFO]  nomad: cluster leadership acquired
Oct 02 11:44:07 ip-10-10-10-93 nomad[3802]:     2021-10-02T11:44:07.165Z [INFO]  nomad.raft: pipelining replication: peer="{Voter 10.10.10.12:4647 10.10.10.12:4647}"
Oct 02 11:44:07 ip-10-10-10-93 nomad[3802]:     2021-10-02T11:44:07.166Z [WARN]  nomad.raft: appendEntries rejected, sending older logs: peer="{Voter 10.10.10.48:4647 10.10.10.48:4647}" next=1
Oct 02 11:44:07 ip-10-10-10-93 nomad[3802]:     2021-10-02T11:44:07.168Z [INFO]  nomad.raft: pipelining replication: peer="{Voter 10.10.10.48:4647 10.10.10.48:4647}"
Oct 02 11:44:07 ip-10-10-10-93 nomad[3802]:     2021-10-02T11:44:07.186Z [INFO]  nomad.core: established cluster id: cluster_id=c40704d5-7b77-0ea7-9da2-eef39a58b4bb create_time=1633175047177920656

The question for me is whether port 4647 is new, or just missing from the security group module?

The config from an installation using the root module differs slightly, but I can't see any pinning to a different port:

/opt/nomad/config/default.hcl:

datacenter = "us-east-1c"
name       = "i-06382f65cc9495792"
region     = "us-east-1"
bind_addr  = "0.0.0.0"

advertise {
  http = "172.31.84.5"
  rpc  = "172.31.84.5"
  serf = "172.31.84.5"
}

server {
  enabled = true
  bootstrap_expect = 3
}

consul {
  address = "127.0.0.1:8500"
}
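
Since neither config contains a ports stanza, both installations use Nomad's documented defaults: 4646 (HTTP), 4647 (RPC) and 4648 (serf, TCP and UDP). A different port would have to be pinned explicitly, e.g.:

ports {
  http = 4646
  rpc  = 4647
  serf = 4648
}
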
gthieleb commented 3 years ago

Update: it seems port 4648 was missing, too. I did not notice this in my previous tests because I had previously enabled an allow-all rule inside the security group.

Oct 02 12:44:49 ip-10-10-10-69 nomad[4129]:     2021-10-02T12:44:49.396Z [ERROR] nomad: error looking up Nomad servers in Consul: error="contacted 0 Nomad Servers: 2 errors occurred:
Oct 02 12:44:49 ip-10-10-10-69 nomad[4129]:         * Failed to join 10.10.10.38: dial tcp 10.10.10.38:4648: i/o timeout
Oct 02 12:44:49 ip-10-10-10-69 nomad[4129]:         * Failed to join 10.10.10.14: dial tcp 10.10.10.14:4648: i/o timeout
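
For a custom CIDR, all three ports have to be opened. A compact way to express that (a sketch; the resource name, references and CIDR are placeholders) is:

resource "aws_security_group_rule" "nomad" {
  for_each = { http = 4646, rpc = 4647, serf = 4648 }

  type              = "ingress"
  from_port         = each.value
  to_port           = each.value
  protocol          = "tcp"
  cidr_blocks       = ["10.10.0.0/16"]
  security_group_id = aws_security_group.nomad.id
}

Serf additionally uses UDP on 4648, so a matching protocol = "udp" rule is needed for that port.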