hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Duplicate nomad services #6842

Open · OneCricketeer opened this issue 5 years ago

OneCricketeer commented 5 years ago

Overview of the Issue

Duplicated service registration of Nomad (and no Nomad icon). I'm not sure whether this is a Consul or a Nomad issue...

As shown in the screenshot, I have an extra nomad and nomad-client service on one node, but I haven't been able to track down where they originate. The same service checks also appear under the appropriately labelled nomad-clients and nomad-servers services.

(screenshot: Consul UI services list showing the duplicated nomad and nomad-client entries)

Reproduction Steps

After installing the cluster, I believe everything was working as expected without the extra nomad and nomad-client services. I then played around with `systemctl enable --now nomad`, and after that I remember seeing the extra registrations.

Consul info for both Client and Server

I only have 3 nodes, all servers

Server info:

```
agent:
    check_monitors = 0
    check_ttls = 0
    checks = 8
    services = 9
build:
    prerelease = 
    revision = 
    version = 1.6.2
consul:
    acl = disabled
    bootstrap = false
    known_datacenters = 1
    leader = false
    leader_addr = 192.168.1.121:8300
    server = true
raft:
    applied_index = 133907
    commit_index = 133907
    fsm_pending = 0
    last_contact = 26.883494ms
    last_log_index = 133907
    last_log_term = 1094866
    last_snapshot_index = 131138
    last_snapshot_term = 1094751
    latest_configuration = [{Suffrage:Voter ID:a1bd80bd-e3c1-96b4-d5bc-30e620485b96 Address:192.168.1.122:8300} {Suffrage:Voter ID:29349689-81c8-b1bd-4cec-daa1a894d773 Address:192.168.1.121:8300} {Suffrage:Voter ID:e5049ebb-25b4-fe97-3da5-0d4ee3fe3d33 Address:192.168.1.120:8300}]
    latest_configuration_index = 40078
    num_peers = 2
    protocol_version = 3
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Follower
    term = 1094866
runtime:
    arch = arm64
    cpu_count = 4
    goroutines = 93
    max_procs = 4
    os = linux
    version = go1.12.13
serf_lan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 38
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 66
    members = 3
    query_queue = 0
    query_time = 1
serf_wan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 49
    members = 3
    query_queue = 0
    query_time = 1
```

Operating system and Environment details

Ubuntu ARM64

Extra

Only one nomad process is running:

```
$ sudo ps -ef | grep nomad | grep -v grep
root     23859     1  5 03:40 ?        00:07:06 /usr/local/bin/nomad agent -config=/etc/nomad.d
```

And nomad is otherwise healthy:

```
$ nomad node status
ID        DC           Name   Class   Drain  Eligibility  Status
a9dcb18b  picocluster  pico1  <none>  false  eligible     ready
4fd226cf  picocluster  pico0  <none>  false  eligible     ready
14bbadbc  picocluster  pico2  <none>  false  eligible     ready
```
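
For completeness, here is a sketch of how the local Consul agent's own view of its registered services could be inspected (assumes the HTTP API is on the default port; jq is only used for readability):

```
# List the service IDs and names registered with the local agent
$ curl -s http://localhost:8500/v1/agent/services | jq 'to_entries[] | {id: .key, name: .value.Service}'
```

If the duplicates show up here with distinct service IDs, they would appear to be registered against this agent rather than coming in from another node.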
OneCricketeer commented 5 years ago

Nomad Configuration

Here are the configs for the one node that shows those services

Base:

```
$ cat /etc/nomad.d/base.hcl
name = "pico0"
region = "lan"
datacenter = "picocluster"

enable_debug = false
disable_update_check = false

bind_addr = "192.168.1.120"

advertise {
    http = "192.168.1.120:4646"
    rpc = "192.168.1.120:4647"
    serf = "192.168.1.120:4648"
}

ports {
    http = 4646
    rpc = 4647
    serf = 4648
}

consul {
    # The address to the Consul agent.
    address = "localhost:8500"
    token = ""
    # The service name to register the server and client with Consul.
    server_service_name = "nomad-servers"
    client_service_name = "nomad-clients"
    tags = {}
    # Enables automatically registering the services.
    auto_advertise = true
    # Enabling the server and client to bootstrap using Consul.
    server_auto_join = true
    client_auto_join = true
}

data_dir = "/var/nomad"

log_level = "INFO"
enable_syslog = true

leave_on_terminate = true
leave_on_interrupt = false

acl {
    enabled = false
    token_ttl = "30s"
    policy_ttl = "30s"
    replication_token = ""
}

vault {
    enabled = false
    address = "0.0.0.0"
    allow_unauthenticated = true
    create_from_role = ""
    task_token_ttl = ""
    ca_file = ""
    ca_path = ""
    cert_file = ""
    key_file = ""
    tls_server_name = ""
    tls_skip_verify = false
    token = ""
}
```
Client:

```
client {
    enabled = true
    node_class = ""
    no_host_uuid = false

    max_kill_timeout = "30s"

    network_speed = 0
    cpu_total_compute = 0

    gc_interval = "1m"
    gc_disk_usage_threshold = 80
    gc_inode_usage_threshold = 70
    gc_parallel_destroys = 2

    reserved {
        cpu = 0
        memory = 0
        disk = 0
    }
}
```
Server:

```
server {
    enabled = true
    bootstrap_expect = 3

    rejoin_after_leave = false

    enabled_schedulers = ["service","batch","system"]
    num_schedulers = 4

    node_gc_threshold = "24h"
    eval_gc_threshold = "1h"
    job_gc_threshold = "4h"

    encrypt = ""
}
```
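
Since the consul stanza above registers this node under nomad-servers / nomad-clients, one thing worth checking is which nodes each of the four service names is attached to. A rough sketch (assumes the consul CLI can reach the local agent on its default HTTP address):

```
$ consul catalog nodes -service=nomad
$ consul catalog nodes -service=nomad-client
$ consul catalog nodes -service=nomad-servers
$ consul catalog nodes -service=nomad-clients
```

If only pico0 comes back for the plain nomad and nomad-client names, that would narrow the duplicates down to something registered on that one agent.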
OneCricketeer commented 5 years ago

systemd unit file

```
### BEGIN INIT INFO
# Provides:          nomad
# Required-Start:    $local_fs $remote_fs
# Required-Stop:     $local_fs $remote_fs
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
# Short-Description: distributed scheduler
# Description:       distributed, highly available, datacenter-aware scheduler
### END INIT INFO

[Unit]
Description=nomad agent
Documentation=https://nomadproject.io/docs/
Wants=basic.target
After=basic.target network.target

[Service]
User=root
Group=bin
ExecStart=/usr/local/bin/nomad agent -config=/etc/nomad.d
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
KillSignal=SIGINT
LimitNOFILE=infinity
LimitNPROC=infinity
Restart=on-failure
RestartSec=42s
StartLimitBurst=3
StartLimitIntervalSec=10
TasksMax=infinity

[Install]
WantedBy=multi-user.target
```
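
As a sanity check that systemd is the only thing managing the process (output will obviously vary):

```
# Confirm a single active unit and no stray nomad unit files
$ systemctl status nomad
$ systemctl list-unit-files | grep -i nomad
```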
crhino commented 5 years ago

Hi there. One debugging step that might provide data is to look at the output of `curl localhost:8500/v1/catalog/service/nomad` to see which nodes the service is defined on.
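
For illustration, a sketch of what that check might look like (field names per the catalog API; jq is only used for readability):

```
$ curl -s http://localhost:8500/v1/catalog/service/nomad | jq '.[] | {Node, ServiceID, ServiceName, ServiceTags}'
$ curl -s http://localhost:8500/v1/catalog/service/nomad-client | jq '.[] | {Node, ServiceID, ServiceName, ServiceTags}'
```

The Node and ServiceID fields should show where each registration lives.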

It's hard to tell how exactly you got into this situation, so if you were able to provide reproduction steps, that would be helpful as well.

OneCricketeer commented 5 years ago

I can see in the UI that the services are defined on pico0, which is the node whose configuration files I provided above.

I installed each Nomad server and client via https://github.com/brianshumate/ansible-nomad

And, as mentioned, I had used `systemctl enable --now nomad` to ensure that nomad started on reboot.

Those were the only changes I remember making regarding Nomad.