hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Consul services drop periodically #22427

Open liuchenrang opened 1 month ago

liuchenrang commented 1 month ago

Nomad version

Output from nomad version: v1.7.7

Operating system and Environment details

stream-8
SSH_CONNECTION=198.19.249.3 60379 198.19.249.13 22
LANG=zh_CN.UTF-8
HISTCONTROL=ignoredups
HOSTNAME=centos-8-3
which_declare=declare -f
XDG_SESSION_ID=c32
USER=root
PWD=/root
HOME=/root
SSH_CLIENT=198.19.249.3 60379 22
SSH_TTY=/dev/pts/2
MAIL=/var/spool/mail/root
TERM=xterm-256color
SHELL=/bin/bash
SHLVL=1
LOGNAME=root
DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/0/bus
XDG_RUNTIME_DIR=/run/user/0
PATH=/opt/orbstack-guest/bin-hiprio:/opt/orbstack-guest/data/bin/cmdlinks:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/opt/orbstack-guest/bin:/root/bin
DEBUGINFOD_URLS=https://debuginfod.centos.org/
HISTSIZE=1000
LESSOPEN=||/usr/bin/lesspipe.sh %s
BASH_FUNC_which%%=() {  ( alias;
 eval ${which_declare} ) | /usr/bin/which --tty-only --read-alias --read-functions --show-tilde --show-dot $@
}
_=/usr/bin/env

Issue

The cluster runs in cluster mode with both the client and server agents enabled on each node, and the job type is service. The service gets deregistered from Consul and is then automatically re-registered shortly afterward.

Reproduction steps

1. Run a cluster of 3 nodes.
2. On each node, the Nomad client and server agents are enabled at the same time (a sketch of such an agent configuration is shown below).
3. Deploy 3 instances of whoami.
4. Consul also runs as a cluster.
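
For reference, a minimal sketch of what such a combined client-and-server agent configuration could look like on each node (the data_dir path and the bootstrap_expect value are assumptions for illustration, not taken from this report):

# agent.hcl, one per node (illustrative only)
datacenter = "dc1"
data_dir   = "/opt/nomad/data"   # assumed path, adjust to your environment

server {
  enabled          = true
  bootstrap_expect = 3   # three-node cluster as described above
}

client {
  enabled = true
}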

Expected Result

All three whoami instances stay registered and healthy in Consul.

Actual Result

The services periodically drop offline and come back online again shortly afterward.

Job file (if appropriate)

job "web" {
  datacenters = ["dc1"]
  type        = "service"
  # constraint {
  #   attribute = "${attr.unique.hostname}"
  #   value     = "centos-8-3"
  # }

  meta {
    stream = "3"
  }

  group "app" {
    count = 3
    network {
      port "http" {
        to = 8181
      }
    }

    service {
      tags = ["urlprefix-/"]
      port = "http"
      name = "hhhhlll"
      check {
        name     = "AppWebCheck"
        type     = "http"
        port     = "http"
        path     = "/"
        interval = "15s"
        timeout  = "20s"
      }
    }
    task "hiweb" {
      driver = "docker"
      config {
        image = "xinghuo/hi:20240530"
        ports = ["http"]
      }
    }
  }
}

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

(edited by @tgross to make syntax legible)

liuchenrang commented 1 month ago

I analyzed the code in service_client.go, in func (c *ServiceClient) sync. Four checks there can be skipped, which ultimately leads to the deregistration. The first is the check for whether the service is recorded locally, commented in the code as // Known service, skip. The second is the condition if !isNomadService(id) || !c.isClientAgent; since client = true is enabled, this branch is also skipped as expected. Execution finally reaches the deregistration logic and the service is removed. My question: if the expectation is that applications handle Consul service registration and discovery themselves instead of relying on Nomad's check feature, then this behavior is not a problem. But I want to use the built-in check feature, which is why I am reporting this issue.

liuchenrang commented 1 month ago

If this is a bug, I think the cause is this: c.services is the local service list, but the final traversal iterates over all of the services registered in Consul. That causes services managed by other nodes to be deleted, and because every node runs in cluster mode the agents end up deleting each other's services!

tgross commented 2 weeks ago

Hi @liuchenrang! Does each one of your Nomad client agents have its own Consul agent? They should not share Consul agents, as described in the Consul configuration docs:

An important requirement is that each Nomad agent talks to a unique Consul agent. Nomad agents should be configured to talk to Consul agents and not Consul servers. If you are observing flapping services, you may have multiple Nomad agents talking to the same Consul agent. As such avoid configuring Nomad to talk to Consul via DNS such as consul.service.consul
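
If the agents are currently sharing one Consul agent, pointing each Nomad agent at the Consul agent on its own node should stop the flapping. As a minimal sketch (assuming Consul's default local address and port, which are not stated in this issue), the relevant part of each Nomad agent configuration would be:

consul {
  # Talk to the Consul agent running on this same node,
  # not to a Consul server and not to a DNS name such as consul.service.consul.
  address = "127.0.0.1:8500"
}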