
Services not unregistered #16616

Closed dani closed 1 month ago

dani commented 1 year ago

Just upgraded to Nomad 1.5.2. Since then, services are not always unregistered from the Consul service catalog when they are shut down or upgraded, so old service versions appear as failed, e.g.:

(screenshot: old service versions showing as failed in the Consul UI)

Environment:

I haven't yet found a pattern that reproduces it 100% of the time.

jrasell commented 1 year ago

Hi @dani, do you have any logs from the clients that were running the allocations that had services that should be deregistered? If you do and can pass them along, I can take a look through them and see if I can identify anything useful. If you have any other useful information that would be great, in order to try and reproduce this.

shoenig commented 1 year ago

1.5.2 included https://github.com/hashicorp/nomad/pull/16289/files which was supposed to fix a bug where we would attempt to deregister services twice. The key difference is we now set a flag that the services have been deregistered after the PostRun() allocrunner hook is run, preventing further attempts at deregistration.

Thinking about it now and reading our own docs, it is unclear whether PostRun implies an alloc is terminal ... if it isn't, and the services get re-registered for the same allocation, they'll never be deregistered.
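
To illustrate the failure mode described above (a minimal sketch, not the actual allocrunner code): a once-only "deregistered" guard behaves like this if the hook can fire while the alloc is still alive and services are re-registered afterwards.

package main

import "fmt"

// Illustrative sketch only: a "deregistered" flag that latches the first
// time services are removed and is never cleared again.
type serviceHook struct {
	deregistered bool
}

// PostRun removes the services on its first run, then latches the flag.
func (h *serviceHook) PostRun() {
	if h.deregistered {
		return // every later deregistration attempt becomes a no-op
	}
	fmt.Println("deregistering services")
	h.deregistered = true
}

// Update re-registers services for the same allocation, e.g. if the alloc
// was not actually terminal when PostRun fired.
func (h *serviceHook) Update() {
	fmt.Println("re-registering services")
}

func main() {
	h := &serviceHook{}
	h.PostRun() // services removed, flag latched
	h.Update()  // alloc still alive, services come back
	h.PostRun() // no-op: the re-registered services are never removed
}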

dani commented 1 year ago

I was just able to trigger it on my prometheus job.

Here are my system logs during this rolling update:

nomad_unreg.txt

jrasell commented 1 year ago

Hi @dani, I've not been able to reproduce this yet locally; are you able to share the jobspec, or a redacted version you are using and what exactly is being changed before you register the new version? Thanks.

dani commented 1 year ago

OK, this particular job file was quite big. I'll try to reproduce with a simpler one (but I'll first have to install 1.5.2 again, as I had to revert to 1.5.1 because this issue made my test cluster totally unusable).

martdah commented 1 year ago

I have seen the same issue; I've even reproduced it using the counter demo app. The issue only happens to me when ACL is enabled. Nomad: 1.5.2-1, Consul: 1.15.1-1, Ubuntu: 20.04.

Deploy the demo app, add an additional tag, and re-deploy; you now have two instances registered in Consul. Exec into the downstream and curl $NOMAD_UPSTREAM_ADDR_servicename a number of times and you will see some requests return "connection reset by peer", as Consul is now returning services that are "completed" in Nomad.
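
A rough version of that check, run from inside the downstream task (assuming the upstream is named count-api as in the counter demo; adjust the env var to your upstream's name):

# repeat the request a few times; with the stale instance still registered,
# some attempts fail with "connection reset by peer"
for i in $(seq 1 20); do
  curl -s -o /dev/null "$NOMAD_UPSTREAM_ADDR_count_api" && echo "ok" || echo "failed"
done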

hope this helps.

I have also reverted my lab to 1.5.1-1

chenjpu commented 1 year ago

I had the same problem, and the same thing happens with NSD (Nomad native service discovery).

ngcmac commented 1 year ago

Hi, we had the same problem after upgrading Nomad from 1.4.5 to 1.4.7 and restarting the Consul agents on the nodes. It seems to only affect services in the Consul service mesh. After the upgrade, Nomad services using the connect stanza with proxied upstreams showed old versions of the deployment failing in Consul (v1.14.4).

Reverted to Nomad 1.4.5.

Regards.

jrasell commented 1 year ago

Hi everyone and thanks for the information and additional context. We have been able to reproduce this locally and have some useful information to start investigating, so will update here once we have anything more.

tgross commented 1 year ago

Additional repro that I've closed as a dupe, but just in case there's anything useful in the logs: https://github.com/hashicorp/nomad/issues/16739

jrasell commented 1 year ago

Hi everyone, we are continuing to look into this, and while we were able to reproduce it in one form, I wanted to gather some more information.

For those who have experienced this: are you setting the Consul agent ACL token via the consul acl set-agent-token command, the API equivalent, or via the agent config? This is a requirement in Consul v1.15.0 and later.
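
For reference, the three approaches mentioned here look roughly like this (the token value is a placeholder):

# CLI
consul acl set-agent-token agent "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"

# HTTP API equivalent
curl -X PUT --header "X-Consul-Token: $CONSUL_HTTP_TOKEN" \
  --data '{"Token": "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"}' \
  http://127.0.0.1:8500/v1/agent/token/agent

# agent configuration (HCL)
acl {
  tokens {
    agent = "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"
  }
}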

It seems to specifically affect Nomad v1.5.2, v1.4.7, and v1.3.12. If you do set the above token, are you able to provide context on the deployment that has the problem?

dani commented 1 year ago

In my case, I set the token in the config file, like

acl {
  enabled = true
  enable_token_persistence = true
  default_policy = "deny"
  tokens {
    default = "XXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"
  }
}

Is this unsupported now? (It's easier to set it in the config when deploying with tools like Ansible.)

ngcmac commented 1 year ago

We are also setting it via consul config:

{
  "acl": {
    "enabled": true,
    "tokens": {
      "agent": "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"
    },
    "default_policy": "deny",
    "down_policy": "extend-cache",
    "enable_token_persistence": true
  }
}

suikast42 commented 1 year ago

This issue is still present in Nomad v1.5.3.

CarelvanHeerden commented 1 year ago

An observation from my side.

I created this bash script to clean up the services that were not unregistered in Consul, as an interim solution.

#!/bin/bash

CONSUL_HTTP_ADDR="http://consul.service.consul:8500"
CONSUL_TOKEN=XXXX

# Get all unhealthy checks
unhealthy_checks=$(curl -s --header "X-Consul-Token: ${CONSUL_TOKEN}" "${CONSUL_HTTP_ADDR}/v1/health/state/critical" | jq -c '.[]')

# Iterate over the unhealthy checks and deregister the associated service instances
echo "$unhealthy_checks" | while read -r check; do
  service_id=$(echo "$check" | jq -r '.ServiceID')
  node=$(echo "$check" | jq -r '.Node')

  if [ "$service_id" != "null" ] && [ "$node" != "null" ]; then
    echo "Deregistering unhealthy service instance: ${service_id} on node ${node}"
    curl --header "X-Consul-Token: ${CONSUL_TOKEN}" -X PUT "${CONSUL_HTTP_ADDR}/v1/catalog/deregister" -d "{\"Node\": \"${node}\", \"ServiceID\": \"${service_id}\"}"
  else
    echo "Skipping check with no associated service instance or node"
  fi
done

This works and the service instances that are "dead" are removed from Consul UI, but only for a few seconds. They reappear in the UI moments later. I cannot confirm this, but I suspect Nomad is re-registering them.
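
One possible explanation for the entries coming back, independent of anything Nomad does: these services are registered with the local Consul agent, and Consul's anti-entropy sync restores agent-owned services that were removed only through the catalog API. A hedged variant of the loop above that deregisters against the owning agent instead (assuming each node's agent HTTP API is reachable on port 8500 under its node name):

# deregister against the agent that owns the service rather than the catalog;
# ${node} is the Consul node name and may need mapping to a reachable address
curl --header "X-Consul-Token: ${CONSUL_TOKEN}" -X PUT \
  "http://${node}:8500/v1/agent/service/deregister/${service_id}"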

This is still on Nomad 1.5.2. Upgrading to 1.5.3 now to test again.

suikast42 commented 1 year ago

This works and the service instances that are "dead" are removed from Consul UI, but only for a few seconds. They reappear in the UI moments later. I cannot confirm this, but I suspect Nomad is re-registering them.

Indeed. The issue isn't in Consul. If you restart the Nomad service, the dead services disappear from both Nomad and Consul.

This is still on Nomad 1.5.2. Upgrading to 1.5.3 now to test again.

1.5.3 has the same bug.

rgruyters commented 1 year ago

We have the same issue in our environment running Consul version 1.15.2 with Nomad version 1.5.2. FWIW, we are setting our agent ACL tokens via config.

{
  "acl": {
    "default_policy": "deny",
    "down_policy": "allow",
    "enabled": true,
    "tokens": {
      "agent": "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"
    }
  }
}

suikast42 commented 1 year ago

We have the same issue in our environment running Consul version 1.15.2 with Nomad version 1.5.2. FWIW, we are setting our agent ACL tokens via config.

{
  "acl": {
    "default_policy": "deny",
    "down_policy": "allow",
    "enabled": true,
    "tokens": {
      "agent": "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"
    }
  }
}

I don't have ACLs enabled. I don't think this issue is related to the ACL system.

bfqrst commented 1 year ago

Not sure if related, but I keep experiencing this without ACLs being turned on. It's hard to pinpoint, but from what I've seen it mostly happens when an ASG cycles the Nomad hosts and the job is rescheduled on the new host. Combo is: Consul 1.15.2 and Nomad 1.5.3.

icyleaf commented 1 year ago

I had this issue with the Nomad service provider without ACLs; the context and details are in #16890.

Ubuntu 22.04.2 LTS, Nomad 1.5.3, Docker 23.0.3

fredwangwang commented 1 year ago

Encountered this as well; I am able to reproduce it quite reliably with the following sequence:

  1. Restart the alloc

  2. Immediately after, stop the alloc

Using Nomad 1.4.7.

(screenshot: stale service entries shown in Consul)

The stale entries in Consul are automatically cleaned up after restarting the Nomad client where the allocation was placed.

I suspect it could be related to https://github.com/hashicorp/nomad/issues/16289, but I haven't confirmed.

Update: I downgraded clients to 1.4.6, and I do not (seem to) see this issue anymore using the above steps.

shoenig commented 1 year ago

https://github.com/hashicorp/nomad/pull/16905 should contain a fix for this - I've checked with a simple alloc restart and job stop repro described by @fredwangwang, but if other folks want to build the branch and confirm that would be helpful.
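
For anyone wanting to try the branch, a rough way to build a binary from the PR (these steps and paths are assumptions, not official build instructions; a recent Go toolchain is required):

git clone https://github.com/hashicorp/nomad.git
cd nomad
git fetch origin pull/16905/head:pr-16905
git checkout pr-16905
go build -o bin/nomad .
./bin/nomad version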

fredwangwang commented 1 year ago

@shoenig thanks for the fix!

rgruyters commented 1 year ago

Our issue is back again with Nomad version 1.5.5. The issue was re-introduced when we stopped (and purged) a job and re-deployed it. If more information is required, please let me know.

shoenig commented 1 year ago

@rgruyters a fix went into 1.5.6

dm-evstafiev commented 11 months ago

Reproduced this in version 1.6.1 with ACL ((

lgfa29 commented 10 months ago

Reproduced this in version 1.6.1 with ACL ((

Oh no, sorry this is still happening to you.

Would you be able to provide some reproduction steps?

lgfa29 commented 10 months ago

https://github.com/hashicorp/nomad/issues/18203 reports a similar issue and with Nomad 1.6.1 as well, so there still seems to be a problem with services not being unregistered (either in Consul or Nomad). I'm going to reopen this one.

blmhemu commented 10 months ago

I am running standalone Nomad with ACL (no Consul) and could repro this (see #18203, as mentioned previously).

jzingh98 commented 10 months ago

Running into this issue as well. Consul version 1.16.1 and Nomad version 1.6.1.

grembo commented 9 months ago

Just as a datapoint: we experienced an issue that looked like this one when updating to 1.5.3 and also subsequently going to 1.5.6 (plus updating Consul to a version more recent than what we ran before).

In our case (running Consul 1.16.0 and Nomad 1.5.6), it was caused by the client node's agent policy containing the FQDN of the node, which for some reason worked with older versions of Consul.

So instead of:

node "compute2.dc1.consul" { policy = "write" }

we changed the agent policy to contain

node "compute2" { policy = "write" }

We just deployed the fix and our simple manual tests, which failed before, are working ok now. If we encounter the issue again when running more complex payloads I'll report it here.
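
A generic sketch of the same node-naming idea (not necessarily the exact policy grembo uses): a node_prefix rule keeps the agent policy valid whether or not the registered node name carries the domain suffix:

node_prefix "compute" {
  policy = "write"
}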

MattJustMatt commented 9 months ago

An additional datapoint-- we are experiencing this issue with native Nomad service discovery too.

We just ran a load test, spinning up thousands of services and killing them at 45s. The test was more than our cluster hardware could support and there was a severe backlog of evals, but eventually things caught up and all the jobs were marked dead.

We were left with 3 "ghost" services showing in the /v1/service/[service-name] endpoint. They point to jobs that were long ago garbage collected after dying.

Restarting the cluster and the clients was not enough. We had to delete the cluster data dir in order to remove the extra services.

marekhanzlik commented 9 months ago

Happened to us on v1.6.2 with Nomad Discovery + ACL

marekhanzlik commented 9 months ago

Is there a way to manually unregister the services in Nomad Discovery? Right now Nomad is in an unusable state for us because services resolve to the wrong addresses.

I've tried running job stop -purge, which removed the job, then ran system gc just to be sure. But after redeploying the job, there is once again a second service endpoint pointing to an already-removed allocation, which renders the cluster unusable.

This should be a priority issue

sofixa commented 9 months ago

@marekhanzlik you can delete zombie service instances like so:

nomad service info -verbose $SERVICE_NAME

and then doing: nomad service delete $SERVICE_NAME $ID
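
A small loop built on those two commands for cleaning up every stale instance of one service (this assumes nomad service info supports -json output as a flat list of registrations with an ID field, and that jq is available; the service name is a placeholder):

SERVICE_NAME=my-service
for id in $(nomad service info -json "$SERVICE_NAME" | jq -r '.[].ID'); do
  nomad service delete "$SERVICE_NAME" "$id"
done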

suikast42 commented 9 months ago

@marekhanzlik you can delete zombie service instances like so:

nomad service info -verbose $SERVICE_NAME

and then doing: nomad service delete $SERVICE_NAME $ID

But that works only for Nomad-registered services without Consul, right?

icyleaf commented 9 months ago

I published a workaround service to solve the issue: https://github.com/icyleaf/nomad-invalid-services-cleaner

job "nomad-invalid-services-cleaner" {
  type        = "batch"

  periodic {
    prohibit_overlap  = true
    cron              = "0/10 * * * * *"
    time_zone         = "Asia/Shanghai"
  }

  group "services_cleaner" {
    task "cleaner" {
      driver = "docker"

      config {
        image = "ghcr.io/icyleaf/nomad-invalid-services-cleaner:0.1"
      }

      template {
        destination = "secrets/.env"
        env         = true
        data        = <<-EOF
        ONESHOT         = true

        NOMAD_ENDPOINT  = http://{{ env "attr.unique.network.ip-address" }}:4646

        {{- with nomadVar "nomad/jobs/nomad-invalid-services-cleaner" }}
        NOMAD_TOKEN = {{ .nomad_token }}
        {{- end}}
        EOF
      }

      resources {
        cpu     = 50
        memory  = 50
      }
    }
  }
}

crystalin commented 8 months ago

Adding a bit of extra info: it happens on my servers too (ACL enabled, no Consul) when the server has a hard reboot (power button switched off/on).

aroundthfur commented 7 months ago

I can confirm this also on v1.6.2 with Nomad Discovery and no ACL.

harrismcc commented 5 months ago

Just noting that I'm having this issue as well, on Nomad 1.7.2 without Consul or ACL. @icyleaf's workaround worked for me, but is obviously not an ideal solution.

shochdoerfer commented 5 months ago

Same here. Nomad 1.7.4 with Nomad provider.

shochdoerfer commented 5 months ago

It happened again today after restarting some of our nodes. Interestingly, when running nomad service info -verbose $SERVICE_NAME, the CLI output shows me a completely different service.

Jamesits commented 1 month ago

Can reproduce the same issue on Nomad 1.7.6 and Nomad service discovery. nomad service info -verbose $SERVICE_NAME returns information about another service which is completely unrelated. Could not recover by stopping then starting the job.

Also, the bogus service's tags will not be updated on a new deployment, which causes our Traefik to reject the new service, as it finds two services of the same name with different settings.

Jamesits commented 1 month ago

I published a workaround service to solve the issue: https://github.com/icyleaf/nomad-invalid-services-cleaner

It doesn't work on my Nomad 1.7.6 cluster somehow, but I ported the code to Go with the latest Nomad SDK and it works well.

package main

import (
    "fmt"
    "github.com/hashicorp/nomad/api"
)

func main() {
    c, err := api.NewClient(api.DefaultConfig())
    if err != nil {
        panic(err)
    }

    // list all services
    services, _, err := c.Services().List(&api.QueryOptions{Namespace: "*"})
    if err != nil {
        panic(err)
    }

    for _, perNamespaceServiceListStub := range services {
        fmt.Printf("Namespace: %s\n", perNamespaceServiceListStub.Namespace)

        for _, svc := range perNamespaceServiceListStub.Services {
            fmt.Printf("\tService: %s\n", svc.ServiceName)
            svcInfo, _, err := c.Services().Get(svc.ServiceName, &api.QueryOptions{Namespace: perNamespaceServiceListStub.Namespace})
            if err != nil {
                panic(err)
            }

            for _, s := range svcInfo {
                // test if we have a service without an associated allocation
                _, _, err := c.Allocations().Info(s.AllocID, &api.QueryOptions{Namespace: perNamespaceServiceListStub.Namespace})
                if err != nil {
                    fmt.Printf("\t\tInvalid service %s: %s, ", s.ID, s.ServiceName)
                    // try to remove it
                    _, err := c.Services().Delete(s.ServiceName, s.ID, &api.WriteOptions{Namespace: s.Namespace})
                    fmt.Printf("remove: %v\n", err)
                } else {
                    fmt.Printf("\t\tNormal service %s: %s\n", s.ID, s.ServiceName)
                }
            }
        }
    }
}

Usage:

export NOMAD_ADDR=xxx
export NOMAD_TOKEN=xxx
go run main.go

grembo commented 1 month ago

Can reproduce the same issue on Nomad 1.7.6 and Nomad service discovery. nomad service info -verbose $SERVICE_NAME returns information about another service which is completely unrelated. Could not recover by stopping then starting the job.

How do you reproduce it?

linuxoid69 commented 1 month ago

I've had this problem as well. Solved it with the correct Consul config.

My example:

# Full configuration options can be found at https://www.consul.io/docs/agent/options.html

# datacenter
datacenter = "dc-01"

# nodename
node_name = "node.inf-01.local"

# data_dir
data_dir = "/opt/consul"

# client_addr
client_addr = "0.0.0.0"

# ui
ui_config{
    enabled=true
}

# server
server = true

# Bind addr
bind_addr = "10.101.1.11"

# bootstrap_expect
bootstrap_expect=3

# retry_join
retry_join = ["node.inf-01.local", "node.inf-02.local", "node.inf-03.local"]
rejoin_after_leave = true

disable_update_check = true
encrypt =  "xxxxx"
log_rotate_bytes = 104857600
leave_on_terminate =  true
enable_script_checks = true

acl = {
    enabled = true
    enable_token_persistence = true
    default_policy = "deny"
    down_policy = "allow"

    tokens = {
        agent = "xxxx"
        initial_management = "xxxx"
        default = "xxxx"
    }
}

telemetry{
    prometheus_retention_time = "1s"
}

log_file = "/var/log/consul/log-file"

Jamesits commented 1 month ago

@grembo

How do you reproduce it?

My clients are distributed over 3 availability zones. One day the DNS server in one AZ seemed to fail for about half an hour, then recovered. After that I observed the bug. I don't know how to reproduce it on purpose.

blmhemu commented 1 month ago

Another way to repro is to run systemctl restart nomad on the client(s). When I run this, service templates get messed up.

tgross commented 1 month ago

Hi folks! I'm picking this issue back up. There's obviously a lot of history here, across multiple versions of Nomad, various configurations, and several attempts to fix it, so I want to make sure we're focused on the symptoms that are on current versions. The Nomad 1.5.x series is about to be out of support once Nomad 1.8.0 ships in the next few weeks, so I'm going to focus on the timeline starting from Nomad 1.6.1.

There are reports since Nomad 1.6.1 of services not being deregistered for both the Nomad and Consul providers. For Consul services, the general workflow of registering a service is as follows:

For the Nomad provider, the workflow is much simpler:

So there's enough difference in the implementation of these two providers that I'd like to break out discussion of each separately.


Consul provider

Several folks using the Consul provider since 1.6.1 have reported that their issue was solved via ensuring they have the correct Consul configuration. For example, @linuxoid69 wrote "Solved it with the correct consul config". Which is great, @linuxoid69! It would be really helpful if you could describe what part of this configuration you previously had missing / incorrect. That would help us figure out if there's something we're missing in our documentation for Nomad and/or Consul.

Nomad provider

We have a lot more reports of this issue for the Nomad provider since 1.6.1. Reported symptoms:


What's Next?

We have stronger evidence of a bug for the Nomad provider than we currently do for Consul. This is where I'm going to focus for the time being unless we get more data that any issues with Consul aren't a matter of configuration.

My next step is to develop a reliable minimal reproduction and then start building some hypotheses for the source of the problem.

How Can I Help?

At this point, there's no need to report "I see this too" with your version of Nomad and Consul. We can safely assume it impacts all supported versions at this point.

What may be helpful is if you have specific scenarios where you can reliably reproduce the issue and that haven't already been reported. In this case, you should include the version of Nomad, the version of Consul, which service provider you're using, and whatever agent configurations and job specifications are needed to trigger the issue. Also, if you see errors in the client logs around service deregistration, that would be helpful to add (especially the presence or absence of the "failed to delete service registration" log line).
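
If it helps with gathering those logs, something along these lines pulls the relevant client-side lines (assuming the client runs under systemd as the nomad unit; adjust the unit name and time window to your setup):

# grab recent deregistration-related log lines from the Nomad client
journalctl -u nomad --since "1 hour ago" | grep -Ei "deregist|failed to delete service registration"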

linuxoid69 commented 1 month ago

@tgross

I found my old config that didn't work. 😀

# Full configuration options can be found at https://www.consul.io/docs/agent/options.html

# datacenter
datacenter = "dc-01"

# nodename
node_name = "node.inf-01.local"

# data_dir
data_dir = "/opt/consul"

# client_addr
client_addr = "0.0.0.0"

# ui
ui_config{
    enabled=true
}

# server
server = true

# Bind addr
bind_addr = "10.101.1.11"

# bootstrap_expect
bootstrap_expect=3

# retry_join
retry_join = ["node.inf-01.local", "node.inf-02.local", "node.inf-03.local"]
rejoin_after_leave = true

disable_update_check = true
encrypt =  "xxxxx"
log_rotate_bytes = 104857600
leave_on_terminate =  true
enable_script_checks = true

acl_agent_token = "xxxx"
acl_token = "xxxx"
acl_agent_master_token = "xxxx"
acl_master_token = "xxxx"

acl_ttl = "5s"
acl_default_policy = "deny"
acl_down_policy = "allow"

acl{
    enabled = true
    default_policy = "deny"
    enable_token_persistence = true
}

telemetry{
    prometheus_retention_time = "1s"
}

log_file = "/var/log/consul/log-file"