hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Some services are being registered all the time #18538

Open jorgemarey opened 1 year ago

jorgemarey commented 1 year ago

Nomad version

Nomad v1.5.6

Issue

We had a problem where we saw some Consul services flapping to critical for a moment (less than a second) and then becoming healthy again. While investigating this, we saw that Nomad is re-registering the services constantly.

In the client logs we can see:

2023-09-20T10:04:07.331Z [TRACE] consul.sync: execute sync: reason=periodic
2023-09-20T10:04:07.406Z [TRACE] consul.sync: registrations different: id=_nomad-task-6de92b6e-9099-654a-00f2-93ac2b255142-xxxx-8080 field=connect_sidecar wanted="\"xxxx\"" existing="\"xxxx\""
2023-09-20T10:04:07.406Z [TRACE] consul.sync: must register service: id=_nomad-task-6de92b6e-9099-654a-00f2-93ac2b255142-group-xxxx-8080 exists=true reason=periodic
2023-09-20T10:04:07.717Z [DEBUG] consul.sync: sync complete: registered_services=1 deregistered_services=0 registered_checks=0 deregistered_checks=1

We found that this happens with all services that have upstreams configured. The problem occurs here: when the mesh gateway mode is left empty in Nomad, Consul sets it to local, so when Nomad compares the value in the job with the value in Consul, it finds a difference and tries to register the service again.
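
For illustration, the group service in question looks roughly like this (a sketch with placeholder names, not our exact job); the upstream has no mesh_gateway block, so Nomad keeps comparing an empty mode against the "local" that Consul resolved:

service {
  name = "xxxx"
  port = "http"

  connect {
    sidecar_service {
      proxy {
        upstreams {
          destination_name = "yyyy"
          local_bind_port  = 8080
          # no mesh_gateway block here: Nomad submits an empty mode, Consul
          # resolves it to "local", and every periodic sync then sees a
          # difference and re-registers the service
        }
      }
    }
  }
}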

Related to this, we modified the check interval on that service and saw that, while the service is being re-registered, the old and new checks are both registered for a moment before the old check is removed again by the sync process:

  "_nomad-check-01300b3da45ff5b95964e40301deb8058883665d": {
    "Node": "wrkr-ec26d795-8a28",
    "CheckID": "_nomad-check-01300b3da45ff5b95964e40301deb8058883665d",
    "Name": "xxxx-http",
    "Status": "passing",
    "Notes": "",
    "Output": "HTTP GET http://10.15.36.66:21795/_service/v0/health: 200  Output: {\n  \"status\" : \"UP\"\n}",
    "ServiceID": "_nomad-task-6de92b6e-9099-654a-00f2-93ac2b255142-group-xxxx-8080",
    "ServiceName": "xxxx",
    "ServiceTags": [],
    "Type": "http",
    "Interval": "30s",
    "Timeout": "5s",
    "ExposedPort": 0,
    "Definition": {},
    "Partition": "default",
    "Namespace": "default",
    "CreateIndex": 0,
    "ModifyIndex": 0
  },
  "_nomad-check-036f17f519b332a2439eb8b5c240f3e873273cdd": {
    "Node": "wrkr-ec26d795-8a28",
    "CheckID": "_nomad-check-036f17f519b332a2439eb8b5c240f3e873273cdd",
    "Name": "xxxx-http",
    "Status": "critical",
    "Notes": "",
    "Output": "",
    "ServiceID": "_nomad-task-6de92b6e-9099-654a-00f2-93ac2b255142-group-xxxx-8080",
    "ServiceName": "xxxx",
    "ServiceTags": [],
    "Type": "http",
    "Interval": "10s",
    "Timeout": "2s",
    "ExposedPort": 0,
    "Definition": {},
    "Partition": "default",
    "Namespace": "default",
    "CreateIndex": 0,
    "ModifyIndex": 0
  },
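
For context, the change that produced the two entries above was roughly this (a reconstruction from the JSON, with placeholder names):

service {
  name = "xxxx"
  port = "http"

  check {
    name     = "xxxx-http"
    type     = "http"
    path     = "/_service/v0/health"
    interval = "10s" # was "30s"
    timeout  = "2s"  # was "5s"
  }
}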

Reproduction steps

For the second problem

apollo13 commented 1 year ago

I am seeing the same at the trace level, happening every 30 seconds:

2023-09-29T22:09:01.139+0200 [TRACE] consul.sync: registrations different: id=_nomad-task-03b36744-c471-b2db-7eed-9e56ddbd6c08-group-vector-vector-8686 field=connect_sidecar wanted="\"vector\"" existing="\"vector\""
2023-09-29T22:09:01.139+0200 [TRACE] consul.sync: must register service: id=_nomad-task-03b36744-c471-b2db-7eed-9e56ddbd6c08-group-vector-vector-8686 exists=true reason=periodic
2023-09-29T22:09:01.144+0200 [TRACE] consul.sync: registrations different: id=_nomad-task-6e9302b4-ffbd-1a72-fa15-99b0fe556c2a-group-kibana-kibana-5601 field=connect_sidecar wanted="\"kibana\"" existing="\"kibana\""
2023-09-29T22:09:01.144+0200 [TRACE] consul.sync: must register service: id=_nomad-task-6e9302b4-ffbd-1a72-fa15-99b0fe556c2a-group-kibana-kibana-5601 exists=true reason=periodic
2023-09-29T22:09:01.153+0200 [DEBUG] consul.sync: sync complete: registered_services=2 deregistered_services=0 registered_checks=0 deregistered_checks=0

Interestingly enough, this only seems to happen for 2 out of 4 services with sidecars on the host. What information can I provide to aid debugging?

lgfa29 commented 1 year ago

Thanks for the report @jorgemarey.

I was able to reproduce the first issue (but not the second so far). Curiously, in my case the diff was caused by DestinationType. Looking through Consul's docs, I think DestinationType and MeshGateway.Mode are two fields that could cause problems because they have a Consul default that Nomad is not aware of.

In the case of MeshGateway.Mode it can be even trickier because an empty string in the job means that Consul will look for the actual value in several places, including service defaults, which Nomad is not aware of. Do you have any service default that could be affecting this value?

https://github.com/hashicorp/nomad/pull/18692 helps the situation a bit by allowing these two values to be empty, meaning Nomad will just accept whatever is set in Consul as correct.

@apollo13 unfortunately we don't have very descriptive logs for this diff logic. I'm assuming @jorgemarey patched the Nomad code to drill down into exactly which field was the problem 😅

One thing you could try is to compare these services' definitions in your job with what's in the Consul agent, as reported by the /v1/agent/services endpoint.

In my case I noticed that Consul had DestinationType set to service, so I updated my service definition to match and the sync stopped:

service {
  name = "nginx"
  port = "http"

  connect {
    sidecar_service {
      proxy {
        upstreams {
          destination_name = "whoami"
+         destination_type = "service"
          local_bind_port  = 8080
        }
      }
    }
  }
}
apollo13 commented 1 year ago

@lgfa29 Thank you, adding destination_type = "service" fixed it here as well. I will go through all hosts in the cluster and see if that is all that is needed or if there are other fields missing.

EDIT:// Funny story though: I looked through the nodes of a test cluster and only one job has this issue (and not the same job as last time). So I wonder if, depending on how/when and against which Consul version the job got registered, it might have worked? Or maybe it depends on the moon…

jorgemarey commented 1 year ago

> Thanks for the report @jorgemarey.
>
> I was able to reproduce the first issue (but not the second so far). Curiously, in my case the diff was caused by DestinationType. Looking through Consul's docs, I think DestinationType and MeshGateway.Mode are two fields that could cause problems because they have a Consul default that Nomad is not aware of.
>
> In the case of MeshGateway.Mode it can be even trickier because an empty string in the job means that Consul will look for the actual value in several places, including service defaults, which Nomad is not aware of. Do you have any service default that could be affecting this value?

Yeah, we have a Consul proxy-defaults config with:

    MeshGateway = {
      Mode = "local"
    }

I guess that's what caused this.
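
The full config entry is roughly this (a sketch; Kind and Name are the standard fields for a proxy-defaults config entry):

Kind = "proxy-defaults"
Name = "global"

MeshGateway = {
  Mode = "local"
}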

> https://github.com/hashicorp/nomad/pull/18692 helps the situation a bit by allowing these two values to be empty, meaning Nomad will just accept whatever is set in Consul as correct.

Looking at that PR, there is a situation where I don't know what will happen: if someone has a value in the Nomad configuration and then deletes it, will this change the service in Consul only the first time the job is launched, but avoid changing it constantly afterwards? If Nomad doesn't register the service when that value is empty, the value won't change in Consul, even though the user may expect it to change.

> @apollo13 unfortunately we don't have very descriptive logs for this diff logic. I'm assuming @jorgemarey patched the Nomad code to drill down into exactly which field was the problem 😅

Yep, I patched the code to output where Nomad and Consul had the diff.

Regarding the second issue: the problem is that the check is only registered in Consul for a few milliseconds before it's removed again by the synchronization process. When the synchronization starts, Nomad registers the service with both checks, because it sees a difference between the service it knows about and the one in Consul; later in that same process, Nomad notices a check that shouldn't be there and removes it, so the stale check only lives in Consul for a few milliseconds. I managed to notice it by making blocking queries to the Consul agent API, and in the Nomad logs I could constantly see:

2023-09-20T10:04:07.717Z [DEBUG] consul.sync: sync complete: registered_services=1 deregistered_services=0 registered_checks=0 deregistered_checks=1

I'll try to reproduce it myself and provide some information about how to do it if I manage to.

lgfa29 commented 1 year ago

> If someone has a value in the Nomad configuration and then deletes it, will this change the service in Consul only the first time the job is launched, but avoid changing it constantly afterwards? If Nomad doesn't register the service when that value is empty, the value won't change in Consul, even though the user may expect it to change.

Hum... yeah, I think you're right. To make things worse, I think the service will not be updated at all, because that same difference method is also used to detect when a job update needs to be synced.

I think we're lacking expressiveness here: an empty string can mean "whatever Consul says it should be", in which case we shouldn't sync, but what users usually intend with an empty value is for it to be set to Consul's default. So I think I will need to go back to the drawing board on this one 🤔

As a workaround for now, to avoid the unnecessary syncs you can set all the expected values directly in the job; more specifically, destination_type and mesh_gateway.mode, as in the sketch below.
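
Something like this (a sketch reusing the names from my example above; set the mode to whatever your Consul configuration actually resolves):

upstreams {
  destination_name = "whoami"
  destination_type = "service" # Consul's default destination type; pin it explicitly
  local_bind_port  = 8080

  mesh_gateway {
    mode = "local" # match the value Consul resolves, e.g. from proxy-defaults
  }
}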

dani commented 9 months ago

destination_type and mesh_gateway.mode are not the only fields affected. When an upstream is declared as an HTTP service (Protocol = "http" in its service-defaults), we must also add this to the job description (in my example, immich-ml is the upstream, an HTTP service, and immich is the consumer of this service):

          proxy {
            upstreams {
              destination_name = "immich-ml"
              destination_type = "service"
              local_bind_port  = 3003
              config {
                protocol = "http"
              }
            }
          }

Without this, the Nomad agent also re-syncs the service in Consul every 30 seconds:

févr. 12 13:15:32 ct-w-1 nomad[2691136]:     2024-02-12T13:15:32.910+0100 [TRACE] consul.sync: must register service: id=_nomad-task-c5309ab5-1650-63be-9801-2f9caec34ab8-group-immich-immich-3001 exists=true reason=periodic
févr. 12 13:15:33 ct-w-1 nomad[2691136]:     2024-02-12T13:15:33.092+0100 [DEBUG] consul.sync: sync complete: registered_services=1 deregistered_services=0 registered_checks=0 deregistered_checks=0
févr. 12 13:16:03 ct-w-1 nomad[2691136]:     2024-02-12T13:16:03.092+0100 [TRACE] consul.sync: execute sync: reason=periodic
févr. 12 13:16:03 ct-w-1 nomad[2691136]:     2024-02-12T13:16:03.107+0100 [TRACE] consul.sync: registrations different: id=_nomad-task-c5309ab5-1650-63be-9801-2f9caec34ab8-group-immich-immich-3001 field=connect_sidecar wanted="\"immich\"" existing="\"immich\""

This is a pain to maintain :-(
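
Putting the workarounds from this thread together, every field that Consul resolves to a default currently has to be pinned explicitly in the upstream (a sketch based on my example above; the values must match whatever your Consul cluster actually resolves):

proxy {
  upstreams {
    destination_name = "immich-ml"
    destination_type = "service" # Consul's default destination type
    local_bind_port  = 3003

    mesh_gateway {
      mode = "local" # whatever your proxy-defaults / service-defaults resolve to
    }

    config {
      protocol = "http" # must match the upstream's service-defaults Protocol
    }
  }
}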