hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Nomad 1.9.1 - job_gc not working #24387

Open ngcmac opened 5 days ago

ngcmac commented 5 days ago

Nomad version

1.9.1

Operating system and Environment details

Debian 12 (amd64) 1 server + 3 clients

Issue

After upgrading from 1.8.3 to 1.9.1 in our CI environment, it appears that job_gc has stopped working. Our config: "job_gc_interval": "1h", "job_gc_threshold": "120h"
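For reference, and assuming those settings live in the agent's server block (HCL form), they would look roughly like this:

server {
  job_gc_interval  = "1h"   # how often the job GC process runs
  job_gc_threshold = "120h" # how long a job must be terminal before it is eligible for collection
}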

Metrics:

[metrics screenshot]
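(For context, the job-status gauges behind a graph like this can also be read straight from the agent's metrics endpoint; the example below assumes the default HTTP address.)

$ curl -s localhost:4646/v1/metrics | \
    jq '.Gauges[] | select(.Name | contains("nomad.nomad.job_status")) | {Name, Value}'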

Reproduction steps

Upgrade to 1.9.1

Expected Result

job_gc keeps working as expected

Actual Result

job_gc not working

Job file (if appropriate)

Nomad Server logs (if appropriate)

I don't see anything relevant indicating a problem; GC seems to find objects to collect:

2024-11-07T13:50:36.155793+00:00 ci-hashicluster-0 nomad[2745402]:     2024-11-07T13:50:36.155Z [TRACE] worker: changed workload status: worker_id=04edb213-5bd4-dd4d-15b6-c4067489c076 from=WaitingForRaft to=Scheduling
2024-11-07T13:50:36.155853+00:00 ci-hashicluster-0 nomad[2745402]:     2024-11-07T13:50:36.155Z [DEBUG] core.sched: eval GC scanning before cutoff index: index=5913301 eval_gc_threshold=1h0m0s
2024-11-07T13:50:36.155920+00:00 ci-hashicluster-0 nomad[2745402]:     2024-11-07T13:50:36.155Z [DEBUG] core.sched: eval GC scanning before cutoff index: index=5908063 batch_eval_gc_threshold=24h0m0s
2024-11-07T13:50:36.160338+00:00 ci-hashicluster-0 nomad[2745402]:     2024-11-07T13:50:36.160Z [DEBUG] core.sched: eval GC found eligibile objects: evals=4 allocs=1
2024-11-07T13:50:36.168824+00:00 ci-hashicluster-0 nomad[2745402]:     2024-11-07T13:50:36.168Z [DEBUG] worker: ack evaluation: worker_id=04edb213-5bd4-dd4d-15b6-c4067489c076 eval_id=9706881b-9008-2cdc-1eac-68d7f8fe28b8 type=_core namespace=- job_id=eval-gc node_id="" triggered_by=scheduled
2024-11-07T13:50:36.169256+00:00 ci-hashicluster-0 nomad[2745402]:     2024-11-07T13:50:36.169Z [TRACE] worker: changed workload status: worker_id=04edb213-5bd4-dd4d-15b6-c4067489c076 from=Scheduling to=WaitingToDequeue
2024-11-07T13:50:36.169616+00:00 ci-hashicluster-0 nomad[2745402]:     2024-11-07T13:50:36.169Z [DEBUG] worker: dequeued evaluation: worker_id=04edb213-5bd4-dd4d-15b6-c4067489c076 eval_id=b22f27b2-a0ae-594c-b04b-55d77c767ff3 type=_core namespace=- job_id=deployment-gc node_id="" triggered_by=scheduled

Nomad Client logs (if appropriate)

jrasell commented 4 days ago

Hi @ngcmac and thanks for raising this issue. I have been unable to reproduce this within a local test setup using the steps you provided. Could you provide some additional information regarding the jobs, allocs, and similar objects that you are seeing a problem with?

Along with some base configuration, the agent included the following server block entries:

server {
  job_gc_interval  = "5m"
  job_gc_threshold = "5m"
}

I then used this dispatch jobspec and ran several instances of the job in short succession:

job "dispatch" {
  type = "batch"

  parameterized {}

  group "dispatch" {
    count = 1

    task "redis" {
      driver = "docker"
      config {
        image   = "busybox"
        command = "echo"
        args    = ["done"]
      }
    }
  }
}

$ nomad status
ID                                     Type                 Priority  Status   Submit Date
dispatch                               batch/parameterized  50        running  2024-11-08T08:28:02Z
dispatch/dispatch-1731054489-23cae725  batch                50        dead     2024-11-08T08:28:09Z
dispatch/dispatch-1731054491-2dc553b1  batch                50        dead     2024-11-08T08:28:11Z
dispatch/dispatch-1731054493-a4729c5f  batch                50        dead     2024-11-08T08:28:13Z
dispatch/dispatch-1731054494-fd522a3c  batch                50        dead     2024-11-08T08:28:14Z
dispatch/dispatch-1731054496-131ea0f8  batch                50        dead     2024-11-08T08:28:16Z
dispatch/dispatch-1731054497-ac5e037b  batch                50        dead     2024-11-08T08:28:17Z

Once the jobs had moved to the complete state, I restarted my agent with current main at c5249c6ca4dae2cf1f157e88545428ccfd6cc4a7 and waited for the automatic GC interval and thresholds to pass. Once they had, the job status list showed that the completed/dead jobs had been removed from the system:

$ nomad status
ID        Type                 Priority  Status   Submit Date
dispatch  batch/parameterized  50        running  2024-11-08T08:28:02Z

The summary metrics are also what I would expect:

$ curl -s localhost:4646/v1/metrics | jq '.Gauges | .[] | select(.Name | contains("nomad.nomad.job_status.dead")) |.Value '
0
$ curl -s localhost:4646/v1/metrics | jq '.Gauges | .[] | select(.Name | contains("nomad.nomad.job_status.running")) |.Value '
1
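As a side note, garbage collection can also be forced immediately rather than waiting for the interval, using the standard CLI command (run against the same agent; adjust the address via NOMAD_ADDR if it is not the default):

$ nomad system gc    # force an immediate server-side garbage collection
$ nomad status       # completed/dead dispatch jobs should be gone once collected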
shantanugadgil commented 4 days ago

@ngcmac on a hunch, are you using Prometheus to collect the stats?

In the past, I too have experienced "servers leaking memory".

Example: https://github.com/hashicorp/nomad/issues/18113
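(If so, the relevant agent setting is the telemetry stanza; a typical example is sketched below purely for context, and your exact options may differ.)

telemetry {
  prometheus_metrics         = true  # expose /v1/metrics in Prometheus format
  publish_allocation_metrics = true
  publish_node_metrics       = true
  collection_interval        = "1s"
}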

just-a-thought