ngcmac opened this issue 5 days ago
Hi @ngcmac and thanks for raising this issue. I have been unable to reproduce this within a local test setup using the steps you provided. Could you provide some additional information regarding the jobs, allocs, and similar objects that you are seeing a problem with?
Alongside some base configuration, my agent included the following server block entries:
server {
  job_gc_interval  = "5m"
  job_gc_threshold = "5m"
}
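With both values set to "5m", the job GC process runs every five minutes and collects any job that has been in a terminal state for at least five minutes, which keeps the test loop short.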
I then used this dispatch jobspec and dispatched several instances of the job in short succession (example commands follow the jobspec):
job "dispatch" {
type = "batch"
parameterized {}
group "dispatch" {
count = 1
task "redis" {
driver = "docker"
config {
image = "busybox"
command = "echo"
args = ["done"]
}
}
}
}
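For reference, the instances were registered and dispatched from the CLI; the loop below is a minimal sketch of that submission pattern, where dispatch.nomad.hcl is simply a placeholder filename for the jobspec above:

# register the parameterized job, then dispatch six instances roughly 2s apart
$ nomad job run dispatch.nomad.hcl
$ for i in $(seq 1 6); do nomad job dispatch dispatch; sleep 2; done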
$ nomad status
ID                                     Type                 Priority  Status   Submit Date
dispatch                               batch/parameterized  50        running  2024-11-08T08:28:02Z
dispatch/dispatch-1731054489-23cae725  batch                50        dead     2024-11-08T08:28:09Z
dispatch/dispatch-1731054491-2dc553b1  batch                50        dead     2024-11-08T08:28:11Z
dispatch/dispatch-1731054493-a4729c5f  batch                50        dead     2024-11-08T08:28:13Z
dispatch/dispatch-1731054494-fd522a3c  batch                50        dead     2024-11-08T08:28:14Z
dispatch/dispatch-1731054496-131ea0f8  batch                50        dead     2024-11-08T08:28:16Z
dispatch/dispatch-1731054497-ac5e037b  batch                50        dead     2024-11-08T08:28:17Z
Once the jobs had moved to the complete state, I restarted my agent on current main at c5249c6ca4dae2cf1f157e88545428ccfd6cc4a7 and waited for the automatic GC interval and threshold to pass. Once they had, the job status list showed that the completed/dead jobs had been removed from state:
$ nomad status
ID        Type                 Priority  Status   Submit Date
dispatch  batch/parameterized  50        running  2024-11-08T08:28:02Z
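As an aside for anyone reproducing this: instead of waiting for the interval to elapse, a garbage collection run can also be forced from the CLI, which is convenient when testing with short thresholds:

$ nomad system gc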
The summary metrics are also what I would expect:
$ curl -s localhost:4646/v1/metrics | jq '.Gauges | .[] | select(.Name | contains("nomad.nomad.job_status.dead")) |.Value '
0
$ curl -s localhost:4646/v1/metrics | jq '.Gauges | .[] | select(.Name | contains("nomad.nomad.job_status.running")) |.Value '
1
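For completeness, every job_status gauge can be inspected in a single pass with something along these lines:

$ curl -s localhost:4646/v1/metrics | jq '.Gauges[] | select(.Name | startswith("nomad.nomad.job_status")) | {Name, Value}'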
@ngcmac on a hunch, are you using Prometheus to collect the stats?
In the past, I too have experienced "servers leaking memory"; see https://github.com/hashicorp/nomad/issues/18113 for an example.
Just a thought.
Nomad version
1.9.1
Operating system and Environment details
Debian 12 (amd64), 1 server + 3 clients
Issue
After the upgrade from 1.8.3 to 1.9.1 in our CI environment, it seems that job_gc is not working. Our config:
"job_gc_interval": "1h"
"job_gc_threshold": "120h"
Metrics: (screenshot attached in the original issue; not reproduced here)
Reproduction steps
Upgrade to 1.9.1
Expected Result
job_gc keeps working as it did on 1.8.3, removing dead jobs once they pass the threshold
Actual Result
job_gc does not remove dead jobs after the upgrade
Job file (if appropriate)
Nomad Server logs (if appropriate)
I don't see anything relevant indicating a problem; the GC seems to find objects to collect.
Nomad Client logs (if appropriate)