hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

worker.service_sched: processing eval panicked scheduler - please report this as a bug #20157

Open dpotapov opened 6 months ago

dpotapov commented 6 months ago

Nomad version

Nomad v1.7.5 BuildDate 2024-02-13T15:10:13Z Revision 5f5d4646198d09b8f4f6cb90fb5d50b53fa328b8

Operating system and Environment details

RHEL 9.3

Issue

Evaluations for the job are failing:

{
  "priority": 50,
  "type": "service",
  "triggeredBy": "failed-follow-up",
  "status": "failed",
  "statusDescription": "evaluation reached delivery limit (3)",
  "failedTGAllocs": [],
  "previousEval": "34ab318b-a04d-a62b-48cc-604e265e4573",
  "nextEval": "15f7b8cf-8091-5abf-f3ed-28517af63b7a",
  "blockedEval": null,
  "modifyIndex": 39197810,
  "modifyTime": "2024-03-18T15:41:53.513Z",
  "createIndex": 39197798,
  "createTime": "2024-03-18T15:38:30.948Z",
  "waitUntil": null,
  "namespace": "default",
  "plainJobId": "exec-job",
  "relatedEvals": [
    "15f7b8cf-8091-5abf-f3ed-28517af63b7a",
    "34ab318b-a04d-a62b-48cc-604e265e4573",
    "70e22606-6d2c-b44f-8062-b3a7b5f7ca69",
    "6fa29ce2-9e16-2420-039e-7b5f8a4cd466",
    "9f1bfb6c-4984-9b6d-384e-2defd5f1a574",
    "7d362010-c877-74c9-56fc-b7b842688409",
    "cc10963b-9de1-8d6b-ec1c-eaabf0f3497a",
    "6503ee4a-5b86-d742-4597-cea96f18582e"
  ],
  "job": "[\"exec-job\",\"default\"]",
  "node": null
}

Reproduction steps

The Nomad cluster was updated to v1.7.5.

Expected Result

Jobs are evaluated and running

Actual Result

Jobs are never started

Job file (if appropriate)

Pretty much any job won't start.

Nomad Server logs (if appropriate)

    2024-03-18T15:25:14.668Z [ERROR] worker.service_sched: processing eval panicked scheduler - please report this as a bug!: eval_id=9f1bfb6c-4984-9b6d-384e-2defd5f1a574 job_id=exec-job namespace=default worker_id=0c9215c7-515a-eb81-7b10-a11f8abda944 eval_id=9f1bfb6c-4984-9b6d-384e-2defd5f1a574 error="runtime error: invalid memory address or nil pointer dereference"
  stack_trace=
  | goroutine 83 [running]:
  | runtime/debug.Stack()
  | \truntime/debug/stack.go:24 +0x5e
  | github.com/hashicorp/nomad/scheduler.(*GenericScheduler).Process.func1()
  | \tgithub.com/hashicorp/nomad/scheduler/generic_sched.go:153 +0x58
  | panic({0x2a88140?, 0x4f5ea50?})
  | \truntime/panic.go:914 +0x21f
  | github.com/hashicorp/nomad/client/lib/numalib.(*Topology).UsableCores(...)
  | \tgithub.com/hashicorp/nomad/client/lib/numalib/topology.go:258
  | github.com/hashicorp/nomad/nomad/structs.(*NodeResources).Comparable(0xc001108c80)
  | \tgithub.com/hashicorp/nomad/nomad/structs/structs.go:3185 +0xcc
  | github.com/hashicorp/nomad/scheduler.(*Preemptor).SetNode(0xc0029c48f0, 0xc00cc18000)
  | \tgithub.com/hashicorp/nomad/scheduler/preemption.go:139 +0x36
  | github.com/hashicorp/nomad/scheduler.(*BinPackIterator).Next(0xc00c176a80)
  | \tgithub.com/hashicorp/nomad/scheduler/rank.go:274 +0x74d
  | github.com/hashicorp/nomad/scheduler.(*JobAntiAffinityIterator).Next(0xc00b367bd0)
  | \tgithub.com/hashicorp/nomad/scheduler/rank.go:624 +0x6b
  | github.com/hashicorp/nomad/scheduler.(*NodeReschedulingPenaltyIterator).Next(0xc00e4384e0)
  | \tgithub.com/hashicorp/nomad/scheduler/rank.go:685 +0x28
  | github.com/hashicorp/nomad/scheduler.(*NodeAffinityIterator).Next(0xc00b367c20)
  | \tgithub.com/hashicorp/nomad/scheduler/rank.go:757 +0x30
  | github.com/hashicorp/nomad/scheduler.(*SpreadIterator).Next(0xc00c176af0)
  | \tgithub.com/hashicorp/nomad/scheduler/spread.go:131 +0x33
  | github.com/hashicorp/nomad/scheduler.(*PreemptionScoringIterator).Next(0xc02e7cace0)
  | \tgithub.com/hashicorp/nomad/scheduler/rank.go:852 +0x28
  | github.com/hashicorp/nomad/scheduler.(*ScoreNormalizationIterator).Next(0xc02e7cad20)
  | \tgithub.com/hashicorp/nomad/scheduler/rank.go:816 +0x28
  | github.com/hashicorp/nomad/scheduler.(*LimitIterator).nextOption(0xc008a79aa0)
  | \tgithub.com/hashicorp/nomad/scheduler/select.go:63 +0x24
  | github.com/hashicorp/nomad/scheduler.(*LimitIterator).Next(0xc008a79aa0)
  | \tgithub.com/hashicorp/nomad/scheduler/select.go:42 +0x26
  | github.com/hashicorp/nomad/scheduler.(*MaxScoreIterator).Next(0xc00e438570)
  | \tgithub.com/hashicorp/nomad/scheduler/select.go:105 +0x3e
  | github.com/hashicorp/nomad/scheduler.(*GenericStack).Select(0xc0262d92b0, 0xc00c062b40, 0xc0029c5530)
  | \tgithub.com/hashicorp/nomad/scheduler/stack.go:192 +0xe8f
  | github.com/hashicorp/nomad/scheduler.(*GenericScheduler).selectNextOption(0xc00985c000, 0x38264a0?, 0xc0029c5530)
  | \tgithub.com/hashicorp/nomad/scheduler/generic_sched.go:898 +0x2d
  | github.com/hashicorp/nomad/scheduler.(*GenericScheduler).computePlacements(0xc00985c000, {0x526ef20, 0x0, 0x0}, {0xc00b5c5740, 0x1, 0x1}, 0x0?)
  | \tgithub.com/hashicorp/nomad/scheduler/generic_sched.go:602 +0xa47
  | github.com/hashicorp/nomad/scheduler.(*GenericScheduler).computeJobAllocs(0xc00985c000)
  | \tgithub.com/hashicorp/nomad/scheduler/generic_sched.go:469 +0x14da
  | github.com/hashicorp/nomad/scheduler.(*GenericScheduler).process(0xc00985c000)
  | \tgithub.com/hashicorp/nomad/scheduler/generic_sched.go:289 +0x49a
  | github.com/hashicorp/nomad/scheduler.retryMax(0x5, 0xc0029c5d20, 0xc0029c5d10)
  | \tgithub.com/hashicorp/nomad/scheduler/util.go:96 +0x49
  | github.com/hashicorp/nomad/scheduler.(*GenericScheduler).Process(0xc00985c000, 0xc01c1d7680)
  | \tgithub.com/hashicorp/nomad/scheduler/generic_sched.go:188 +0x55f
  | github.com/hashicorp/nomad/nomad.(*Worker).invokeScheduler(0xc008e70ee0, 0xc0110d1e60, 0xc01c1d7680, {0xc02005da10, 0x24})
  | \tgithub.com/hashicorp/nomad/nomad/worker.go:634 +0x353
  | github.com/hashicorp/nomad/nomad.(*Worker).run(0xc008e70ee0, 0x12a05f200)
  | \tgithub.com/hashicorp/nomad/nomad/worker.go:463 +0x5a5
  | created by github.com/hashicorp/nomad/nomad.(*Worker).Start in goroutine 1
  | \tgithub.com/hashicorp/nomad/nomad/worker.go:162 +0x59

Nomad Client logs (if appropriate)

N/A

shoenig commented 6 months ago

@dpotapov what version of Nomad are you upgrading from?

And can you describe more about the runtime environment (like are you running clients in a VM? or what architecture? etc.)

dpotapov commented 6 months ago

From v1.1.4. Servers and clients are amd64 VMs.

I guess updating the Nomad version on the clients should help...
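
For readers following the stack trace above: the panic surfaces in `(*Topology).UsableCores`, called from `(*NodeResources).Comparable`. One plausible reading, offered as an assumption rather than a confirmed diagnosis, is that nodes registered by much older clients carry no topology data, so the scheduler calls a method on a nil pointer. The Go sketch below reproduces that failure mode with simplified, hypothetical stand-in types (`Topology`, `NodeResources`, `usableCores`, and `panics` are illustrative names, not Nomad's actual code) and shows the kind of nil guard that avoids it.

```go
package main

import "fmt"

// Topology is a simplified, hypothetical stand-in for a node's CPU topology.
// It is NOT Nomad's numalib.Topology.
type Topology struct {
	Cores []int
}

// UsableCores reads a field through its receiver, so calling it on a nil
// *Topology panics with "invalid memory address or nil pointer dereference".
func (t *Topology) UsableCores() []int {
	return t.Cores
}

// NodeResources is a hypothetical stand-in for the resources a client reports.
// Assumption: a client that predates the topology field leaves Topology nil.
type NodeResources struct {
	Topology *Topology
}

// usableCores is a guarded accessor: it returns nil instead of panicking
// when the node carries no topology data.
func usableCores(r *NodeResources) []int {
	if r == nil || r.Topology == nil {
		return nil
	}
	return r.Topology.UsableCores()
}

// panics reports whether calling f results in a panic.
func panics(f func()) (panicked bool) {
	defer func() {
		if recover() != nil {
			panicked = true
		}
	}()
	f()
	return false
}

func main() {
	// A node as an old client might have registered it: no topology set.
	old := &NodeResources{}

	fmt.Println("unguarded call panics:", panics(func() {
		_ = old.Topology.UsableCores()
	}))
	fmt.Println("guarded call returns:", usableCores(old))
}
```

If this reading is right, upgrading the clients so they report topology data, as suggested above, would sidestep the panic; a server-side guard like `usableCores` is the defensive alternative.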