hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/
Other
14.91k stars 1.95k forks source link

`nomad.nomad.blocked_evals.total_blocked` metric reports incorrect count #17722

Open marvinchin opened 1 year ago

marvinchin commented 1 year ago

Nomad version

1.5.5

Operating system and Environment details

Unix

Issue

The reported value of the nomad.nomad.blocked_evals.total_blocked metric becomes incorrect when a blocked evaluation for a system job is unblocked due to a node update. In particular, we've observed cases where the metric reported a negative count of blocked evaluations on the cluster which cannot be true.

The effect of this is that it's hard to rely on this metric to monitor the health/capacity of the cluster.

Reproduction steps

# Start a server
$ ./usr/bin/nomad agent -config server.conf
...

# Start a client
$ ./usr/bin/nomad agent -config client-1.conf
...

# Submit a job which requests for static port 10000 (doesn't have to be system)
$ cat networks.nomad
job "example" {
  datacenters = ["dc1"]
  type        = "system"

  group "test-group" {
    task "test-task" {
      driver = "raw_exec"

      config {
        command = "/usr/bin/sleep"
        args    = ["inf"]
      }

      kill_timeout = "1h"
    }

    network {
      port "test" {
        static = 10000
      }
    }
  }
}
$ ./usr/bin/nomad job run networks.nomad
==> 2023-06-23T15:02:06+08:00: Monitoring evaluation "b6c002c3"
    2023-06-23T15:02:06+08:00: Evaluation triggered by job "example"
    2023-06-23T15:02:07+08:00: Allocation "ecba59ef" created: node "dc25c96f", group "test-group"
    2023-06-23T15:02:07+08:00: Evaluation status changed: "pending" -> "complete"
==> 2023-06-23T15:02:07+08:00: Evaluation "b6c002c3" finished with status "complete"

# Submit a system job which also requests for static port 10000, and see that the evaluation is blocked
$ cat networks-2.nomad
job "example-2" {
  datacenters = ["dc1"]
  type        = "system"

  group "test-group" {
    task "test-task" {
      driver = "raw_exec"

      config {
        command = "/usr/bin/sleep"
        args    = ["inf"]
      }

      kill_timeout = "1h"
    }

    network {
      port "test" {
        static = 10000
      }
    }
  }
}
$ ./usr/bin/nomad job run networks-2.nomad
==> 2023-06-23T15:02:16+08:00: Monitoring evaluation "44f61301"
    2023-06-23T15:02:16+08:00: Evaluation triggered by job "example-2"
    2023-06-23T15:02:17+08:00: Evaluation status changed: "pending" -> "complete"
==> 2023-06-23T15:02:17+08:00: Evaluation "44f61301" finished with status "complete" but failed to place all allocations:
    2023-06-23T15:02:17+08:00: Task Group "test-group" (failed to place 1 allocation):

# Make the client ineligible and eligible again (I did this twice)
$ ./usr/bin/nomad node eligibility -disable dc25c96f
Node "dc25c96f-bff8-fe9c-f352-b912c350fe03" scheduling eligibility set: ineligible for scheduling
$ ./usr/bin/nomad node eligibility -enable dc25c96f
Node "dc25c96f-bff8-fe9c-f352-b912c350fe03" scheduling eligibility set: eligible for scheduling
$ ./usr/bin/nomad node eligibility -disable dc25c96f
Node "dc25c96f-bff8-fe9c-f352-b912c350fe03" scheduling eligibility set: ineligible for scheduling
$ ./usr/bin/nomad node eligibility -enable dc25c96f
Node "dc25c96f-bff8-fe9c-f352-b912c350fe03" scheduling eligibility set: eligible for scheduling

# See that the total_blocked metric reports a negative value, which is clearly wrong 
$ ./usr/bin/nomad operator api /v1/metrics | jq | grep blocked_evals.total_blocked -C 2
        "host": "$HOSTNAME"
      },
      "Name": "nomad.nomad.blocked_evals.total_blocked",
      "Value": -1
    },

# Compare it against the actual state of the cluster to be sure
$ ./usr/bin/nomad eval list
ID        Priority  Triggered By   Job ID     Namespace  Node ID   Status    Placement Failures
7dbf0548  50        queued-allocs  example-2  default    dc25c96f  blocked   N/A - In Progress
de7496b0  50        queued-allocs  example-2  default    dc25c96f  canceled  true
6129a784  50        node-update    example    default    dc25c96f  complete  false
89137480  50        node-update    example-2  default    dc25c96f  complete  true
30216c5a  50        queued-allocs  example-2  default    dc25c96f  complete  true
c2683219  50        queued-allocs  example-2  default    dc25c96f  canceled  true
5ae49553  50        node-update    example    default    dc25c96f  complete  false
b9346c86  50        node-update    example-2  default    dc25c96f  complete  true
0bd04195  50        queued-allocs  example-2  default    dc25c96f  complete  true
44f61301  50        job-register   example-2  default    <none>    complete  true
b6c002c3  50        job-register   example    default    <none>    complete  false

Suspected Root Cause

I attached delve to the server process to try and figure out where the accounting might have gone wrong:

Delve script and output ``` $ cat delve.script trace blocked_evals_stats.go:42 on 1 print "block" on 1 print eval.ID on 1 bt trace blocked_evals_stats.go:50 on 2 print "unblock" on 2 print eval.ID on 2 bt $ dlv attach 50485 --init <(cat delve.script) Type 'help' for list of commands. Tracepoint 1 set at 0x1d6d0a6 for github.com/hashicorp/nomad/nomad.(*BlockedStats).Block() github.com/hashicorp/nomad/nomad/blocked_evals_stats.go:42 Tracepoint 2 set at 0x1d6d146 for github.com/hashicorp/nomad/nomad.(*BlockedStats).Unblock() github.com/hashicorp/nomad/nomad/blocked_evals_stats.go:50 (dlv) continue > goroutine(131): github.com/hashicorp/nomad/nomad.(*BlockedStats).Unblock(("*github.com/hashicorp/nomad/nomad.BlockedStats")(0xc000b54160), ("*github.com/hashicorp/nomad/nomad/structs.Evaluation")(0xc000003c80)) "unblock": "unblock" eval.ID: "30216c5a-d608-d321-d41b-843defbb1281" Stack: 0 0x0000000001d6d146 in github.com/hashicorp/nomad/nomad.(*BlockedStats).Unblock at github.com/hashicorp/nomad/nomad/blocked_evals_stats.go:50 1 0x0000000001d6a60b in github.com/hashicorp/nomad/nomad.(*BlockedEvals).UnblockNode at github.com/hashicorp/nomad/nomad/blocked_evals.go:515 2 0x0000000001db51ba in github.com/hashicorp/nomad/nomad.(*nomadFSM).applyNodeEligibilityUpdate at github.com/hashicorp/nomad/nomad/fsm.go:534 3 0x0000000001db20a5 in github.com/hashicorp/nomad/nomad.(*nomadFSM).Apply at github.com/hashicorp/nomad/nomad/fsm.go:276 4 0x0000000001080735 in github.com/hashicorp/raft.(*Raft).runFSM.func1 at github.com/hashicorp/raft@v1.3.11/fsm.go:98 5 0x000000000108048f in github.com/hashicorp/raft.(*Raft).runFSM.func2 at github.com/hashicorp/raft@v1.3.11/fsm.go:121 6 0x000000000107f6fc in github.com/hashicorp/raft.(*Raft).runFSM at github.com/hashicorp/raft@v1.3.11/fsm.go:237 7 0x00000000010a4986 in github.com/hashicorp/raft.(*Raft).runFSM-fm at :1 8 0x00000000010a1222 in github.com/hashicorp/raft.(*raftState).goFunc.func1 at github.com/hashicorp/raft@v1.3.11/state.go:146 9 0x0000000000475041 in runtime.goexit at runtime/asm_amd64.s:1598 Stack: 0 0x0000000001d6d146 in github.com/hashicorp/nomad/nomad.(*BlockedStats).Unblock at github.com/hashicorp/nomad/nomad/blocked_evals_stats.go:50 1 0x0000000001d6a60b in github.com/hashicorp/nomad/nomad.(*BlockedEvals).UnblockNode at github.com/hashicorp/nomad/nomad/blocked_evals.go:515 2 0x0000000001db51ba in github.com/hashicorp/nomad/nomad.(*nomadFSM).applyNodeEligibilityUpdate at github.com/hashicorp/nomad/nomad/fsm.go:534 3 0x0000000001db20a5 in github.com/hashicorp/nomad/nomad.(*nomadFSM).Apply at github.com/hashicorp/nomad/nomad/fsm.go:276 4 0x0000000001080735 in github.com/hashicorp/raft.(*Raft).runFSM.func1 at github.com/hashicorp/raft@v1.3.11/fsm.go:98 5 0x000000000108048f in github.com/hashicorp/raft.(*Raft).runFSM.func2 at github.com/hashicorp/raft@v1.3.11/fsm.go:121 6 0x000000000107f6fc in github.com/hashicorp/raft.(*Raft).runFSM at github.com/hashicorp/raft@v1.3.11/fsm.go:237 7 0x00000000010a4986 in github.com/hashicorp/raft.(*Raft).runFSM-fm at :1 8 0x00000000010a1222 in github.com/hashicorp/raft.(*raftState).goFunc.func1 at github.com/hashicorp/raft@v1.3.11/state.go:146 9 0x0000000000475041 in runtime.goexit at runtime/asm_amd64.s:1598 > goroutine(95): github.com/hashicorp/nomad/nomad.(*BlockedStats).Unblock(("*github.com/hashicorp/nomad/nomad.BlockedStats")(0xc000b54160), ("*github.com/hashicorp/nomad/nomad/structs.Evaluation")(0xc000003c80)) "unblock": "unblock" eval.ID: "30216c5a-d608-d321-d41b-843defbb1281" Stack: 0 0x0000000001d6d146 in github.com/hashicorp/nomad/nomad.(*BlockedStats).Unblock at github.com/hashicorp/nomad/nomad/blocked_evals_stats.go:50 1 0x0000000001d6ae6b in github.com/hashicorp/nomad/nomad.(*BlockedEvals).unblock at github.com/hashicorp/nomad/nomad/blocked_evals.go:592 2 0x0000000001d6a8f1 in github.com/hashicorp/nomad/nomad.(*BlockedEvals).watchCapacity at github.com/hashicorp/nomad/nomad/blocked_evals.go:529 3 0x0000000001d688ae in github.com/hashicorp/nomad/nomad.(*BlockedEvals).SetEnabled.func1 at github.com/hashicorp/nomad/nomad/blocked_evals.go:131 4 0x0000000000475041 in runtime.goexit at runtime/asm_amd64.s:1598 Stack: 0 0x0000000001d6d146 in github.com/hashicorp/nomad/nomad.(*BlockedStats).Unblock at github.com/hashicorp/nomad/nomad/blocked_evals_stats.go:50 1 0x0000000001d6ae6b in github.com/hashicorp/nomad/nomad.(*BlockedEvals).unblock at github.com/hashicorp/nomad/nomad/blocked_evals.go:592 2 0x0000000001d6a8f1 in github.com/hashicorp/nomad/nomad.(*BlockedEvals).watchCapacity at github.com/hashicorp/nomad/nomad/blocked_evals.go:529 3 0x0000000001d688ae in github.com/hashicorp/nomad/nomad.(*BlockedEvals).SetEnabled.func1 at github.com/hashicorp/nomad/nomad/blocked_evals.go:131 4 0x0000000000475041 in runtime.goexit at runtime/asm_amd64.s:1598 > goroutine(131): github.com/hashicorp/nomad/nomad.(*BlockedStats).Block(("*github.com/hashicorp/nomad/nomad.BlockedStats")(0xc000b54160), ("*github.com/hashicorp/nomad/nomad/structs.Evaluation")(0xc000e80f00)) "block": "block" eval.ID: "de7496b0-c727-237d-eec5-309ee2e16e33" Stack: 0 0x0000000001d6d0a6 in github.com/hashicorp/nomad/nomad.(*BlockedStats).Block at github.com/hashicorp/nomad/nomad/blocked_evals_stats.go:42 1 0x0000000001d68ca8 in github.com/hashicorp/nomad/nomad.(*BlockedEvals).processBlock at github.com/hashicorp/nomad/nomad/blocked_evals.go:197 2 0x0000000001db813e in github.com/hashicorp/nomad/nomad.(*BlockedEvals).Block at github.com/hashicorp/nomad/nomad/blocked_evals.go:152 3 0x0000000001db813e in github.com/hashicorp/nomad/nomad.(*nomadFSM).handleUpsertedEval at github.com/hashicorp/nomad/nomad/fsm.go:840 4 0x0000000001db7fef in github.com/hashicorp/nomad/nomad.(*nomadFSM).handleUpsertedEvals at github.com/hashicorp/nomad/nomad/fsm.go:827 5 0x0000000001db7fef in github.com/hashicorp/nomad/nomad.(*nomadFSM).upsertEvals at github.com/hashicorp/nomad/nomad/fsm.go:819 6 0x0000000001db7d0e in github.com/hashicorp/nomad/nomad.(*nomadFSM).applyUpdateEval at github.com/hashicorp/nomad/nomad/fsm.go:810 7 0x0000000001db1a67 in github.com/hashicorp/nomad/nomad.(*nomadFSM).Apply at github.com/hashicorp/nomad/nomad/fsm.go:232 8 0x0000000001080735 in github.com/hashicorp/raft.(*Raft).runFSM.func1 at github.com/hashicorp/raft@v1.3.11/fsm.go:98 9 0x000000000108048f in github.com/hashicorp/raft.(*Raft).runFSM.func2 at github.com/hashicorp/raft@v1.3.11/fsm.go:121 10 0x000000000107f6fc in github.com/hashicorp/raft.(*Raft).runFSM at github.com/hashicorp/raft@v1.3.11/fsm.go:237 11 0x00000000010a4986 in github.com/hashicorp/raft.(*Raft).runFSM-fm at :1 12 0x00000000010a1222 in github.com/hashicorp/raft.(*raftState).goFunc.func1 at github.com/hashicorp/raft@v1.3.11/state.go:146 13 0x0000000000475041 in runtime.goexit at runtime/asm_amd64.s:1598 Stack: 0 0x0000000001d6d0a6 in github.com/hashicorp/nomad/nomad.(*BlockedStats).Block at github.com/hashicorp/nomad/nomad/blocked_evals_stats.go:42 1 0x0000000001d68ca8 in github.com/hashicorp/nomad/nomad.(*BlockedEvals).processBlock at github.com/hashicorp/nomad/nomad/blocked_evals.go:197 2 0x0000000001db813e in github.com/hashicorp/nomad/nomad.(*BlockedEvals).Block at github.com/hashicorp/nomad/nomad/blocked_evals.go:152 3 0x0000000001db813e in github.com/hashicorp/nomad/nomad.(*nomadFSM).handleUpsertedEval at github.com/hashicorp/nomad/nomad/fsm.go:840 4 0x0000000001db7fef in github.com/hashicorp/nomad/nomad.(*nomadFSM).handleUpsertedEvals at github.com/hashicorp/nomad/nomad/fsm.go:827 5 0x0000000001db7fef in github.com/hashicorp/nomad/nomad.(*nomadFSM).upsertEvals at github.com/hashicorp/nomad/nomad/fsm.go:819 6 0x0000000001db7d0e in github.com/hashicorp/nomad/nomad.(*nomadFSM).applyUpdateEval at github.com/hashicorp/nomad/nomad/fsm.go:810 7 0x0000000001db1a67 in github.com/hashicorp/nomad/nomad.(*nomadFSM).Apply at github.com/hashicorp/nomad/nomad/fsm.go:232 8 0x0000000001080735 in github.com/hashicorp/raft.(*Raft).runFSM.func1 at github.com/hashicorp/raft@v1.3.11/fsm.go:98 9 0x000000000108048f in github.com/hashicorp/raft.(*Raft).runFSM.func2 at github.com/hashicorp/raft@v1.3.11/fsm.go:121 10 0x000000000107f6fc in github.com/hashicorp/raft.(*Raft).runFSM at github.com/hashicorp/raft@v1.3.11/fsm.go:237 11 0x00000000010a4986 in github.com/hashicorp/raft.(*Raft).runFSM-fm at :1 12 0x00000000010a1222 in github.com/hashicorp/raft.(*raftState).goFunc.func1 at github.com/hashicorp/raft@v1.3.11/state.go:146 13 0x0000000000475041 in runtime.goexit at runtime/asm_amd64.s:1598 > goroutine(131): github.com/hashicorp/nomad/nomad.(*BlockedStats).Unblock(("*github.com/hashicorp/nomad/nomad.BlockedStats")(0xc000b54160), ("*github.com/hashicorp/nomad/nomad/structs.Evaluation")(0xc000e80f00)) "unblock": "unblock" eval.ID: "de7496b0-c727-237d-eec5-309ee2e16e33" Stack: 0 0x0000000001d6d146 in github.com/hashicorp/nomad/nomad.(*BlockedStats).Unblock at github.com/hashicorp/nomad/nomad/blocked_evals_stats.go:50 1 0x0000000001d690c5 in github.com/hashicorp/nomad/nomad.(*BlockedEvals).processBlockJobDuplicate at github.com/hashicorp/nomad/nomad/blocked_evals.go:252 2 0x0000000001d68afd in github.com/hashicorp/nomad/nomad.(*BlockedEvals).processBlock at github.com/hashicorp/nomad/nomad/blocked_evals.go:176 3 0x0000000001db813e in github.com/hashicorp/nomad/nomad.(*BlockedEvals).Block at github.com/hashicorp/nomad/nomad/blocked_evals.go:152 4 0x0000000001db813e in github.com/hashicorp/nomad/nomad.(*nomadFSM).handleUpsertedEval at github.com/hashicorp/nomad/nomad/fsm.go:840 5 0x0000000001db7fef in github.com/hashicorp/nomad/nomad.(*nomadFSM).handleUpsertedEvals at github.com/hashicorp/nomad/nomad/fsm.go:827 6 0x0000000001db7fef in github.com/hashicorp/nomad/nomad.(*nomadFSM).upsertEvals at github.com/hashicorp/nomad/nomad/fsm.go:819 7 0x0000000001db7d0e in github.com/hashicorp/nomad/nomad.(*nomadFSM).applyUpdateEval at github.com/hashicorp/nomad/nomad/fsm.go:810 8 0x0000000001db1a67 in github.com/hashicorp/nomad/nomad.(*nomadFSM).Apply at github.com/hashicorp/nomad/nomad/fsm.go:232 9 0x0000000001080735 in github.com/hashicorp/raft.(*Raft).runFSM.func1 at github.com/hashicorp/raft@v1.3.11/fsm.go:98 10 0x000000000108048f in github.com/hashicorp/raft.(*Raft).runFSM.func2 at github.com/hashicorp/raft@v1.3.11/fsm.go:121 11 0x000000000107f6fc in github.com/hashicorp/raft.(*Raft).runFSM at github.com/hashicorp/raft@v1.3.11/fsm.go:237 12 0x00000000010a4986 in github.com/hashicorp/raft.(*Raft).runFSM-fm at :1 13 0x00000000010a1222 in github.com/hashicorp/raft.(*raftState).goFunc.func1 at github.com/hashicorp/raft@v1.3.11/state.go:146 14 0x0000000000475041 in runtime.goexit at runtime/asm_amd64.s:1598 Stack: 0 0x0000000001d6d146 in github.com/hashicorp/nomad/nomad.(*BlockedStats).Unblock at github.com/hashicorp/nomad/nomad/blocked_evals_stats.go:50 1 0x0000000001d690c5 in github.com/hashicorp/nomad/nomad.(*BlockedEvals).processBlockJobDuplicate at github.com/hashicorp/nomad/nomad/blocked_evals.go:252 2 0x0000000001d68afd in github.com/hashicorp/nomad/nomad.(*BlockedEvals).processBlock at github.com/hashicorp/nomad/nomad/blocked_evals.go:176 3 0x0000000001db813e in github.com/hashicorp/nomad/nomad.(*BlockedEvals).Block at github.com/hashicorp/nomad/nomad/blocked_evals.go:152 4 0x0000000001db813e in github.com/hashicorp/nomad/nomad.(*nomadFSM).handleUpsertedEval at github.com/hashicorp/nomad/nomad/fsm.go:840 5 0x0000000001db7fef in github.com/hashicorp/nomad/nomad.(*nomadFSM).handleUpsertedEvals at github.com/hashicorp/nomad/nomad/fsm.go:827 6 0x0000000001db7fef in github.com/hashicorp/nomad/nomad.(*nomadFSM).upsertEvals at github.com/hashicorp/nomad/nomad/fsm.go:819 7 0x0000000001db7d0e in github.com/hashicorp/nomad/nomad.(*nomadFSM).applyUpdateEval at github.com/hashicorp/nomad/nomad/fsm.go:810 8 0x0000000001db1a67 in github.com/hashicorp/nomad/nomad.(*nomadFSM).Apply at github.com/hashicorp/nomad/nomad/fsm.go:232 9 0x0000000001080735 in github.com/hashicorp/raft.(*Raft).runFSM.func1 at github.com/hashicorp/raft@v1.3.11/fsm.go:98 10 0x000000000108048f in github.com/hashicorp/raft.(*Raft).runFSM.func2 at github.com/hashicorp/raft@v1.3.11/fsm.go:121 11 0x000000000107f6fc in github.com/hashicorp/raft.(*Raft).runFSM at github.com/hashicorp/raft@v1.3.11/fsm.go:237 12 0x00000000010a4986 in github.com/hashicorp/raft.(*Raft).runFSM-fm at :1 13 0x00000000010a1222 in github.com/hashicorp/raft.(*raftState).goFunc.func1 at github.com/hashicorp/raft@v1.3.11/state.go:146 14 0x0000000000475041 in runtime.goexit at runtime/asm_amd64.s:1598 > goroutine(131): github.com/hashicorp/nomad/nomad.(*BlockedStats).Block(("*github.com/hashicorp/nomad/nomad.BlockedStats")(0xc000b54160), ("*github.com/hashicorp/nomad/nomad/structs.Evaluation")(0xc0005ccd80)) "block": "block" eval.ID: "7dbf0548-7404-a4e4-a0f9-a9e19627e8cb" Stack: 0 0x0000000001d6d0a6 in github.com/hashicorp/nomad/nomad.(*BlockedStats).Block at github.com/hashicorp/nomad/nomad/blocked_evals_stats.go:42 1 0x0000000001d68ca8 in github.com/hashicorp/nomad/nomad.(*BlockedEvals).processBlock at github.com/hashicorp/nomad/nomad/blocked_evals.go:197 2 0x0000000001db813e in github.com/hashicorp/nomad/nomad.(*BlockedEvals).Block at github.com/hashicorp/nomad/nomad/blocked_evals.go:152 3 0x0000000001db813e in github.com/hashicorp/nomad/nomad.(*nomadFSM).handleUpsertedEval at github.com/hashicorp/nomad/nomad/fsm.go:840 4 0x0000000001db7fef in github.com/hashicorp/nomad/nomad.(*nomadFSM).handleUpsertedEvals at github.com/hashicorp/nomad/nomad/fsm.go:827 5 0x0000000001db7fef in github.com/hashicorp/nomad/nomad.(*nomadFSM).upsertEvals at github.com/hashicorp/nomad/nomad/fsm.go:819 6 0x0000000001db7d0e in github.com/hashicorp/nomad/nomad.(*nomadFSM).applyUpdateEval at github.com/hashicorp/nomad/nomad/fsm.go:810 7 0x0000000001db1a67 in github.com/hashicorp/nomad/nomad.(*nomadFSM).Apply at github.com/hashicorp/nomad/nomad/fsm.go:232 8 0x0000000001080735 in github.com/hashicorp/raft.(*Raft).runFSM.func1 at github.com/hashicorp/raft@v1.3.11/fsm.go:98 9 0x000000000108048f in github.com/hashicorp/raft.(*Raft).runFSM.func2 at github.com/hashicorp/raft@v1.3.11/fsm.go:121 10 0x000000000107f6fc in github.com/hashicorp/raft.(*Raft).runFSM at github.com/hashicorp/raft@v1.3.11/fsm.go:237 11 0x00000000010a4986 in github.com/hashicorp/raft.(*Raft).runFSM-fm at :1 12 0x00000000010a1222 in github.com/hashicorp/raft.(*raftState).goFunc.func1 at github.com/hashicorp/raft@v1.3.11/state.go:146 13 0x0000000000475041 in runtime.goexit at runtime/asm_amd64.s:1598 Stack: 0 0x0000000001d6d0a6 in github.com/hashicorp/nomad/nomad.(*BlockedStats).Block at github.com/hashicorp/nomad/nomad/blocked_evals_stats.go:42 1 0x0000000001d68ca8 in github.com/hashicorp/nomad/nomad.(*BlockedEvals).processBlock at github.com/hashicorp/nomad/nomad/blocked_evals.go:197 2 0x0000000001db813e in github.com/hashicorp/nomad/nomad.(*BlockedEvals).Block at github.com/hashicorp/nomad/nomad/blocked_evals.go:152 3 0x0000000001db813e in github.com/hashicorp/nomad/nomad.(*nomadFSM).handleUpsertedEval at github.com/hashicorp/nomad/nomad/fsm.go:840 4 0x0000000001db7fef in github.com/hashicorp/nomad/nomad.(*nomadFSM).handleUpsertedEvals at github.com/hashicorp/nomad/nomad/fsm.go:827 5 0x0000000001db7fef in github.com/hashicorp/nomad/nomad.(*nomadFSM).upsertEvals at github.com/hashicorp/nomad/nomad/fsm.go:819 6 0x0000000001db7d0e in github.com/hashicorp/nomad/nomad.(*nomadFSM).applyUpdateEval at github.com/hashicorp/nomad/nomad/fsm.go:810 7 0x0000000001db1a67 in github.com/hashicorp/nomad/nomad.(*nomadFSM).Apply at github.com/hashicorp/nomad/nomad/fsm.go:232 8 0x0000000001080735 in github.com/hashicorp/raft.(*Raft).runFSM.func1 at github.com/hashicorp/raft@v1.3.11/fsm.go:98 9 0x000000000108048f in github.com/hashicorp/raft.(*Raft).runFSM.func2 at github.com/hashicorp/raft@v1.3.11/fsm.go:121 10 0x000000000107f6fc in github.com/hashicorp/raft.(*Raft).runFSM at github.com/hashicorp/raft@v1.3.11/fsm.go:237 11 0x00000000010a4986 in github.com/hashicorp/raft.(*Raft).runFSM-fm at :1 12 0x00000000010a1222 in github.com/hashicorp/raft.(*raftState).goFunc.func1 at github.com/hashicorp/raft@v1.3.11/state.go:146 13 0x0000000000475041 in runtime.goexit at runtime/asm_amd64.s:1598 ```

Interestingly, we see 30216c5a-d608-d321-d41b-843defbb1281 get unblocked twice, once via UnblockNode and once via watchCapacity.

Looking at processBlock, where b.stats.Block was invoked, we see that it tracks the blocked system evaluation in 3 places:

However, in UnblockNode, when we unblock the the system evaluation we only remove it from b.system but not b.captured: https://github.com/hashicorp/nomad/blob/3d63bc62b35cbe3f79cdd245d50b61f130ee1a79/nomad/blocked_evals.go#L513-L516

And in unblock which is called by watchCapacity, it iterates over all the evaluations in b.captured, which would include the system evaluation which might have already been unblocked by UnblockNode as described above: https://github.com/hashicorp/nomad/blob/3d63bc62b35cbe3f79cdd245d50b61f130ee1a79/nomad/blocked_evals.go#L566-L585

Which leads to b.stats.Unblock to be called on the same evaluation twice.

Other Potential Issues

We noticed a similar pattern happening in the following places:

We're not certain if these other cases could also be triggered in a way that cause double unblock accounting, but we thought it was worth mentioning.

Expected Result

The number of blocked evals as reported by the nomad.nomad.blocked_evals.total_blocked metric should be correct.

Actual Result

The number of blocked evals reported is incorrect.

lgfa29 commented 1 year ago

Hi @marvinchin 👋

Thank you for the detailed report! I was able to easily reproduce the issue following the steps you provided. As you noted, the miscount only happens if the blocked eval is for a system job, for service jobs the accounting seems to be correct.

A few possible solutions pop to mind, but I'm not 100% sure what's the best approach. Preventing multiple calls to Block()/Unblock() is the more direct approach, but potentially brittle as it's easy to accidentally call them multiple times again.

Making these function somewhat idempotent by having a small dedup cache in BlockedStats could help as well, but then it open questions about cache size, eviction etc.

I added this to our board for further triaging and roadmapping. Thank you again for the report!

marvinchin commented 1 year ago

Thanks for looking into this!

I agree that ensuring this can never happen in all cases sounds more involved.

I suspect that in this specific case if we call b.tracked.remove(e) in UnblockNode that should fix the issue for now. If you agree, I'm happy to submit a PR that makes this change.

lgfa29 commented 1 year ago

Hi @marvinchin, apologies for the delay here. I was familiarizing myself with this part of the code and also checking with the team for extra guidance.

I think your assertion is correct, we should be removing evals from b.capture on UnblockNode, but, more broadly, I was trying to understand the general relationship between b.jobs, b.captured, b.escaped, and b.system. Looking at the code it seems like the relations are something like this:

So b.jobs is a superset of all blocked evals, b.captured and b.escaped are disjoint sets, and b.system is subset of b.captured.

blocke_eval_1

One thing the team noticed is that it would make more sense for b.system to overlap both b.captured and b.escaped, and so this block of code should come earlier: https://github.com/hashicorp/nomad/blob/b17c0f7ff9fea7bfc465e6e0ffa959ffa2662c89/nomad/blocked_evals.go#L223-L227

But I think we can put a pin on that for now.

What these relationships illustrate is that, in order to achieve correctness, we must always keep these maps consistent, so removing something from b.capture should always be followed by a removal in b.system and b.jobs. The same apply for b.system, but, as you noticed, there are a few places that we don't do this 😅

So during the eval unblock something like this happens:

blocke_eval_2

The negative metric reported here is caused by the overlap of the two hatched areas, where we decrement the metric twice, but there also two other regions where we leave the maps in an inconsistent state because we only remove evals from b.capture or b.system:

blocke_eval_3

In practice,I'm not sure if the rightmost region actually exists because we call Unblock() in a computed class that the node ID used in UnblockNode() also has: https://github.com/hashicorp/nomad/blob/b17c0f7ff9fea7bfc465e6e0ffa959ffa2662c89/nomad/fsm.go#L465-L466

But maybe in concurrent operations we may end up with UnblockNode being called "first", in which case the leftmost region wouldn't exist.

All of these to say that I think we would need to make sure a delete(b.captured, ...) is always followed by a b.system.Remove(...), and the other way around as well. For this bug in specific it would look something like this: https://github.com/hashicorp/nomad/commit/a73cbafda1dbc34d7357e7842e033ee0013bd08d

The TotalQuotaLimit is unrelated but also seemed wrong to me, feel free to ignore it in your work 😄

Does this all make sense to you?

marvinchin commented 1 year ago

Sorry, I somehow missed this notification. Thanks for the explanation - I'll put up a PR for this soon.

lgfa29 commented 1 year ago

Awesome, thank you!

Let us know if you have any other question 🙂

marvinchin commented 1 year ago

While trying to implement this I came across something I didn't quite understand:

https://github.com/hashicorp/nomad/blob/e8efe2d251bf3628f13c7eb3ce2422eb7e5b85f6/nomad/blocked_evals.go#L367-L390

In L367 b.system.JobEvals returns a list of evals, which implies that a single system job might have multiple blocked evals. However in L376 b.jobs is a map of job id to eval, which implies that every job should have at most one blocked eval. I'm a little confused by this - do you happen to know what might be happening here?

Empirically, when I tried running a system job that had placement failures on two nodes, I only see one blocked evaluation being created which is attributed to only one of the two nodes. Dlv also shows that b.system.byJob contains only one evaluation for the single node ID. I don't happen to have the output handy, but I'm happy to reproduce this again and share the output if you think it'll be helpful.

lgfa29 commented 11 months ago

Hi @marvinchin, apologies for the delay here, Nomad 1.7 has kept me quite busy for the past weeks 😅

In L367 b.system.JobEvals returns a list of evals, which implies that a single system job might have multiple blocked evals. However in L376 b.jobs is a map of job id to eval, which implies that every job should have at most one blocked eval. I'm a little confused by this - do you happen to know what might be happening here?

I believe both assertions are correct, but they apply for different job types. System jobs have one blocked per eligible node, so, in total, a system job may have multiple blocked evals. I believe that's why there's an early return there in L372. If the job is in b.system.JobEvals then it's a system job, so it's handled differently.

The reason why a system job needs one blocked eval per eligible node is so that the reconciler is triggered if something changes in any of the nodes.

Service jobs however, can only have one blocked eval because the Nomad scheduler reconciler always processes all allocs, so a single eval will do.

Empirically, when I tried running a system job that had placement failures on two nodes, I only see one blocked evaluation being created which is attributed to only one of the two nodes.

Now, this is a bit surprising 🤔

Were both nodes configured the same? Meaning, were they both supposed to be eligible for placement of the system job?

marvinchin commented 10 months ago

Sorry for the delay here, it's been a busy couple of weeks!

Were both nodes configured the same? Meaning, were they both supposed to be eligible for placement of the system job? I recall that that's the case. Let me try to provide you with a repro that shows this in the next couple of days.

marvinchin commented 10 months ago

Here's a reproduction:

# Start a server
$ ./usr/bin/nomad agent -config server.conf
...

# Start two clients which are identical in configuration except for the ports/names
$ diff client-1.conf client-2.conf
1c1
< data_dir = "/tmp/nomad/data/client-1"
---
> data_dir = "/tmp/nomad/data/client-2"
16,18c16,18
<   http = "9901"
<   rpc = "9902"
<   serf = "9903"
---
>   http = "9801"
>   rpc = "9802"
>   serf = "9803"
$ ./usr/bin/nomad agent -config client-1.conf
...
$ ./usr/bin/nomad agent -config client-2.conf
...

# Submit a system job that requests for a static port
$ cat blocker.nomad 
job "blocker" {
  datacenters = ["dc1"]
  type        = "system"

  group "test-group" {
    task "test-task" {
      driver = "raw_exec"

      config {
        command = "/usr/bin/sleep"
        args    = ["inf"]
      }

      kill_timeout = "1h"
    }

    network {
      port "test" {
        static = 10000
      }
    }
  }
}
$ ./usr/bin/nomad job run blocker.nomad
==> 2023-12-29T18:06:11+08:00: Monitoring evaluation "79aa44b8"
    2023-12-29T18:06:11+08:00: Evaluation triggered by job "blocker"
    2023-12-29T18:06:12+08:00: Allocation "c4bbab73" created: node "c11029b2", group "test-group"
    2023-12-29T18:06:12+08:00: Allocation "e18bc370" created: node "90844561", group "test-group"
    2023-12-29T18:06:12+08:00: Evaluation status changed: "pending" -> "complete"
==> 2023-12-29T18:06:12+08:00: Evaluation "79aa44b8" finished with status "complete"

# Run another system job that requests for the same port (in our case, identical except for the job name), and see that it gets blocked
$ diff blocker.nomad blocked.nomad
1c1
< job "blocker" {
---
> job "blocked" {
$ ./usr/bin/nomad job run blocked.nomad
==> 2023-12-29T18:06:20+08:00: Monitoring evaluation "ad3046c2"
    2023-12-29T18:06:20+08:00: Evaluation triggered by job "blocked"
    2023-12-29T18:06:21+08:00: Evaluation status changed: "pending" -> "complete"
==> 2023-12-29T18:06:21+08:00: Evaluation "ad3046c2" finished with status "complete" but failed to place all allocations:
    2023-12-29T18:06:21+08:00: Task Group "test-group" (failed to place 2 allocations):

# List the allocations and see that there is only one blocked evaluation for one of the two nodes
$ ./usr/bin/nomad eval list
ID        Priority  Triggered By   Job ID   Namespace  Node ID   Status    Placement Failures
1773aff2  50        queued-allocs  blocked  default    90844561  blocked   N/A - In Progress
ad3046c2  50        job-register   blocked  default    <none>    complete  true
79aa44b8  50        job-register   blocker  default    <none>    complete  false

Then, I attached dlv to the server process and inspected the state of a goroutine inside of BlockedEvals:

(dlv) bt
0  0x0000000000442e4e in runtime.gopark
   at runtime/proc.go:399
1  0x00000000004535c5 in runtime.selectgo
   at runtime/select.go:327
2  0x0000000001db7890 in github.com/hashicorp/nomad/nomad.(*BlockedEvals).GetDuplicates
   at github.com/hashicorp/nomad/nomad/blocked_evals.go:679
3  0x0000000001e3cf05 in github.com/hashicorp/nomad/nomad.(*Server).reapDupBlockedEvaluations
   at github.com/hashicorp/nomad/nomad/leader.go:1163
4  0x0000000001e37d25 in github.com/hashicorp/nomad/nomad.(*Server).establishLeadership.func6
   at github.com/hashicorp/nomad/nomad/leader.go:386
5  0x0000000000476b21 in runtime.goexit
   at runtime/asm_amd64.s:1650
(dlv) p b.system.byJob
map[github.com/hashicorp/nomad/nomad/structs.NamespacedID]map[string]string [
    {ID: "blocked", Namespace: "default"}: [
        "90844561-7e2c-ed7d-1a4f-d16f65606b5b": "1773aff2-0ccf-27ec-9547-9ea75739786a", 
    ], 
]

Again, we see only one evaluation.

lgfa29 commented 10 months ago

Oh that's right. I think my mental model around this area was wrong. In the scenario you described only a single eval is created because the task group has already been marked as failed: https://github.com/hashicorp/nomad/blob/e7ca2b51ad1bde9d128d35108587e5eedd0b4458/scheduler/scheduler_system.go#L411-L416

But even if the job has multiple task groups only a single blocked eval is present because of the processBlockJobDuplicate method. https://github.com/hashicorp/nomad/blob/e7ca2b51ad1bde9d128d35108587e5eedd0b4458/nomad/blocked_evals.go#L234-L288

Evals added to b.duplicates are eventually cancelled by the leader in reapDupBlockedEvaluations. https://github.com/hashicorp/nomad/blob/e7ca2b51ad1bde9d128d35108587e5eedd0b4458/nomad/leader.go#L1216-L1250

So I think there may be a small windows of time where a job can have multiple blocked evals until the leader reaps the duplicated ones. b.system.JobEvals() also doesn't read directly from the state store, so there may a short window of time where it's not consistent with global state. Since b.stats.Unblock() kind of operates on the same level as b.system.JobEvals() (a kind of "cache" on top of global state) it seems OK to keep the loop there, even though we usually only expect there to be one eval.