Hi @alexdulin!
We have been observing these errors in the logs since 0.11.0 and I was hoping that the fixes to CSI issues in 0.12.2 might fix them
Just for clarity, are these batch jobs ones that had CSI volumes attached?
No, they are not. We have not had CSI volumes/plugins registered for a long time (since it was initially released). After trying out CSI volumes we started to receive many errors that only finally stopped after the 0.12.2 release (thank you very much for that, by the way), but those were different from what I posted above.
Ok, thanks. We'll look into the ACL errors you're seeing.
I just took a quick pass through the section of code involved. The GC job passes the leader's ACL token to the batch deregister RPC.
Resolving the token can't raise a "permission denied", only a "token not found" (and even that can't be hit if we have the legit leader ACL token as expected). So it looks like the permission denied error is bubbling up from (*Job) BatchDeregister, where we're checking the submit-job permissions for the namespaces.
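To make that distinction concrete, here's a minimal, self-contained sketch of the two failure modes (token resolution vs. the per-namespace submit-job check); the names and types here are illustrative stand-ins, not Nomad's actual code:

```go
package main

import (
	"errors"
	"fmt"
)

// Illustrative sketch only: these names model the behavior described above
// and are not Nomad's actual types or functions.
var (
	errTokenNotFound    = errors.New("token not found")
	errPermissionDenied = errors.New("Permission denied")
)

// resolveToken models ACL token resolution: it can fail with "token not
// found", but never with "permission denied".
func resolveToken(token string, known map[string][]string) ([]string, error) {
	caps, ok := known[token]
	if !ok {
		return nil, errTokenNotFound
	}
	return caps, nil
}

// batchDeregister models the per-namespace submit-job check: the
// "permission denied" surfaces here, after the token resolved cleanly.
func batchDeregister(caps []string, namespaces []string) error {
	allowed := make(map[string]bool)
	for _, c := range caps {
		allowed[c] = true
	}
	for _, ns := range namespaces {
		if !allowed["submit-job:"+ns] {
			return errPermissionDenied
		}
	}
	return nil
}

func main() {
	tokens := map[string][]string{"leader": {"submit-job:default"}}
	caps, err := resolveToken("leader", tokens)
	if err != nil {
		panic(err)
	}
	// A namespace the token lacks submit-job on triggers the error.
	fmt.Println(batchDeregister(caps, []string{"default", "other-ns"}))
}
```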
Is this cluster running Nomad Enterprise by chance? Or if not, did this cluster used to be running Nomad Enterprise?
No, it's running the open source version. It never was Enterprise.
Understood. The log entry says it's trying to reap 6 jobs:
core.sched: job GC found eligible objects: jobs=6 evals=0 allocs=0
Given that you said this has been going on for a while, do you have a particular set of jobs that you'd expect to have been garbage collected but aren't getting reaped? If so, would you be willing to share the results of the Job API against some of those jobs? It might help us reproduce the issue on our end.
At one point I tried to fix this by deleting all of our batch jobs, waiting for them to be fully GC'ed (waited 72 hours), and then seeing if the logs continued, and they did. Since then we have re-enabled the jobs, but there is no specific set that would strike me as causing this.
Out of curiosity, does a forced garbage collection get considered a batch job? The warnings that are being logged by the Raft leader seem to indicate that a forced GC is at least a job in the sense that it has an evaluation:
2020-08-12T18:40:56.433Z [WARN] nomad: eval reached delivery limit, marking as failed: eval="<Eval "28722e8d-c558-df05-c76d-f5edc762ea2b" JobID: "force-gc" Namespace: "-">"
When we were having the initial CSI errors, GCs were failing due to volumes still being associated with allocations (nodes failed to drain too, which has been documented in other issues). Is it possible these six "batch jobs" are six failed force GCs that are stuck in the bolt database or something?
Out of curiosity, does a forced garbage collection get considered a batch job?
That's a good insight! Almost: Nomad runs GC tasks as "core" jobs, which are similar to batch jobs but their evaluations result in internal functions being executed instead of running workloads on clients. But yes they do result in evaluations we have to process.
So from the logs, what I'm seeing is the following chain: the force-gc eval triggers the (*CoreScheduler) jobGC() function, which calls the Job.BatchDeregister RPC, which returns the ErrPermissionDenied error.
The log entry there is maybe a little misleading in wording. The thing that failed is "reaping a batch of jobs", not "reaping a batch job".
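As a rough sketch of that chain (stand-in names only, not Nomad's actual source), a core eval's JobID effectively selects an internal server-side function to run rather than scheduling work on clients:

```go
package main

import (
	"fmt"
	"strings"
)

// Rough illustration of the core-job idea described above (not Nomad's
// actual code): a core eval's JobID selects an internal function to run
// on the server, rather than placing workloads on clients.
type coreEval struct {
	JobID string
}

func processCoreEval(e coreEval) error {
	switch {
	case strings.HasPrefix(e.JobID, "force-gc"):
		return jobGC(e)
	case strings.HasPrefix(e.JobID, "csi-plugin-gc"):
		return csiPluginGC(e)
	default:
		return fmt.Errorf("unknown core job: %s", e.JobID)
	}
}

// jobGC stands in for the function that finds eligible jobs and calls the
// batch deregister RPC with the leader's ACL token; a denied RPC surfaces
// as the "failed to reap" error in the logs.
func jobGC(e coreEval) error {
	fmt.Println("reaping a batch of jobs for", e.JobID)
	return nil
}

func csiPluginGC(e coreEval) error {
	fmt.Println("garbage collecting CSI plugins for", e.JobID)
	return nil
}

func main() {
	fmt.Println(processCoreEval(coreEval{JobID: "force-gc"}))
}
```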
The ACL the core jobs get is a "management" token. I'll see if I can get the ACLs into a state where the management token is rejected for some reason.
This might be unrelated to the current issue, but after upgrading to 0.12.3 we are now seeing CSI errors in the logs again:
2020-08-16T21:29:55.601Z [DEBUG] worker: dequeued evaluation: eval_id=99cec766-26ad-b0cb-2094-082344eff89c
2020-08-16T21:29:55.601Z [DEBUG] core.sched: CSI plugin GC scanning before cutoff index: index=6553862 csi_plugin_gc_threshold=1h0m0s
2020-08-16T21:29:55.601Z [ERROR] core.sched: failed to GC plugin: plugin_id=[REDACTED] error="rpc error: Permission denied"
2020-08-16T21:29:55.601Z [ERROR] worker: error invoking scheduler: error="failed to process evaluation: rpc error: Permission denied"
2020-08-16T21:29:55.602Z [DEBUG] worker: nack evaluation: eval_id=99cec766-26ad-b0cb-2094-082344eff89c
2020-08-16T21:29:55.608Z [DEBUG] worker: dequeued evaluation: eval_id=4b006614-fc24-1e9b-f943-ed9403e8579b
2020-08-16T21:29:55.948Z [DEBUG] core.sched: CSI plugin GC scanning before cutoff index: index=6553862 csi_plugin_gc_threshold=1h0m0s
2020-08-16T21:29:55.949Z [ERROR] core.sched: failed to GC plugin: plugin_id=[REDACTED] error="rpc error: Permission denied"
2020-08-16T21:29:55.949Z [ERROR] worker: error invoking scheduler: error="failed to process evaluation: rpc error: Permission denied"
2020-08-16T21:29:55.950Z [DEBUG] worker: nack evaluation: eval_id=4b006614-fc24-1e9b-f943-ed9403e8579b
2020-08-16T21:29:56.953Z [DEBUG] worker: dequeued evaluation: eval_id=4b006614-fc24-1e9b-f943-ed9403e8579b
2020-08-16T21:29:56.953Z [DEBUG] core.sched: CSI plugin GC scanning before cutoff index: index=6553862 csi_plugin_gc_threshold=1h0m0s
2020-08-16T21:29:56.953Z [ERROR] core.sched: failed to GC plugin: plugin_id=[REDACTED] error="rpc error: Permission denied"
2020-08-16T21:29:56.954Z [ERROR] worker: error invoking scheduler: error="failed to process evaluation: rpc error: Permission denied"
2020-08-16T21:29:56.954Z [DEBUG] worker: nack evaluation: eval_id=4b006614-fc24-1e9b-f943-ed9403e8579b
That is from a follower server. We do not have any CSI plugins running or any CSI volumes registered, but we do still have many old CSI related evaluations left over from when we first tried it out, like the following:
{
"ID": "ff91f2c4-374e-f2f6-81a0-9b8e69543875",
"Namespace": "default",
"Priority": 200,
"Type": "_core",
"TriggeredBy": "failed-follow-up",
"JobID": "csi-volume-claim-gc:[REDACTED]:no",
"Status": "pending",
"Wait": 341929846579,
"PreviousEval": "a708d4e5-0a22-ab2c-d494-18d70209131f",
"CreateIndex": 6497734,
"ModifyIndex": 6497734,
"CreateTime": 1597254886540115403,
"ModifyTime": 1597254886540115403
}
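For reference, if it helps to enumerate these leftover evals, a minimal sketch using the official Nomad Go API client (github.com/hashicorp/nomad/api) might look like the following; the filter values simply mirror the fields shown above:

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

// Small helper (not part of the thread) to list lingering core evals like
// the one above; it uses the official Nomad Go API client and picks up the
// usual NOMAD_ADDR / NOMAD_TOKEN environment variables.
func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	evals, _, err := client.Evaluations().List(nil)
	if err != nil {
		log.Fatal(err)
	}

	for _, e := range evals {
		// Core evals retried via failed-follow-up are the suspect ones here.
		if e.Type == "_core" && e.TriggeredBy == "failed-follow-up" {
			fmt.Printf("%s  job=%s  status=%s\n", e.ID, e.JobID, e.Status)
		}
	}
}
```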
And after upgrading the rest of the cluster to 0.12.3, once a new leader was elected this started to show up in its logs:
2020-08-16T21:46:06.392Z [WARN] nomad: eval reached delivery limit, marking as failed: eval="<Eval "c949965a-67b3-b804-392b-d6b54e419880" JobID: "csi-plugin-gc" Namespace: "-">"
That failed-follow-up eval was the clue I needed! Thanks @alexdulin! It looks like there's a previously unknown but long-standing bug in the way we handle core jobs that have failures, which the earlier bug in CSI has made much more obvious.
Normally when an eval fails more than the delivery limit, the leader will create a new eval with the TriggeredBy field set to failed-follow-up, just as you've seen here. This eval has a long back-off under the assumption that whatever caused the failure was transient.
But evaluations for core jobs carry the leader's ACL, which is not valid on another leader after an election. The failed-follow-up evals do not have ACLs, so core job evals that fail more than the delivery limit, or core job evals that span leader elections, will never succeed and will be re-enqueued forever. So we should not be retrying core jobs with a failed-follow-up. The core jobs are all on a timer anyways, so they'll get re-done from scratch with the correct ACL.
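A toy model of that failure mode, with made-up names rather than Nomad's internals, assuming the core eval simply carries the token string captured at creation time:

```go
package main

import "fmt"

// Toy model of the failure mode described above (not Nomad's code): a core
// eval carries the leader token that was valid when it was created, but a
// newly elected leader mints a different token, so a retried or
// failed-follow-up core eval can never pass the ACL check again.
type coreEval struct {
	JobID     string
	LeaderACL string // token captured when the eval was created
}

func process(e coreEval, currentLeaderACL string) error {
	if e.LeaderACL != currentLeaderACL {
		return fmt.Errorf("rpc error: Permission denied") // what the logs show
	}
	return nil
}

func main() {
	stale := coreEval{JobID: "csi-plugin-gc", LeaderACL: "token-from-old-leader"}

	// A leader election happens and the new leader has a new token. The
	// stale eval is re-enqueued forever and always fails; dropping it and
	// letting the periodic timer create a fresh core eval with the current
	// leader's token is what lets the GC succeed again.
	fmt.Println(process(stale, "token-from-new-leader"))
}
```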
I've got https://github.com/hashicorp/nomad/pull/8682 open with a failing test that demonstrates the behavior, and I'm working up a fix for it now.
That is very interesting, and it explains why these old evaluations have been hanging around for so long. Looking forward to the fix!
The fix for this will ship in the upcoming Nomad 0.12.4.
Amazing, thank you for fixing this so quickly @tgross. Appreciate the effort.
Nomad version
Nomad v0.12.2 (ee69b3379aeced67e14943b86c4f621451e64e84)
Operating system and Environment details
Linux ip-172-31-63-158 4.15.0-1079-aws #83~16.04.1-Ubuntu SMP Tue Jul 14 15:31:26 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Issue
Nomad servers are failing to reap jobs due to RPC errors. We have been observing these errors in the logs since 0.11.0 and I was hoping that the fixes to CSI issues in 0.12.2 might resolve them, but it does not appear that is the case. It is not clear that the warnings in the leader's logs are related to what is showing up in the follower servers' logs, but they occur at the same frequency and have not stopped in months.
The cluster has full ACLs, TLS, and RPC encryption in place.
Nomad Server logs (if appropriate)
Logs from follower servers
Logs from leader