Hi @alexdulin!
We have been observing these errors in the logs since 0.11.0 and I was hoping that the fixes to CSI issues in 0.12.2 might fix them
Just for clarity, are these batch jobs ones that had CSI volumes attached?
No, they are not. We have not had CSI volumes/plugins registered for a long time (since it was initially released). After trying out CSI volumes we started to receive many errors that only finally stopped after the 0.12.2 release (thank you very much for that, by the way), but those were different from what I posted above.
Ok, thanks. We'll look into the ACL errors you're seeing.
I just took a quick pass through the section of code involved. The GC job passes the leader's ACL token to the batch deregister RPC.
Resolving the token can't raise a "permission denied", only a "token not found" (and even that can't be hit if we have the legit leader ACL token as expected). So it looks like the permission denied error is bubbling up from (*Job) BatchDeregister, where we're checking the submit-job permissions for the namespaces.
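To make that distinction concrete, here's a minimal, self-contained sketch of the two failure modes (token resolution vs. the per-namespace submit-job check); the names and types here are illustrative stand-ins, not Nomad's actual code:

```go
package main

import (
	"errors"
	"fmt"
)

// Illustrative sketch only: these names model the behavior described above
// and are not Nomad's actual types or functions.
var (
	errTokenNotFound    = errors.New("token not found")
	errPermissionDenied = errors.New("Permission denied")
)

// resolveToken models ACL token resolution: it can fail with "token not
// found", but never with "permission denied".
func resolveToken(token string, known map[string][]string) ([]string, error) {
	caps, ok := known[token]
	if !ok {
		return nil, errTokenNotFound
	}
	return caps, nil
}

// batchDeregister models the per-namespace submit-job check: the
// "permission denied" surfaces here, after the token resolved cleanly.
func batchDeregister(caps []string, namespaces []string) error {
	allowed := make(map[string]bool)
	for _, c := range caps {
		allowed[c] = true
	}
	for _, ns := range namespaces {
		if !allowed["submit-job:"+ns] {
			return errPermissionDenied
		}
	}
	return nil
}

func main() {
	tokens := map[string][]string{"leader": {"submit-job:default"}}
	caps, err := resolveToken("leader", tokens)
	if err != nil {
		panic(err)
	}
	// A namespace the token lacks submit-job on triggers the error.
	fmt.Println(batchDeregister(caps, []string{"default", "other-ns"}))
}
```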
Is this cluster running Nomad Enterprise by chance? Or if not, did this cluster used to be running Nomad Enterprise?
No, it's running the open source version. It never was Enterprise.
Understood. The log entry says it's trying to reap 6 jobs:
core.sched: job GC found eligible objects: jobs=6 evals=0 allocs=0
Given that you said this has been going on for a while, do you have a particular set of jobs that you'd expect to have been garbage collected but aren't getting reaped? If so, would you be willing to share the results of the Job API against some of those jobs? It might help us reproduce the issue on our end.
At one point I tried to fix this by deleting all of our batch jobs, waiting for them to be fully GC'ed (waited 72 hours), and then seeing if the logs continued, and they did. Since then we have re-enabled the jobs, but there is no specific set that would strike me as causing this.
Out of curiosity, does a forced garbage collection get considered a batch job? The warnings that are being logged by the Raft leader seem to indicate that a forced GC is at least a job in the sense that it has an evaluation:
2020-08-12T18:40:56.433Z [WARN] nomad: eval reached delivery limit, marking as failed: eval="<Eval "28722e8d-c558-df05-c76d-f5edc762ea2b" JobID: "force-gc" Namespace: "-">"
When we were having the initial CSI errors, GCs were failing due to volumes still being associated with allocations (nodes failed to drain too, which has been documented in other issues). Is it possible these six "batch jobs" are six failed force GCs that are stuck in the bolt database or something?
Out of curiosity, does a forced garbage collection get considered a batch job?
That's a good insight! Almost: Nomad runs GC tasks as "core" jobs, which are similar to batch jobs but their evaluations result in internal functions being executed instead of running workloads on clients. But yes they do result in evaluations we have to process.
So from the logs, what I'm seeing is the following chain: the force-gc eval triggers the (*CoreScheduler) jobGC() function, which calls the Job.BatchDeregister RPC, which returns the ErrPermissionDenied error.
The log entry there is maybe a little misleading in wording. The thing that failed is "reaping a batch of jobs", not "reaping a batch job".
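As a rough sketch of that chain (stand-in names only, not Nomad's actual source), a core eval's JobID effectively selects an internal server-side function to run rather than scheduling work on clients:

```go
package main

import (
	"fmt"
	"strings"
)

// Rough illustration of the core-job idea described above (not Nomad's
// actual code): a core eval's JobID selects an internal function to run
// on the server, rather than placing workloads on clients.
type coreEval struct {
	JobID string
}

func processCoreEval(e coreEval) error {
	switch {
	case strings.HasPrefix(e.JobID, "force-gc"):
		return jobGC(e)
	case strings.HasPrefix(e.JobID, "csi-plugin-gc"):
		return csiPluginGC(e)
	default:
		return fmt.Errorf("unknown core job: %s", e.JobID)
	}
}

// jobGC stands in for the function that finds eligible jobs and calls the
// batch deregister RPC with the leader's ACL token; a denied RPC surfaces
// as the "failed to reap" error in the logs.
func jobGC(e coreEval) error {
	fmt.Println("reaping a batch of jobs for", e.JobID)
	return nil
}

func csiPluginGC(e coreEval) error {
	fmt.Println("garbage collecting CSI plugins for", e.JobID)
	return nil
}

func main() {
	fmt.Println(processCoreEval(coreEval{JobID: "force-gc"}))
}
```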
The ACL the core jobs get is a "management" token. I'll see if I can get the ACLs into a state where the management token is rejected for some reason.
This might be unrelated to the current issue, but after upgrading to 0.12.3 we are now seeing CSI errors in the logs again:
2020-08-16T21:29:55.601Z [DEBUG] worker: dequeued evaluation: eval_id=99cec766-26ad-b0cb-2094-082344eff89c
2020-08-16T21:29:55.601Z [DEBUG] core.sched: CSI plugin GC scanning before cutoff index: index=6553862 csi_plugin_gc_threshold=1h0m0s
2020-08-16T21:29:55.601Z [ERROR] core.sched: failed to GC plugin: plugin_id=[REDACTED] error="rpc error: Permission denied"
2020-08-16T21:29:55.601Z [ERROR] worker: error invoking scheduler: error="failed to process evaluation: rpc error: Permission denied"
2020-08-16T21:29:55.602Z [DEBUG] worker: nack evaluation: eval_id=99cec766-26ad-b0cb-2094-082344eff89c
2020-08-16T21:29:55.608Z [DEBUG] worker: dequeued evaluation: eval_id=4b006614-fc24-1e9b-f943-ed9403e8579b
2020-08-16T21:29:55.948Z [DEBUG] core.sched: CSI plugin GC scanning before cutoff index: index=6553862 csi_plugin_gc_threshold=1h0m0s
2020-08-16T21:29:55.949Z [ERROR] core.sched: failed to GC plugin: plugin_id=[REDACTED] error="rpc error: Permission denied"
2020-08-16T21:29:55.949Z [ERROR] worker: error invoking scheduler: error="failed to process evaluation: rpc error: Permission denied"
2020-08-16T21:29:55.950Z [DEBUG] worker: nack evaluation: eval_id=4b006614-fc24-1e9b-f943-ed9403e8579b
2020-08-16T21:29:56.953Z [DEBUG] worker: dequeued evaluation: eval_id=4b006614-fc24-1e9b-f943-ed9403e8579b
2020-08-16T21:29:56.953Z [DEBUG] core.sched: CSI plugin GC scanning before cutoff index: index=6553862 csi_plugin_gc_threshold=1h0m0s
2020-08-16T21:29:56.953Z [ERROR] core.sched: failed to GC plugin: plugin_id=[REDACTED] error="rpc error: Permission denied"
2020-08-16T21:29:56.954Z [ERROR] worker: error invoking scheduler: error="failed to process evaluation: rpc error: Permission denied"
2020-08-16T21:29:56.954Z [DEBUG] worker: nack evaluation: eval_id=4b006614-fc24-1e9b-f943-ed9403e8579b
That is from a follower server. We do not have any CSI plugins running or any CSI volumes registered, but we do still have many old CSI related evaluations left over from when we first tried it out, like the following:
{
"ID": "ff91f2c4-374e-f2f6-81a0-9b8e69543875",
"Namespace": "default",
"Priority": 200,
"Type": "_core",
"TriggeredBy": "failed-follow-up",
"JobID": "csi-volume-claim-gc:[REDACTED]:no",
"Status": "pending",
"Wait": 341929846579,
"PreviousEval": "a708d4e5-0a22-ab2c-d494-18d70209131f",
"CreateIndex": 6497734,
"ModifyIndex": 6497734,
"CreateTime": 1597254886540115403,
"ModifyTime": 1597254886540115403
}
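For reference, if it helps to enumerate these leftover evals, a minimal sketch using the official Nomad Go API client (github.com/hashicorp/nomad/api) might look like the following; the filter values simply mirror the fields shown above:

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

// Small helper (not part of the thread) to list lingering core evals like
// the one above; it uses the official Nomad Go API client and picks up the
// usual NOMAD_ADDR / NOMAD_TOKEN environment variables.
func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	evals, _, err := client.Evaluations().List(nil)
	if err != nil {
		log.Fatal(err)
	}

	for _, e := range evals {
		// Core evals retried via failed-follow-up are the suspect ones here.
		if e.Type == "_core" && e.TriggeredBy == "failed-follow-up" {
			fmt.Printf("%s  job=%s  status=%s\n", e.ID, e.JobID, e.Status)
		}
	}
}
```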
And after upgrading the rest of the cluster to 0.12.3, once a new leader was elected this started to show up in its logs:
2020-08-16T21:46:06.392Z [WARN] nomad: eval reached delivery limit, marking as failed: eval="<Eval "c949965a-67b3-b804-392b-d6b54e419880" JobID: "csi-plugin-gc" Namespace: "-">"
That failed-follow-up eval was the clue I needed! Thanks @alexdulin! It looks like there's a previously unknown but long-standing bug in the way we handle core jobs that have failures, which the earlier bug in CSI has made much more obvious.
Normally when an eval fails more than the delivery limit, the leader will create a new eval with the TriggeredBy field set to failed-follow-up, just as you've seen here. This eval has a long back-off under the assumption that whatever caused the failure was transient.
But evaluations for core jobs carry the leader's ACL, which is not valid on another leader after an election. The failed-follow-up evals do not have ACLs, so core job evals that fail more than the delivery limit, or core job evals that span leader elections, will never succeed and will be re-enqueued forever. So we should not be retrying core jobs with a failed-follow-up. The core jobs are all on a timer anyways, so they'll get re-done from scratch with the correct ACL.
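A toy model of that failure mode, with made-up names rather than Nomad's internals, assuming the core eval simply carries the token string captured at creation time:

```go
package main

import "fmt"

// Toy model of the failure mode described above (not Nomad's code): a core
// eval carries the leader token that was valid when it was created, but a
// newly elected leader mints a different token, so a retried or
// failed-follow-up core eval can never pass the ACL check again.
type coreEval struct {
	JobID     string
	LeaderACL string // token captured when the eval was created
}

func process(e coreEval, currentLeaderACL string) error {
	if e.LeaderACL != currentLeaderACL {
		return fmt.Errorf("rpc error: Permission denied") // what the logs show
	}
	return nil
}

func main() {
	stale := coreEval{JobID: "csi-plugin-gc", LeaderACL: "token-from-old-leader"}

	// A leader election happens and the new leader has a new token. The
	// stale eval is re-enqueued forever and always fails; dropping it and
	// letting the periodic timer create a fresh core eval with the current
	// leader's token is what lets the GC succeed again.
	fmt.Println(process(stale, "token-from-new-leader"))
}
```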
I've got https://github.com/hashicorp/nomad/pull/8682 open with a failing test that demonstrates the behavior, and I'm working up a fix for it now.
That is very interesting, and it explains why these old evaluations have been hanging around for so long. Looking forward to the fix!
The fix for this will ship in the upcoming Nomad 0.12.4.
Amazing, thank you for fixing this so quickly @tgross. Appreciate the effort.
Nomad version
Nomad v0.12.2 (ee69b3379aeced67e14943b86c4f621451e64e84)
Operating system and Environment details
Linux ip-172-31-63-158 4.15.0-1079-aws #83~16.04.1-Ubuntu SMP Tue Jul 14 15:31:26 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Issue
Nomad servers are failing to reap jobs due to RPC errors. We have been observing these errors in the logs since 0.11.0 and I was hoping that the fixes to CSI issues in 0.12.2 might resolve them, but it does not appear that is the case. It is not clear that the warnings in the leader's logs are related to what is showing up in the follower servers' logs, but they occur at the same frequency and have not stopped in months.
The cluster has full ACLs, TLS, and RPC encryption in place.
Nomad Server logs (if appropriate)
Logs from follower servers
Logs from leader