f3l1x opened this issue 2 years ago
Hi @f3l1x
Thanks for using Nomad and for reporting this issue. We'll try to replicate this locally and update the issue.
Hi again @f3l1x
So a couple of things stick out to me. Since it sometimes runs and sometimes doesn't, it can be really tricky to debug. Can you include your server and client logs, ideally both from when it works and when it doesn't? Also, if you could include your server and client configs with any secrets removed, that would be really helpful. Replicating your environment as closely as we can is going to be essential.
Hi @DerekStrickland. I've verified this job just now: changing only the meta of the job, 10 times in a row, results in the same state (pending allocation -> progress_deadline).
By "sometimes runs" I meant that changing different parts of the job, such as ports or volumes, sometimes helps, but it's definitely not the way I would like to use it (randomly changing something). :-) My apologies for the confusion.
Thanks for the clarification 😄
Are you able to share the logs and configuration files with us?
I cut the logs around the failed deployment. I hope that helps. If you need anything more, just say.
Nomad server:
Jun 15 19:37:04 nmdmaster1 nomad[36658]: |
Jun 15 19:37:04 nmdmaster1 nomad[36658]: | 404 page not found
Jun 15 19:37:04 nmdmaster1 nomad[36658]:
Jun 15 19:37:34 nmdmaster1 nomad[36658]: 2022-06-15T19:37:34.479+0200 [WARN] nomad.vault: failed to contact Vault API: retry=30s
Jun 15 19:37:34 nmdmaster1 nomad[36658]: error=
Jun 15 19:37:34 nmdmaster1 nomad[36658]: | Error making API request.
Jun 15 19:37:34 nmdmaster1 nomad[36658]: |
Jun 15 19:37:34 nmdmaster1 nomad[36658]: | URL: GET https://vault.domain.tld/v1/sys/health?drsecondarycode=123&performancestandbycode=123&sealedcode=123&standbycode=12>
Jun 15 19:37:34 nmdmaster1 nomad[36658]: | Code: 404. Raw Message:
Jun 15 19:37:34 nmdmaster1 nomad[36658]: |
Jun 15 19:37:34 nmdmaster1 nomad[36658]: | 404 page not found
Jun 15 19:37:34 nmdmaster1 nomad[36658]:
Jun 15 19:37:57 nmdmaster1 nomad[36658]: 2022-06-15T19:37:57.585+0200 [ERROR] nomad.fsm: CSIVolumeClaim failed: error="volume max claims reached"
Jun 15 19:37:59 nmdmaster1 nomad[36658]: 2022-06-15T19:37:59.597+0200 [ERROR] nomad.fsm: CSIVolumeClaim failed: error="volume max claims reached"
Jun 15 19:38:03 nmdmaster1 nomad[36658]: 2022-06-15T19:38:03.608+0200 [ERROR] nomad.fsm: CSIVolumeClaim failed: error="volume max claims reached"
Jun 15 19:38:04 nmdmaster1 nomad[36658]: 2022-06-15T19:38:04.491+0200 [WARN] nomad.vault: failed to contact Vault API: retry=30s
Jun 15 19:38:04 nmdmaster1 nomad[36658]: error=
Jun 15 19:38:04 nmdmaster1 nomad[36658]: | Error making API request.
Jun 15 19:38:04 nmdmaster1 nomad[36658]: |
Jun 15 19:38:04 nmdmaster1 nomad[36658]: | URL: GET https://vault.domain.tld/v1/sys/health?drsecondarycode=123&performancestandbycode=123&sealedcode=123&standbycode=12>
Jun 15 19:38:04 nmdmaster1 nomad[36658]: | Code: 404. Raw Message:
Jun 15 19:38:04 nmdmaster1 nomad[36658]: |
Jun 15 19:38:04 nmdmaster1 nomad[36658]: | 404 page not found
Jun 15 19:38:04 nmdmaster1 nomad[36658]:
Jun 15 19:38:11 nmdmaster1 nomad[36658]: 2022-06-15T19:38:11.622+0200 [ERROR] nomad.fsm: CSIVolumeClaim failed: error="volume max claims reached"
Nomad client:
Jun 15 19:05:25 nmd2 nomad[174638]: 2022-06-15T19:05:25.340+0200 [INFO] client.gc: marking allocation for GC: alloc_id=85b9ec44-060c-712e-6548-c2064d9437f2
Jun 15 19:05:25 nmd2 nomad[174638]: 2022-06-15T19:05:25.343+0200 [INFO] agent: (runner) stopping
Jun 15 19:05:25 nmd2 nomad[174638]: 2022-06-15T19:05:25.343+0200 [INFO] agent: (runner) received finish
Jun 15 19:37:57 nmd2 nomad[174638]: 2022-06-15T19:37:57.587+0200 [ERROR] client.rpc: error performing RPC to server: error="rpc error: volume max claims reached" rpc=CSIVolume.Claim server=A.A.A.A:4647
Jun 15 19:37:57 nmd2 nomad[174638]: 2022-06-15T19:37:57.587+0200 [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: volume max claims reached" rpc=CSIVolume.Claim server=A.A.A.A:4647
Jun 15 19:37:59 nmd2 nomad[174638]: 2022-06-15T19:37:59.598+0200 [ERROR] client.rpc: error performing RPC to server: error="rpc error: rpc error: volume max claims reached" rpc=CSIVolume.Claim server=B.B.B.B:4647
Jun 15 19:37:59 nmd2 nomad[174638]: 2022-06-15T19:37:59.598+0200 [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: rpc error: volume max claims reached" rpc=CSIVolume.Claim server=B.B.B.B:4647
Jun 15 19:38:03 nmd2 nomad[174638]: 2022-06-15T19:38:03.609+0200 [ERROR] client.rpc: error performing RPC to server: error="rpc error: rpc error: volume max claims reached" rpc=CSIVolume.Claim server=C.C.C.C:4647
Jun 15 19:38:03 nmd2 nomad[174638]: 2022-06-15T19:38:03.609+0200 [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: rpc error: volume max claims reached" rpc=CSIVolume.Claim server=C.C.C.C:4647
Jun 15 19:38:11 nmd2 nomad[174638]: 2022-06-15T19:38:11.623+0200 [ERROR] client.rpc: error performing RPC to server: error="rpc error: volume max claims reached" rpc=CSIVolume.Claim server=A.A.A.A:4647
Jun 15 19:38:11 nmd2 nomad[174638]: 2022-06-15T19:38:11.623+0200 [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: volume max claims reached" rpc=CSIVolume.Claim server=A.A.A.A:4647
Jun 15 19:38:27 nmd2 nomad[174638]: 2022-06-15T19:38:27.635+0200 [ERROR] client.rpc: error performing RPC to server: error="rpc error: rpc error: volume max claims reached" rpc=CSIVolume.Claim server=B.B.B.B:4647
Jun 15 19:38:27 nmd2 nomad[174638]: 2022-06-15T19:38:27.635+0200 [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: rpc error: volume max claims reached" rpc=CSIVolume.Claim server=B.B.B.B:4647
Jun 15 19:38:59 nmd2 nomad[174638]: 2022-06-15T19:38:59.645+0200 [ERROR] client.rpc: error performing RPC to server: error="rpc error: rpc error: volume max claims reached" rpc=CSIVolume.Claim server=C.C.C.C:4647
Jun 15 19:38:59 nmd2 nomad[174638]: 2022-06-15T19:38:59.645+0200 [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: rpc error: volume max claims reached" rpc=CSIVolume.Claim server=C.C.C.C:4647
Jun 15 19:39:59 nmd2 nomad[174638]: 2022-06-15T19:39:59.655+0200 [ERROR] client.rpc: error performing RPC to server: error="rpc error: volume max claims reached" rpc=CSIVolume.Claim server=A.A.A.A:4647
Jun 15 19:39:59 nmd2 nomad[174638]: 2022-06-15T19:39:59.656+0200 [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: volume max claims reached" rpc=CSIVolume.Claim server=A.A.A.A:4647
@DerekStrickland Just an idea, but is it possible that we're facing this trouble because we use single-node-writer in CSI? And multiple canary instances require two allocations at the same time, before the first one goes away?
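For reference, a minimal sketch of the combination in question - a canary update together with a CSI volume claimed as single-node-writer (job, volume and image names are illustrative, not taken from the real job):

```hcl
job "web" {
  group "app" {
    count = 1

    # A canary deployment starts a new allocation alongside the old one.
    update {
      canary            = 1
      auto_promote      = false
      progress_deadline = "10m"
    }

    # The volume is claimed single-node-writer, so the canary would need a
    # second simultaneous claim on the same volume while the original
    # allocation still holds it.
    volume "data" {
      type            = "csi"
      source          = "data"
      access_mode     = "single-node-writer"
      attachment_mode = "file-system"
    }

    task "app" {
      driver = "docker"

      config {
        image = "nginx:alpine"
      }

      volume_mount {
        volume      = "data"
        destination = "/data"
      }
    }
  }
}
```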
That's a really interesting theory. The "max claims reached" log message might point to just that. I'm cc'ing my colleague @tgross for a consultation 😄
Hi folks. Yes, single-node-writer isn't going to be compatible with canaries at all. The canary instance needs to mount the volume in order to claim it, which we can't support for a volume that's only allowed to have one allocation mounting it.
We already validate that the per_alloc flag isn't compatible with canaries and reject it at the time of job submission. We should probably tweak this so that it's checked for all volumes that are single-node-* mounts instead. I'm going to reword the title of this issue and mark it as a roadmap item. Thanks!
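For reference, a hedged sketch of the two group-level volume claims being contrasted (the field names are real jobspec options, the values are illustrative): the per_alloc form is what job submission already rejects when canaries are used, and the plain single-node-writer claim is what the proposed tweak would also catch.

```hcl
# Already rejected at submission time when the group uses canaries:
volume "data" {
  type            = "csi"
  source          = "data"
  per_alloc       = true   # claims data[0], data[1], ... per allocation
  access_mode     = "single-node-writer"
  attachment_mode = "file-system"
}

# Proposed: also reject this combination with canaries, since only one
# allocation can hold the read/write claim at any given time.
volume "data" {
  type            = "csi"
  source          = "data"
  access_mode     = "single-node-writer"
  attachment_mode = "file-system"
}
```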
Hi @tgross, thank you.
Can you please clarify what the correct usage of multi-node-single-writer is? I understand the rest, but I don't know where this one would be used.
Can you please clarify what the correct usage of multi-node-single-writer is? I understand the rest, but I don't know where this one would be used.
That's a volume that can accept multiple readers but only a single writer. As noted in the access_mode docs, support for this is controlled by the storage provider (and CSI plugin). I'm going to be honest and say I'm not sure I know of any examples of storage providers where that option is available. Most support only single-node-*, and the ones that support multi-node usually support multiple writers as well (e.g. NFS).
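For anyone following along, the access mode is part of the volume's declared capabilities; a rough sketch of a volume specification (the IDs, names and plugin ID are placeholders):

```hcl
# volume.hcl - registered with `nomad volume register volume.hcl`
id          = "shared-data"
name        = "shared-data"
type        = "csi"
plugin_id   = "nfs"
external_id = "shared-data"   # placeholder; the ID known to the storage provider

capability {
  # One node, one read/write claim at a time.
  access_mode     = "single-node-writer"
  attachment_mode = "file-system"
}

capability {
  # Readers on many nodes, a single writer - only meaningful if the
  # storage provider and CSI plugin actually support it.
  access_mode     = "multi-node-single-writer"
  attachment_mode = "file-system"
}
```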
Yes, single-node-writer isn't going to be compatible with canaries at all. The canary instance needs to mount the volume in order to claim it, which we can't support for a volume that's only allowed to have one allocation mounting it.
AFAIK CSI SINGLE_NODE_WRITER allows multiple allocations on the same host (node).
// Can only be published once as read/write on a single node, at
// any given time.
SINGLE_NODE_WRITER = 1;
That means that if a bunch of workloads are dispatched to the same Nomad client, it should work for a SINGLE_NODE_WRITER (RWO) volume - they're still single-writer from the host perspective, not from the process perspective.
Hi @scaleoutsean. That's not my reading of the spec: "published" is a different state than "staged", and publishing includes the association with a specific workload.
@tgross okay, then - even though our opinions differ, it's useful to know how the spec is understood by Nomad.
If you're willing to entertain the possibility of the spec being wrong or not clear: yes, publishing is different, but as you said that's an association with a workload (not covered by the spec), whereas SINGLE vs MULTI describes the number of worker nodes where that may happen (covered by the spec).
If a volume has a single host FS with just one file, write.txt, and is published twice to two workloads running on the same node (i.e. a SINGLE_NODE writer), where one workload consists of echo $(date) >> write.txt while the other does tail -f write.txt, it's easy to see there's nothing wrong with that. Or, in a milder version, both workloads just serve the same static web site.
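A rough sketch of that two-workloads-on-one-node scenario as a Nomad job, purely to illustrate the argument (names are made up; this is the configuration being argued for, where the second claim on a single-node-writer volume is what currently fails, and the sketch assumes both groups happen to land on the same client):

```hcl
job "shared-volume-demo" {
  group "writer" {
    volume "data" {
      type            = "csi"
      source          = "shared-data"
      access_mode     = "single-node-writer"
      attachment_mode = "file-system"
    }

    task "append" {
      driver = "docker"
      config {
        image   = "alpine"
        command = "/bin/sh"
        args    = ["-c", "while true; do date >> /data/write.txt; sleep 1; done"]
      }
      volume_mount {
        volume      = "data"
        destination = "/data"
      }
    }
  }

  group "reader" {
    volume "data" {
      type            = "csi"
      source          = "shared-data"
      access_mode     = "single-node-writer"
      attachment_mode = "file-system"
    }

    task "tail" {
      driver = "docker"
      config {
        image   = "alpine"
        command = "/bin/sh"
        args    = ["-c", "tail -f /data/write.txt"]
      }
      volume_mount {
        volume      = "data"
        destination = "/data"
        read_only   = true
      }
    }
  }
}
```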
In fact each pod could even write to a different file. Imagine two (stand-alone) MinIO containers allowing uploads to the same filesystem (last writer wins), while reads would be parallel. This also wouldn't be a MULTI-node setup and should be allowed by Nomad, IMO. What wouldn't work is if one of the pods died and Nomad tried to reschedule it on another worker, at which point the volume couldn't be published a second time, because that would become a MULTI-node situation.
And even in a "worst case" scenario where multiple workloads write to the same file that's workable as well, as long as workloads are smart (lock a byte range for modifications, or lazily obtain a write lock only when writing). That's no different on how it works on a Linux VM where multiple applications log to the same file without there being a MULTI
writer-capable filesystem (cluster file system or NFS) underneath.
It's beneficial to Nomad users if they can schedule multiple workloads that use the same volume on a host. If you have a single-host filesystem, you can't work on it in parallel (if the spec is understood to mean "single workload"), even though the host may have plenty of resources to allow parallel execution (e.g. parametrized batch jobs). A second workload that tries to obtain an exclusive lock on a file already locked by the first workload couldn't start, but that's expected and consistent with VM or bare-metal environments - if one tries to start two PostgreSQL instances using the same data and log files and that doesn't work, they probably won't argue it's a PostgreSQL bug.
Related to this issue, I haven't looked at how the provisioner used by OP works, but "sometimes it's happing [sic!]" indicates there's no problem; if you get lucky and the second pod that uses the volume gets scheduled on the same worker where the existing workload is - it'll work.
If you're willing to entertain the possibility of the spec being wrong or not clear:
In my experience that's, uh, definitely a possibility. 😀 So we're absolutely open to discussing it. As it turns out, there's an open issue in the spec repo that covers exactly this case: https://github.com/container-storage-interface/spec/issues/178 which suggests that you're not alone in wanting this.
And even in a "worst case" scenario where multiple workloads write to the same file that's workable as well, as long as workloads are smart (lock a byte range for modifications, or lazily obtain a write lock only when writing). That's no different on how it works on a Linux VM where multiple applications log to the same file without there being a MULTI writer-capable filesystem (cluster file system or NFS) underneath.
Totally agreed that the application could own the "who's writing?" semantics. The application developer knows way more about the usage pattern than the orchestrator (Nomad in this case) can possibly know. But there's benefit in our being conservative here and protecting users from corrupting their data by imposing the requirement that their application be aware of these semantics. And I think that's what makes it a harder sell for us.
That being said, if https://github.com/container-storage-interface/spec/issues/178 ends up getting resolved, we'll most likely end up needing to support that approach anyways. We'll likely need to get around to fixing https://github.com/hashicorp/nomad/issues/11798 as well.
Related to this issue, I haven't looked at how the provisioner used by OP works, but "sometimes it's happing [sic!]" indicates there's no problem; if you get lucky and the second pod that uses the volume gets scheduled on the same worker where the existing workload is - it'll work.
Yeah, something like the distinct_hosts field seems like it should help here, but I'm not sure off the top of my head whether it works across job versions; I'd have to dig into that.
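For reference, the documented shape of that constraint, which forces each allocation of a job onto a different client node; whether (or how) it applies across job versions is exactly the open question above:

```hcl
constraint {
  operator = "distinct_hosts"
  value    = "true"
}
```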
@tgross - for reference only, K8s seems to have implemented a "workaround" for this outside of CSI, with ReadWriteOncePod. A separate enhancement request/issue can be created if we want something similar in Nomad.
@scaleoutsean thanks for that reference. It looks like the k8s folks have at least at some point nudged the spec folks about this very issue: https://github.com/container-storage-interface/spec/issues/465#issuecomment-739046063. Our team is currently focused on getting Nomad 1.4.0 out the door, but I think this is worth discussing further here once we've got some breathing room.
Nomad version
1.3.1
Operating system and Environment details
Debian 11.3
Issue
Hi ✋
We're running Nomad 1.3.1 with 3 Nomad masters, 3 Nomad clients, Consul, Traefik and the NFS CSI plugin.
We are seeing an allocation stuck in the pending state forever. It ends when the progress_deadline (10m) is hit and the deployment fails.
I am not sure if it's related to CSI; we are using NFS (https://gitlab.com/rocketduck/csi-plugin-nfs). But maybe it's not - sometimes it happens with CSI and sometimes without.
Reproduction steps
Take a look at the job file. If I change only the metadata from version=10 to version=20, it gets stuck: pending until progress_deadline. Whether I change the port to a static or a dynamic port does not matter. Sometimes it surprisingly works. :-)
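To be concrete, the only difference between the two job versions is the meta block, roughly like this (a sketch; the rest of the job is unchanged):

```hcl
group "app" {
  meta {
    version = "20"   # previously "10"; nothing else in the job changes
  }
}
```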
Expected Result
Deployment succeeds.
Actual Result
Deployment fails. The allocation is still pending.
Job file (if appropriate)