hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

CSI: validate that single-node mounts aren't used with canaries #13380

Open f3l1x opened 2 years ago

f3l1x commented 2 years ago

Nomad version

1.3.1

Operating system and Environment details

Debian 11.3

Issue

Hi ✋

We're running Nomad 1.3.1 with 3 Nomad servers, 3 Nomad clients, Consul, Traefik, and an NFS CSI plugin.

We are seeing an allocation stuck in the pending state indefinitely. It ends with the progress_deadline (10m) expiring and a failed deployment.

I am not sure if it's related to CSI; we are using NFS (https://gitlab.com/rocketduck/csi-plugin-nfs). But maybe it's not, since sometimes it happens with CSI and sometimes without.


Reproduction steps

Take a look at the job file below. If I change only the metadata from version=10 to version=20, the deployment gets stuck: the new allocation stays pending until the progress_deadline.

Whether I use a static or a dynamic port does not matter. Sometimes it surprisingly works. :-)

Expected Result

The deployment succeeds.

Actual Result

The deployment fails and the allocation is still pending.

Job file (if appropriate)

job "canary" {
  type        = "service"
  datacenters = ["dc1"]

  meta {
    version = 10
  }

  # Canary update strategy - the canary allocation also needs to claim the CSI volume below.
  update {
    canary       = 1
    max_parallel = 1
    health_check = "checks"
    auto_revert  = true
    auto_promote = true
  }

  group "server" {
    count = 1

    network {
      port "http" { to = 3001 }
    }

    volume "canary-data" {
      type            = "csi"
      source          = "canary-data-volume"
      attachment_mode = "file-system"
      access_mode     = "single-node-writer"
    }

    task "echo" {
      driver = "docker"
      config {
        image = "hashicorp/http-echo:latest"
        args  = [
          "-listen", ":${NOMAD_PORT_http}",
          "-text", "Hello world! IP ${NOMAD_IP_http} and PORT ${NOMAD_PORT_http}",
        ]
        ports = ["http"]
      }

      resources {
        cpu    = 128
        memory = 128
      }

      volume_mount {
        volume      = "canary-data"
        destination = "/app/data"
      }

      service {
        port = "http"

        tags = [
          "traefik.enable=true",
          "traefik.http.routers.${NOMAD_JOB_ID}.rule=Host(`canary.domain.tld`)"
        ]

        check {
          type     = "http"
          path     = "/"
          interval = "10s"
          timeout  = "2s"
        }
      }
    }
  }
}
DerekStrickland commented 2 years ago

Hi @f3l1x

Thanks for using Nomad and for reporting this issue. We'll try to replicate this locally and update the issue.

DerekStrickland commented 2 years ago

Hi again @f3l1x

So a couple of things stick out to me. Since it sometimes runs and sometimes doesn't, it can be really tricky to debug. Can you include your server and client logs, ideally from both when it works and when it doesn't? Also, if you could include your server and client configs with any secrets removed, that would be really helpful. Replicating your environment as closely as we can is going to be essential.

f3l1x commented 2 years ago

Hi @DerekStrickland. I've verified this job just now: changing only the job's meta 10 times results in the same state every time (pending allocation -> progress_deadline).

By "sometimes runs" I meant that changing different parts of the job, such as ports or volumes, sometimes helps, but randomly changing things is definitely not how I'd like to use it. :-) My apologies for the confusion.

DerekStrickland commented 2 years ago

Thanks for the clarification 😄

Are you able to share the logs and configuration files with us?

f3l1x commented 2 years ago

I cut the logs around the failed deployment. I hope that helps. If you need anything more, just say so.


Nomad server:

Jun 15 19:37:04 nmdmaster1 nomad[36658]:   |
Jun 15 19:37:04 nmdmaster1 nomad[36658]:   | 404 page not found
Jun 15 19:37:04 nmdmaster1 nomad[36658]:
Jun 15 19:37:34 nmdmaster1 nomad[36658]:     2022-06-15T19:37:34.479+0200 [WARN]  nomad.vault: failed to contact Vault API: retry=30s
Jun 15 19:37:34 nmdmaster1 nomad[36658]:   error=
Jun 15 19:37:34 nmdmaster1 nomad[36658]:   | Error making API request.
Jun 15 19:37:34 nmdmaster1 nomad[36658]:   |
Jun 15 19:37:34 nmdmaster1 nomad[36658]:   | URL: GET https://vault.domain.tld/v1/sys/health?drsecondarycode=123&performancestandbycode=123&sealedcode=123&standbycode=12>
Jun 15 19:37:34 nmdmaster1 nomad[36658]:   | Code: 404. Raw Message:
Jun 15 19:37:34 nmdmaster1 nomad[36658]:   |
Jun 15 19:37:34 nmdmaster1 nomad[36658]:   | 404 page not found
Jun 15 19:37:34 nmdmaster1 nomad[36658]:
Jun 15 19:37:57 nmdmaster1 nomad[36658]:     2022-06-15T19:37:57.585+0200 [ERROR] nomad.fsm: CSIVolumeClaim failed: error="volume max claims reached"
Jun 15 19:37:59 nmdmaster1 nomad[36658]:     2022-06-15T19:37:59.597+0200 [ERROR] nomad.fsm: CSIVolumeClaim failed: error="volume max claims reached"
Jun 15 19:38:03 nmdmaster1 nomad[36658]:     2022-06-15T19:38:03.608+0200 [ERROR] nomad.fsm: CSIVolumeClaim failed: error="volume max claims reached"
Jun 15 19:38:04 nmdmaster1 nomad[36658]:     2022-06-15T19:38:04.491+0200 [WARN]  nomad.vault: failed to contact Vault API: retry=30s
Jun 15 19:38:04 nmdmaster1 nomad[36658]:   error=
Jun 15 19:38:04 nmdmaster1 nomad[36658]:   | Error making API request.
Jun 15 19:38:04 nmdmaster1 nomad[36658]:   |
Jun 15 19:38:04 nmdmaster1 nomad[36658]:   | URL: GET https://vault.domain.tld/v1/sys/health?drsecondarycode=123&performancestandbycode=123&sealedcode=123&standbycode=12>
Jun 15 19:38:04 nmdmaster1 nomad[36658]:   | Code: 404. Raw Message:
Jun 15 19:38:04 nmdmaster1 nomad[36658]:   |
Jun 15 19:38:04 nmdmaster1 nomad[36658]:   | 404 page not found
Jun 15 19:38:04 nmdmaster1 nomad[36658]:
Jun 15 19:38:11 nmdmaster1 nomad[36658]:     2022-06-15T19:38:11.622+0200 [ERROR] nomad.fsm: CSIVolumeClaim failed: error="volume max claims reached"

Nomad client:

Jun 15 19:05:25 nmd2 nomad[174638]:     2022-06-15T19:05:25.340+0200 [INFO]  client.gc: marking allocation for GC: alloc_id=85b9ec44-060c-712e-6548-c2064d9437f2
Jun 15 19:05:25 nmd2 nomad[174638]:     2022-06-15T19:05:25.343+0200 [INFO]  agent: (runner) stopping
Jun 15 19:05:25 nmd2 nomad[174638]:     2022-06-15T19:05:25.343+0200 [INFO]  agent: (runner) received finish
Jun 15 19:37:57 nmd2 nomad[174638]:     2022-06-15T19:37:57.587+0200 [ERROR] client.rpc: error performing RPC to server: error="rpc error: volume max claims reached" rpc=CSIVolume.Claim server=A.A.A.A:4647
Jun 15 19:37:57 nmd2 nomad[174638]:     2022-06-15T19:37:57.587+0200 [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: volume max claims reached" rpc=CSIVolume.Claim server=A.A.A.A:4647
Jun 15 19:37:59 nmd2 nomad[174638]:     2022-06-15T19:37:59.598+0200 [ERROR] client.rpc: error performing RPC to server: error="rpc error: rpc error: volume max claims reached" rpc=CSIVolume.Claim server=B.B.B.B:4647
Jun 15 19:37:59 nmd2 nomad[174638]:     2022-06-15T19:37:59.598+0200 [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: rpc error: volume max claims reached" rpc=CSIVolume.Claim server=B.B.B.B:4647
Jun 15 19:38:03 nmd2 nomad[174638]:     2022-06-15T19:38:03.609+0200 [ERROR] client.rpc: error performing RPC to server: error="rpc error: rpc error: volume max claims reached" rpc=CSIVolume.Claim server=C.C.C.C:4647
Jun 15 19:38:03 nmd2 nomad[174638]:     2022-06-15T19:38:03.609+0200 [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: rpc error: volume max claims reached" rpc=CSIVolume.Claim server=C.C.C.C:4647
Jun 15 19:38:11 nmd2 nomad[174638]:     2022-06-15T19:38:11.623+0200 [ERROR] client.rpc: error performing RPC to server: error="rpc error: volume max claims reached" rpc=CSIVolume.Claim server=A.A.A.A:4647
Jun 15 19:38:11 nmd2 nomad[174638]:     2022-06-15T19:38:11.623+0200 [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: volume max claims reached" rpc=CSIVolume.Claim server=A.A.A.A:4647
Jun 15 19:38:27 nmd2 nomad[174638]:     2022-06-15T19:38:27.635+0200 [ERROR] client.rpc: error performing RPC to server: error="rpc error: rpc error: volume max claims reached" rpc=CSIVolume.Claim server=B.B.B.B:4647
Jun 15 19:38:27 nmd2 nomad[174638]:     2022-06-15T19:38:27.635+0200 [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: rpc error: volume max claims reached" rpc=CSIVolume.Claim server=B.B.B.B:4647
Jun 15 19:38:59 nmd2 nomad[174638]:     2022-06-15T19:38:59.645+0200 [ERROR] client.rpc: error performing RPC to server: error="rpc error: rpc error: volume max claims reached" rpc=CSIVolume.Claim server=C.C.C.C:4647
Jun 15 19:38:59 nmd2 nomad[174638]:     2022-06-15T19:38:59.645+0200 [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: rpc error: volume max claims reached" rpc=CSIVolume.Claim server=C.C.C.C:4647
Jun 15 19:39:59 nmd2 nomad[174638]:     2022-06-15T19:39:59.655+0200 [ERROR] client.rpc: error performing RPC to server: error="rpc error: volume max claims reached" rpc=CSIVolume.Claim server=A.A.A.A:4647
Jun 15 19:39:59 nmd2 nomad[174638]:     2022-06-15T19:39:59.656+0200 [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: volume max claims reached" rpc=CSIVolume.Claim server=A.A.A.A:4647
f3l1x commented 2 years ago

@DerekStrickland Just an idea, but is it possible that we're running into this because we use single-node-writer in CSI, and a canary deployment requires two allocations to exist at the same time before the first one goes away?

DerekStrickland commented 2 years ago

That's a really interesting theory. The "max claims reached" log message might point to just that. I'm cc'ing my colleague @tgross for a consultation 😄

tgross commented 2 years ago

Hi folks. Yes, single-node-writer isn't going to be compatible with canaries at all. The canary instance needs to mount the volume in order to claim it, which we can't support for a volume that's only allowed to have one allocation mounting it.

We already validate that the per_alloc flag isn't compatible with canaries and reject it at the time of job submission. We should probably tweak this so that it's checked for all volumes that are single-node-* mounts instead. I'm going to reword the title of this issue and mark it as a roadmap item. Thanks!
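For illustration only (not a statement from the Nomad team): while that validation lands, a minimal sketch of one way to avoid the conflict, assuming the volume has to stay single-node-writer, is to disable canaries for the group that mounts it, so the old allocation releases its claim before the replacement tries to claim the volume.

  update {
    canary       = 0        # no canary allocation, so only one claim exists at a time
    max_parallel = 1
    health_check = "checks"
    auto_revert  = true
    # auto_promote is omitted - it only applies when canaries are used
  }

This trades blue/green-style canary verification for a plain rolling update, which is the cost of keeping the single-node-writer access mode.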

f3l1x commented 2 years ago

Hi @tgross, thank you.

Can you please clarify the correct usage of multi-node-single-writer? I understand the rest, but I don't know where this one would be used.

tgross commented 2 years ago

Can you please clarify the correct usage of multi-node-single-writer? I understand the rest, but I don't know where this one would be used.

That's a volume that can accept multiple readers but only a single writer. As noted in the access_mode docs, support for this is controlled by the storage provider (and CSI plugin). I'm going to be honest and say I'm not sure I know of any examples of storage providers where that option is available. Most support only single-node-*, and the ones that support multi-node usually support multiple writers as well (e.g. NFS).
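For illustration, since NFS generally does support multiple writers, here is a hedged sketch of a volume specification (for nomad volume create or nomad volume register) that avoids the single-node restriction by using multi-node-multi-writer. The plugin_id is an assumption, and whether this access mode is actually advertised depends on the rocketduck NFS plugin.

id        = "canary-data-volume"
name      = "canary-data-volume"
type      = "csi"
plugin_id = "nfs"   # assumed plugin ID - use whatever ID the plugin job registers

capability {
  access_mode     = "multi-node-multi-writer"   # NFS typically allows multiple writers
  attachment_mode = "file-system"
}

The group's volume block in the job would then use the matching access_mode, so the canary and the original allocation could both claim the volume.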

scaleoutsean commented 2 years ago

Yes, single-node-writer isn't going to be compatible with canaries at all. The canary instance needs to mount the volume in order to claim it, which we can't support for a volume that's only allowed to have one allocation mounting it.

AFAIK CSI SINGLE_NODE_WRITER allows multiple allocations on the same host (node).

  // Can only be published once as read/write on a single node, at
  // any given time.
  SINGLE_NODE_WRITER = 1;

That means that if a bunch of workloads are dispatched to the same Nomad client, that should work for a SINGLE_NODE_WRITER (RWO) volume - they're still a single writer from the host's perspective, not from the process's perspective.

tgross commented 2 years ago

Hi @scaleoutsean. That's not my reading of the spec: "published" is a different state than "staged", and publishing includes the association with a specific workload.

scaleoutsean commented 1 year ago

@tgross okay, then - even though our opinions differ, it's useful to know how the spec is understood by Nomad.

If you're willing to entertain the possibility of the spec being wrong or not clear: yes, publishing is different, but as you said that's an association with a workload (not covered by the spec) whereas SINGLE vs MULTI describes the number of worker nodes where that may happen (covered by the spec).

If a volume has a single-host FS with just one file, write.txt, and it is published twice to two workloads running on the same node (i.e. a SINGLE_NODE writer), where one workload runs echo $(date) >> write.txt while the other runs tail -f write.txt, it's easy to see that there's nothing wrong with that. Or, in a milder version, both workloads just serve the same static web site.

In fact, each pod could even write to a different file. Imagine two (stand-alone) MinIO containers allowing uploads to the same filesystem (last writer wins), while reads happen in parallel. This also wouldn't be a MULTI-node setup and should be allowed by Nomad, IMO. What wouldn't work is if one of the pods died and Nomad tried to reschedule it on another worker, at which point the volume couldn't be published a second time, because that would become a MULTI-node situation.

And even in a "worst case" scenario where multiple workloads write to the same file that's workable as well, as long as workloads are smart (lock a byte range for modifications, or lazily obtain a write lock only when writing). That's no different on how it works on a Linux VM where multiple applications log to the same file without there being a MULTI writer-capable filesystem (cluster file system or NFS) underneath.

It's beneficial to Nomad users if they can schedule multiple workloads that use the same volume on a host. If the spec is understood to mean "single workload", then with a single-host filesystem you can't work on it in parallel, even though the host may have plentiful resources that would allow parallel execution (e.g. parametrized batch jobs). A second workload that tries to obtain an exclusive lock on a file already locked by the first workload couldn't start, but that is expected and consistent with VM or bare-metal environments - if someone tries to start two PostgreSQL instances using the same data and log files and that doesn't work, they probably won't argue it's a PostgreSQL bug.

Related to this issue, I haven't looked at how the provisioner the OP uses works, but the "sometimes it happens" observation indicates there's no fundamental problem; if you get lucky and the second pod that uses the volume gets scheduled on the same worker as the existing workload, it'll work.

tgross commented 1 year ago

If you're willing to entertain the possibility of the spec being wrong or not clear:

In my experience that's, uh, definitely a possibility. 😀 So we're absolutely open to discussing it. As it turns out, there's an open issue in the spec repo that covers exactly this case: https://github.com/container-storage-interface/spec/issues/178 which suggests that you're not alone in wanting this.

And even in a "worst case" scenario where multiple workloads write to the same file that's workable as well, as long as workloads are smart (lock a byte range for modifications, or lazily obtain a write lock only when writing). That's no different on how it works on a Linux VM where multiple applications log to the same file without there being a MULTI writer-capable filesystem (cluster file system or NFS) underneath.

Totally agreed that the application could own the "who's writing?" semantics. The application developer knows way more about the usage pattern than the orchestrator (Nomad in this case) can possibly know. But there's benefit in our being conservative here and protecting users from corrupting their data by imposing the requirement that their application be aware of these semantics. And I think that's what makes it a harder sell for us.

That being said, if https://github.com/container-storage-interface/spec/issues/178 ends up getting resolved, we'll most likely end up needing to support that approach anyways. We'll likely need to get around to fixing https://github.com/hashicorp/nomad/issues/11798 as well.

Related to this issue, I haven't looked at how the provisioner the OP uses works, but the "sometimes it happens" observation indicates there's no fundamental problem; if you get lucky and the second pod that uses the volume gets scheduled on the same worker as the existing workload, it'll work.

Yeah, something like the distinct_hosts field seems like it should help here, but I'm not sure off the top of my head that it works across job versions; I'd have to dig into that.
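For reference, the distinct_hosts field mentioned above is expressed as a placement constraint in the job spec. Whether it interacts with canaries across job versions is exactly the open question here, so treat this as an untested sketch rather than a fix:

  constraint {
    operator = "distinct_hosts"
    value    = "true"
  }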

scaleoutsean commented 1 year ago

@tgross - for reference only, K8s seems to have implemented a "workaround" for this outside of CSI, with ReadWriteOncePod. A separate enhancement request/issue can be created if we want something similar in Nomad.

tgross commented 1 year ago

@scaleoutsean thanks for that reference. It looks like the k8s folks have at least at some point nudged the spec folks about this very issue: https://github.com/container-storage-interface/spec/issues/465#issuecomment-739046063. Our team is currently focused on getting Nomad 1.4.0 out the door, but I think this is worth having further discussion about here once we've got some breathing room.