nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0

Pod volumeClaim Option for `readOnly: true` creating write volumes #4530

Closed peter-fsdx closed 11 months ago

peter-fsdx commented 11 months ago

Bug report

Expected behavior and actual behavior

I am submitting Nextflow jobs to run on a GKE Kubernetes cluster. I have a persistent volume that I want to mount to every pod created by Nextflow. The volume is a GCE persistent disk deployed using the pd.csi.storage.gke.io driver. Per the documentation, I added a pod directive with a volumeClaim and readOnly: true to my configuration. My expectation was that the volume would be mounted to each pod as a read-only volume, but instead readOnly is set to false.
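A sketch of the relevant part of my nextflow.config (the claim name and mount path match the sanitized values used throughout this issue):

```groovy
// nextflow.config (sketch): mount an existing PVC read-only into every task pod
process {
    pod = [
        [volumeClaim: 'references-pvc', mountPath: '/mnt/references', readOnly: true]
    ]
}
```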

This only causes a problem when the volume is mounted on multiple nodes (GCE can only mount a PD with write permissions to one VM at a time). The volume mounts to pods on the first node provisioned. However, pods on subsequent nodes cannot access the volume because the mount is in use by the first node. Pods on the first node run, but pods on subsequent nodes are stuck in a ContainerCreating state.

Steps to reproduce the problem

My yaml file

---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: references-pv
spec:
  storageClassName: "premium-rwo"
  capacity:
    storage: 1000G
  accessModes:
    - ReadOnlyMany
  claimRef:
    namespace: default
    name: references-pvc
  csi:
    driver: pd.csi.storage.gke.io
    volumeHandle: projects/tall-hippo-17895/zones/us-central1-a/disks/pod-references
    fsType: ext4
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  namespace: default
  name: references-pvc
spec:
  storageClassName: "premium-rwo"
  volumeName: references-pv
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 1000G

Program output

Pods stuck in ContainerCreating state.

kubectl describe po returns these relevant sections, where the pod has readOnly set to false and the events contain a warning that the disk cannot be attached because it is in use by another machine:

Volumes:
  vol-41:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName: references-pvc
    ReadOnly:   false
Events:
  Type     Reason              Age                From                     Message
  ----     ------              ----               ----                     -------
  ...
  Warning  FailedAttachVolume  55s                attachdetach-controller  AttachVolume.Attach failed for volume "references-pv" : rpc error: code = Internal desc = Failed to Attach: failed when waiting for zonal op: rpc error: code = InvalidArgument desc = operation operation-1700574849932-60aa9efd81ea1-060c7b3a-c5cb6cc2 failed (RESOURCE_IN_USE_BY_ANOTHER_RESOURCE): The disk resource 'projects/tall-hippo-17895/zones/us-central1-a/disks/pod-references' is already being used by 'projects/tall-hippo-17895/zones/us-central1-a/instances/gke-gke-processing-cl-gke-16core-pool-1b0cdaa1-d77l'

Environment

Additional context

I looked through the Nextflow source code and think I see the issue. The readOnly option is implemented for when the pod gets created here, but the option is not handled when the PodOptions are created here. The unit tests reflect this.

bentsherman commented 11 months ago

This looks like a typo:

mountPath: "/references, readOnly: true

Although, I feel like that would cause a syntax error. In any case, I can't find any issue with the existing logic. The code you referenced is for the k8s.pod config option, but you are using the pod process directive. The logic for readOnly is implemented in PodOptions, PodSpecBuilder, and tested here.

peter-fsdx commented 11 months ago

Thanks @bentsherman. That is a typo, but one I introduced while removing an internal product name from the config; my original config does not have it. Just to be sure, I reintroduced this typo and Nextflow immediately returned a compilation error, with no pods created on the cluster. Sorry for the messy code in the issue.

I was under the impression that the K8sConfig is where the pod option model you linked was instantiated, but perhaps I was reaching for a convenient explanation. Any advice on how I can debug further?

bentsherman commented 11 months ago

If you set k8s.debug.yaml = true in your config, it will write the pod yaml for each task to the task directory, something like .command.yaml. Let's try that and see if at least Nextflow is doing the right thing.
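In nextflow.config that would look something like:

```groovy
// nextflow.config (sketch): dump each task's generated pod spec
// to .command.yaml in the task work directory for inspection
k8s {
    debug {
        yaml = true
    }
}
```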

peter-fsdx commented 11 months ago

Thanks for the suggestion. I am seeing a discrepancy between the .command.yaml (volumes are labelled as readOnly: true) and the pod description from the cluster (ReadOnly: false). Below are the outputs for these two things, again slightly sanitized.

.command.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: nf-c981291fa88a8cdec8e993ee0dc5d653
  namespace: default
  labels: {nextflow.io/processName: GENOTYPE_BY_TILE, nextflow.io/runName: friendly_goldstine,
    nextflow.io/sessionId: uuid-60294eb0-35c6-454e-ae27-43ebc94b8aba, nextflow.io/app: nextflow,
    nextflow.io/taskName: GENOTYPE_BY_TILE_23}
spec:
  restartPolicy: Never
  containers:
  - name: nf-c981291fa88a8cdec8e993ee0dc5d653
    image: wave.seqera.io/wt/30afb4f246ff/tall-hippo-17895/fsdx-docker-dev/python:3.10.9-slim-bullseye
    args: [/usr/bin/fusion, bash, /fusion/gs/bucket_name/scratch/nf-scratch/c9/81291fa88a8cdec8e993ee0dc5d653/.command.run]
    securityContext: {privileged: true}
    env:
    - {name: FUSION_WORK, value: /fusion/gs/bucket_name/scratch/nf-scratch/c9/81291fa88a8cdec8e993ee0dc5d653}
    - {name: FUSION_TAGS, value: '[.command.*|.exitcode|.fusion.*](nextflow.io/metadata=true),[*](nextflow.io/temporary=true)'}
    resources:
      requests: {cpu: 1}
    volumeMounts:
    - {name: vol-95, mountPath: /mnt/references, readOnly: true}
    - {name: vol-96, mountPath: /gnomad, readOnly: true}
  nodeSelector: {node.kubernetes.io/instance-type: n2-standard-16}
  serviceAccountName: default
  volumes:
  - name: vol-95
    persistentVolumeClaim: {claimName: references-pvc}
  - name: vol-96
    persistentVolumeClaim: {claimName: gnomad-pvc}

kubectl describe po nf-c981291fa88a8cdec8e993ee0dc5d653:

Volumes:
  vol-95:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  references-pvc
    ReadOnly:   false
  vol-96:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  gnomad-pvc
    ReadOnly:   false

To make sure there isn't something funky with my cluster that I missed, I created pods with the snippet below to confirm that the volume in question can be mounted as readOnly. It works as expected:

      persistentVolumeClaim:
        claimName: references-pvc
        readOnly: true
bentsherman commented 11 months ago

Okay, I see that the readOnly option is defined in both the volume spec and the pod container spec. I'm surprised this hasn't come up before. I guess we can just set it in both places in Nextflow.

If you set readOnly in the volume spec but not the pod container spec, is it applied correctly?
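To make the "set it in both places" idea concrete, here is a minimal sketch (plain Python dicts, not Nextflow's actual Groovy code) of propagating readOnly from a container's volumeMounts to the matching persistentVolumeClaim entry in the pod spec:

```python
# Sketch only: given a pod spec as nested dicts, copy readOnly from each
# container volumeMount to the matching persistentVolumeClaim volume,
# since GCE PD requires the flag at both the mount and attachment level.

def propagate_read_only(pod_spec):
    # collect the names of all volumes mounted read-only by any container
    read_only_mounts = {
        mount["name"]
        for container in pod_spec["spec"].get("containers", [])
        for mount in container.get("volumeMounts", [])
        if mount.get("readOnly")
    }
    # mark the corresponding PVC-backed volumes as read-only too
    for vol in pod_spec["spec"].get("volumes", []):
        pvc = vol.get("persistentVolumeClaim")
        if pvc is not None and vol["name"] in read_only_mounts:
            pvc["readOnly"] = True
    return pod_spec

pod = {
    "spec": {
        "containers": [{
            "name": "task",
            "volumeMounts": [
                {"name": "vol-95", "mountPath": "/mnt/references", "readOnly": True},
            ],
        }],
        "volumes": [
            {"name": "vol-95", "persistentVolumeClaim": {"claimName": "references-pvc"}},
        ],
    }
}
propagate_read_only(pod)
print(pod["spec"]["volumes"][0]["persistentVolumeClaim"])
# {'claimName': 'references-pvc', 'readOnly': True}
```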

peter-fsdx commented 11 months ago

It appears the volume spec and the pod container spec govern different behavior. According to a Stack Overflow answer, they govern disk attachment and disk mounting, respectively, for the pod. I didn't realize this and had been fixating on the volume spec.

Looking into this I found a gap in my testing (when testing pods outside of Nextflow, I was not explicitly forcing them onto different nodes). When I create pods on different nodes outside of Nextflow, with readOnly set to true in both the pod and volume specs, I can recreate the error above. It's embarrassing, but I have no reason to think this is a Nextflow issue rather than an infrastructure issue.

Sorry to take your time and thank you for your help

peter-fsdx commented 11 months ago

@bentsherman I can confirm I have set things up incorrectly on my end, and need to reconfigure the infrastructure before I can test this.

It does look like, for GCE PD volume mounts specifically, both the pod and volume specifications must be read-only: https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/readonlymany-disks#use-in-pod
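Per that page, the pod spec would need readOnly in both places, something like the following sketch (the volume and container names are illustrative; only references-pvc comes from my setup):

```yaml
# Sketch: readOnly declared at both the mount and the attachment level
spec:
  containers:
  - name: task
    volumeMounts:
    - name: references
      mountPath: /mnt/references
      readOnly: true          # mount-level read-only
  volumes:
  - name: references
    persistentVolumeClaim:
      claimName: references-pvc
      readOnly: true          # attachment-level read-only (required for GCE PD)
```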

When I have a working test case I will reopen this ticket if needed. Thanks again

bentsherman commented 11 months ago

No problem, even if it was an infra issue on your side, I think Nextflow might need to set readOnly in both places to be safe. The GCE PD is a good motivating example. I will draft a PR for this, then you can use it in your testing if you want.

bentsherman commented 11 months ago

Although, it is possible to mount the same volume multiple times in different ways. If a pod mounts a volume in multiple ways and only one of the mounts is read-only, I'm not sure it makes sense to set the volume spec itself to read-only. Maybe it can only be handled at the infra level.

peter-fsdx commented 11 months ago

So, I wanted to circle back on this. I fixed the problem with the persistent volumes on my end, and the current release of Nextflow is behaving as expected. This isn't what I would expect from the GCE documentation, but it looks like there are no issues for my use case.