Open vsoch opened 1 year ago
That sounds like an issue with implicit directories:
Oh, great! I'll read up and test this out again tomorrow - will post an update (and hopefully be able to close the issue). Thank you!
Okay this is great - making progress! I changed the working directory to be exactly where the workflow is, and then when I do a listing I see the contents!
The working directory is /workflow/snakemake-workflow, contents include:
```
Dockerfile  README.md  Snakefile  environment.yaml
```
And then I got a permissions error (still progress!):
```
Traceback (most recent call last):
  File "/opt/micromamba/envs/snakemake/bin/snakemake", line 10, in <module>
    sys.exit(main())
  File "/opt/micromamba/envs/snakemake/lib/python3.10/site-packages/snakemake/__init__.py", line 2945, in main
    success = snakemake(
  File "/opt/micromamba/envs/snakemake/lib/python3.10/site-packages/snakemake/__init__.py", line 563, in snakemake
    logger.setup_logfile()
  File "/opt/micromamba/envs/snakemake/lib/python3.10/site-packages/snakemake/logging.py", line 307, in setup_logfile
    os.makedirs(os.path.join(".snakemake", "log"), exist_ok=True)
  File "/opt/micromamba/envs/snakemake/lib/python3.10/os.py", line 215, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "/opt/micromamba/envs/snakemake/lib/python3.10/os.py", line 225, in makedirs
    mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '.snakemake'
```
Snakemake is trying to create a `.snakemake` directory in the present working directory.
I tried setting the file/directory modes so that anyone could read and write, but that didn't work (same error), e.g.:
```yaml
# Read/write access for all users
gcs.csi.ofek.dev/dir-mode: "0777"
gcs.csi.ofek.dev/file-mode: "0777"
```
It looks like root has a strange id (the default that the storage uses)
🔒️ Working directory permissions:
```
total 3
-rw-rw-r-- 1 root 63147  233 Feb 10 22:57 Dockerfile
-rw-rw-r-- 1 root 63147  347 Feb 10 22:57 README.md
-rw-rw-r-- 1 root 63147 1144 Feb 10 22:57 Snakefile
-rw-rw-r-- 1 root 63147  203 Feb 10 22:57 environment.yaml
```
Although when I tried to change that to 0 or to the user id, the mount stopped working entirely, so I won't mess with that for now. So I double-checked the user that needs to run the workflow:
```
uid=1000(flux) gid=1000(flux) groups=1000(flux)
```
And then tried:
```yaml
...
gcs.csi.ofek.dev/gid: "1000"
gcs.csi.ofek.dev/uid: "1000"
gcs.csi.ofek.dev/dir-mode: "0755"
gcs.csi.ofek.dev/file-mode: "0664"
```
And then based on this issue I decided to try adding the implicit-dirs flag:
```yaml
gcs.csi.ofek.dev/gid: "1000"
gcs.csi.ofek.dev/uid: "1000"
gcs.csi.ofek.dev/dir-mode: "0755"
gcs.csi.ofek.dev/file-mode: "0664"
implicit-dirs: "true"
```
Neither of those worked. I also don't think I'm allowed to change the gid/uid at all, because the PVC then fails to provision:
```
Warning ProvisioningFailed 47s gcs.csi.ofek.dev_gke-flux-cluster-default-pool-3f21ee47-pt36_9451e781-d56c-4169-8591-879acc52e19f failed to provision volume with StorageClass "csi-gcs": rpc error: code = Internal desc = Failed to set bucket capacity: googleapi: Error 403: Access denied., forbidden
```
Do you have a suggestion for what I should try? In a nutshell: the container starts as root, and we do that to set things up. The working directory of the run is the mounted directory. When the workflow is run, it's run by a "flux" user (on behalf of root). So I assume what is happening is that flux doesn't have permission to write there, but I don't totally understand why, because if I set permissions to 0777 for files/directories I'd expect anyone could write there.
Also, heads up: the fuse "mount options" link here 404s: https://ofek.dev/csi-gcs/dynamic_provisioning/#extra-flags.
Update: opened a PR with a quick fix https://github.com/ofek/csi-gcs/pull/156
And I really love being able to define these as annotations! At least for my operator, the user is in control of annotations (in the custom resource definition) and it's nice I don't have to edit / redeploy my operator every time to try something new.
Update: also tried derivations of:
```yaml
gcs.csi.ofek.dev/fuse-mount-options: "rw,allow_other,file_mode=777,dir_mode=777"
gcs.csi.ofek.dev/fuse-mount-options: "rw,allow_other,file_mode=777,dir_mode=777,uid=1000,gid=1000"
```
No luck yet, going to bring the cluster down for today and looking forward to hearing your feedback!
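One more idea for next time the cluster is up: the bare `implicit-dirs` key above may simply be ignored, so I want to try spelling it with the driver prefix like the other keys. To be clear, this exact annotation name is my guess, extrapolated from the other keys, not something I've confirmed in the docs:

```yaml
gcs.csi.ofek.dev/gid: "1000"
gcs.csi.ofek.dev/uid: "1000"
# Guessing the gcsfuse flag also takes the driver prefix:
gcs.csi.ofek.dev/implicit-dirs: "true"
```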
I tried running the workflow as root, and it looks like the permissions issue is gone, but it doesn't see any of the data in the subdirectories (nor does it see the subdirectories themselves). I tried doing an "ls" so they would show up, and I also set implicit-dirs to true; neither made a difference.
```
broker.info[0]: quorum-full: quorum->run 0.430339s
Building DAG of jobs...
MissingInputException in rule bwa_map in file /workflow/snakemake-workflow/Snakefile, line 9:
Missing input files for rule bwa_map:
    output: mapped_reads/A.bam
    wildcards: sample=A
    affected files:
        data/samples/A.fastq
        data/genome.fa
```
Did you try setting the fsGroup? https://kubernetes.io/docs/tasks/configure-pod-container/security-context/#set-the-security-context-for-a-pod
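Something like this on the pod spec (a sketch with placeholder names; whether fsGroup is applied to the mount also depends on the CSI driver's fsGroup support):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: workflow-pod            # placeholder name
spec:
  securityContext:
    # Volumes are made group-accessible to this GID, so the
    # flux user (gid 1000) should be able to read/write them
    fsGroup: 1000
  containers:
    - name: workflow
      image: my-workflow:latest # placeholder image
      volumeMounts:
        - name: workflow
          mountPath: /workflow
  volumes:
    - name: workflow
      persistentVolumeClaim:
        claimName: workflow-pvc # placeholder claim name
```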
Interesting - I can try that for the latter case (running as root), but I'm afraid that if I change it to the flux user, root will no longer be able to write files to the config map locations (root sets things up for the workflow).
Still no go. I've tried both derivations (things owned by the flux user, and by root), and the closest I can get is having root own and run everything:
🔒️ Working directory permissions:
```
total 3
-rw-rw-r-- 1 root 63147  233 Feb 10 22:57 Dockerfile
-rw-rw-r-- 1 root 63147  347 Feb 10 22:57 README.md
-rw-rw-r-- 1 root 63147 1144 Feb 10 22:57 Snakefile
-rw-rw-r-- 1 root 63147  203 Feb 10 22:57 environment.yaml
```
but even then I'm not able to see the subdirectory; it's as if it doesn't exist, so the workflow fails:
```
broker.info[0]: quorum-full: quorum->run 9.18322s
Building DAG of jobs...
MissingInputException in rule bwa_map in file /workflow/snakemake-workflow/Snakefile, line 9:
Missing input files for rule bwa_map:
    output: mapped_reads/A.bam
    wildcards: sample=A
    affected files:
        data/samples/A.fastq
        data/genome.fa
broker.err[0]: rc2.0: flux mini submit -n 1 --quiet --watch snakemake --cores 1 --flux Exited (rc=1) 1.8s
```
Where can I ask for more help on this?
Hiya! I have been trying this for a few days and have reached the point where I thought I'd ask for help. I basically have an operator that sets up this driver to mount an existing Google Storage bucket, and everything seems to be working, but when I list the contents of the directory (that should be bound) I don't see anything from the storage. I'll try to walk through what I can see carefully so you can help (and maybe this will help me debug a bit too!).
Bucket
I have files for a Snakemake workflow in a subdirectory at the root of a bucket - I'm assuming that mounting the root of the bucket would let me see the subdirectory too? E.g.,
and in that directory:
Although that's probably not important yet, because I can't even ls at the root to see the subdirectory. I'm wondering if permissions have something to do with it - e.g., I see these options:
But I haven't done anything like making everything public, because I've given the service account associated with the secret the Storage Admin and Storage Object Admin roles. Okay - so that's the storage bucket!
Secret
I created the service account with the above permissions, and followed instructions to generate the secret, e.g., a derivative of
One thing that I wasn't sure about in the instructions is when it says:
I added this as one of the roles:
but I'm not sure what encryption key this is talking about (and maybe this is the bug?). I couldn't figure out what else I was supposed to do from the getting started guide.
PVC and PV
My PVC and PV look okay? Here are the configs - these are created in Go and I'm showing the kubectl output as yaml, so some of the settings here are defaults.
What sticks out to me as maybe erroneous is that although I set the capacity to 25Gi, the spec resources -> requests is for 1Ki?
I'm actually a bit confused about this resource request, because in my code I set this to the same value as the capacity above, which should be 25:
If that is somehow not being set - where do I set it? Is there an annotation I should be using, and regardless, could that be the bug that the resource request is too small?
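For reference, this is roughly what I'd expect the PVC to look like with the request actually set (a sketch with placeholder names, since my real config is generated in Go):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: workflow-pvc                     # placeholder name
  annotations:
    gcs.csi.ofek.dev/bucket: my-bucket   # placeholder bucket name
spec:
  storageClassName: csi-gcs
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 25Gi   # what I set in Go; kubectl shows 1Ki instead
```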
For my PV, it also looks OK:
Note that a MiniCluster is a CRD with an indexed job, a few config maps, etc. It's what creates the indexed job. Should that parent attribute be something else? I can also shell into one of the worker containers (one that doesn't exit and fail, because it's reliant on the main broker in the indexed job), and I see the volume at /workflow, but it's empty. And here they are listed:
And what a pod (for the indexed job) sees:
Note that the mount looks ok (/workflow should be read/write from data)!
And I know that (from the volume standpoint) there are no errors, because the indexed job runs, and the main issue is that it can't find the data files.
So - I think there might be an issue with permissions, missing metadata somewhere (perhaps explaining that weird size?), or something to do with an encryption key that I'm missing instructions for. Any help you can provide would be greatly appreciated! I've brought up my testing cluster a few times in the last couple of days, and I'm trying to find other examples online, but I've reached the point where I'm not sure what to try next (and I hope you have some ideas).