Running the job within the same container on a host outside of GCP also completes without hitting the error (much like running the job outside of Docker did). Commands:
docker run --rm -it -v /home/eczech/repos/ukb-gwas-pipeline-nealelab:/tmp/ukb-gwas-pipeline-nealelab snakemake/snakemake:v5.22.1 /bin/bash
cd /tmp/ukb-gwas-pipeline-nealelab
snakemake rs-ukb/prep/gt-imputation/ukb_chr19.ckpt --snakefile Snakefile --force -j --keep-target-files --keep-remote --latency-wait 5 --attempt 1 --force-use-threads --wrapper-prefix https://github.com/snakemake/snakemake-wrappers/raw/ --allowed-rules bgen_to_zarr --nocolor --notemp --no-hooks --nolock --use-conda --default-remote-provider GS --default-remote-prefix rs-ukb
Building DAG of jobs...
Creating conda environment envs/gwas.yaml...
Downloading and installing remote packages.
Environment for envs/gwas.yaml created (location: .snakemake/conda/891b36f5)
Using shell: /bin/bash
Provided cores: 4
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 bgen_to_zarr
1
[Mon Nov 30 19:08:01 2020]
rule bgen_to_zarr:
input: rs-ukb/raw/gt-imputation/ukb_imp_chr19_v3.bgen, rs-ukb/raw/gt-imputation/ukb_mfi_chr19_v3.txt, rs-ukb/raw/gt-imputation/ukb59384_imp_chr19_v3_s487296.sample
output: rs-ukb/prep/gt-imputation/ukb_chr19.ckpt
jobid: 0
wildcards: bgen_contig=19
threads: 4
resources: mem_mb=117964
Downloading from remote: rs-ukb/raw/gt-imputation/ukb59384_imp_chr19_v3_s487296.sample
Finished download.
Downloading from remote: rs-ukb/raw/gt-imputation/ukb_mfi_chr19_v3.txt
Finished download.
Downloading from remote: rs-ukb/raw/gt-imputation/ukb_imp_chr19_v3.bgen
Finished download.
Activating conda environment: /tmp/ukb-gwas-pipeline-nealelab/.snakemake/conda/891b36f5
2020-11-30 19:22:57,217 | __main__ | INFO | Loading BGEN dataset for contig Contig(name=19, index=18) from rs-ukb/raw/gt-imputation/ukb_imp_chr19_v3.bgen (chunks = (250, -1))
Oddly, I also didn't get the error today when running the job on a GKE cluster with a single 8 vCPU / 32 G node.
It also works on a single-node 32 vCPU / 128 G GKE cluster. That same setup failed in several different clusters before, so I have no idea what changed -- perhaps a minor version bump in an unpinned transitive dependency.
It failed on a 2-node 64 vCPU / 256 G cluster. I'm retrying on a single-node cluster, and I noticed a potentially important difference: on the multi-node clusters, the containers are launched differently:
eric@gke-ukb-io-1-default-pool-e9d05906-7hj4 ~ $ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
6d5aac8dc88d 08c495ee1a04 "/bin/sh -c 'cp -rf …" 5 minutes ago Up 5 minutes k8s_snakejob-e8a0ff28-a3dd-5eee-9c74-b290dcbd2458_snakejob-e8a0ff28-a3dd-5eee-9c74-b290dcbd2458_default_6d545277-6381-4748-beff-138ca49e97b3_0
ae07f83ec045 k8s.gcr.io/pause:3.1 "/pause" 5 minutes ago Up 5 minutes k8s_POD_snakejob-e8a0ff28-a3dd-5eee-9c74-b290dcbd2458_default_6d545277-6381-4748-beff-138ca49e97b3_0
On the single node cluster, the snakemake image is run directly instead:
eric@gke-ukb-io-1-default-pool-850dc7e9-h7pn ~ $ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
598c384ae5b9 snakemake/snakemake "/bin/sh -c 'cp -rf …" 2 minutes ago Up 2 minutes k8s_snakejob-7102e7f0-e4ea-52a1-bfc2-b365203aa7ad_snakejob-7102e7f0-e4ea-52a1-bfc2-b365203aa7ad_default_58bedd5e-4482-45e6-ab96-a30e5e6f72c0_0
9efd6032ffdc k8s.gcr.io/pause:3.1 "/pause" 3 minutes ago Up 3 minutes k8s_POD_snakejob-7102e7f0-e4ea-52a1-bfc2-b365203aa7ad_default_58bedd5e-4482-45e6-ab96-a30e5e6f72c0_0
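One way to check whether the bare image ID in the multi-node output still refers to the same snakemake image (rather than something else injected by GKE) might be to inspect the container on the host; this is just a suggestion, using the container ID from the docker ps output above:
# Print the image reference the container was created from and the resolved image ID;
# they should line up with snakemake/snakemake if only the tagging differs.
docker inspect --format '{{.Config.Image}} {{.Image}}' 6d5aac8dc88d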
I saw a system message like this in the run that failed:
[Mon Nov 30 21:29:45 2020] systemd-journald[505]: File /var/log/journal/cc5fc13313e37fce36c17c5d140fb355/system.journal corrupted or uncleanly shut down, renaming and replacing.
I'm not sure what that means, but if it is somehow related to disk utilization, this is how volumes are mounted on the host and in the container, and how they fill as the file downloads on a single-node cluster:
on host:
eric@gke-ukb-io-1-default-pool-850dc7e9-h7pn ~ $ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/root 2.0G 943M 1018M 49% /
devtmpfs 126G 0 126G 0% /dev
tmpfs 126G 0 126G 0% /dev/shm
tmpfs 126G 2.0M 126G 1% /run
tmpfs 126G 0 126G 0% /sys/fs/cgroup
tmpfs 126G 164K 126G 1% /etc/machine-id
tmpfs 256K 0 256K 0% /mnt/disks
tmpfs 126G 0 126G 0% /tmp
overlayfs 126G 164K 126G 1% /etc
/dev/sda8 12M 28K 12M 1% /usr/share/oem
/dev/sda1 193G 50G 143G 26% /mnt/stateful_partition # --> this volume increases in usage as file downloads
tmpfs 1.0M 132K 892K 13% /var/lib/cloud
within container:
(snakemake) root@snakejob-7102e7f0-e4ea-52a1-bfc2-b365203aa7ad:/workdir# df -h
Filesystem Size Used Avail Use% Mounted on
overlay 193G 57G 137G 30% / # --> root volume increases in usage
tmpfs 64M 0 64M 0% /dev
tmpfs 126G 0 126G 0% /sys/fs/cgroup
tmpfs 126G 176K 126G 1% /source
/dev/sda1 193G 57G 137G 30% /etc/hosts
shm 64M 0 64M 0% /dev/shm
tmpfs 126G 12K 126G 1% /run/secrets/kubernetes.io/serviceaccount
tmpfs 126G 0 126G 0% /proc/acpi
tmpfs 126G 0 126G 0% /proc/scsi
tmpfs 126G 0 126G 0% /sys/firmware
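Since GCP's monitoring doesn't seem to capture the fill and the ssh session can die along with the node, a crude sketch for catching it in the act would be to log the stateful partition usage on the host in a loop (run it under nohup or in a tmux session so it survives a dropped connection):
# Log /mnt/stateful_partition usage every 30 seconds while the download runs.
while true; do
  echo "$(date -u +%FT%TZ) $(df -h /mnt/stateful_partition | tail -n 1)" >> /tmp/disk_usage.log
  sleep 30
done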
It also failed on a single-node 64 vCPU cluster. Around the time the failure occurred, my ssh session terminated. I was able to log in again after a couple of minutes, though, and saw disk utilization like this:
eric@gke-ukb-io-1-default-pool-850dc7e9-h7pn ~ $ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/root 2.0G 943M 1018M 49% /
devtmpfs 126G 0 126G 0% /dev
tmpfs 126G 0 126G 0% /dev/shm
tmpfs 126G 2.0M 126G 1% /run
tmpfs 126G 0 126G 0% /sys/fs/cgroup
tmpfs 126G 164K 126G 1% /etc/machine-id
tmpfs 256K 0 256K 0% /mnt/disks
tmpfs 126G 0 126G 0% /tmp
overlayfs 126G 164K 126G 1% /etc
/dev/sda8 12M 28K 12M 1% /usr/share/oem
/dev/sda1 193G 57G 137G 30% /mnt/stateful_partition
tmpfs 1.0M 132K 892K 13% /var/lib/cloud
I also saw this message in the system logs again:
[Mon Nov 30 21:59:23 2020] systemd-journald[506]: File /var/log/journal/760db0e802d549c8fdaec891295a9bdc/system.journal corrupted or uncleanly shut down, renaming and replacing.
Between that and the ssh behavior, I'm becoming more certain that the disk is being filled, even though the disk utilization tracked by GCP monitoring shows no corresponding spike.
The bgen file being downloaded is ~50G, so snakemake must be duplicating it somewhere, and I was severely underestimating how much disk space is taken up by GKE and the OS. I attached 200G but will try something larger.
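A quick way to test the duplication hypothesis (purely a suggestion) would be to list every large file visible inside the job container while the download is in flight; two ~50G copies of the bgen showing up would confirm it:
# Inside the job container, while the download is running: list any file larger
# than 10G on the container's writable layer (the bgen is ~50G, so a second copy
# would be obvious).
find / -xdev -type f -size +10G -exec ls -lh {} \; 2>/dev/null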
That appears to be it. Increasing the disk size works, though I still don't see how 200G wasn't enough just for the download. Here is the utilization after successful downloads on a 2-node 64 vCPU cluster (GCP monitoring again shows no spike):
# host
eric@gke-ukb-io-1-default-pool-300b9124-j8mz ~ $ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/root 2.0G 943M 1018M 49% /
devtmpfs 126G 0 126G 0% /dev
tmpfs 126G 0 126G 0% /dev/shm
tmpfs 126G 1.5M 126G 1% /run
tmpfs 126G 0 126G 0% /sys/fs/cgroup
tmpfs 126G 164K 126G 1% /etc/machine-id
tmpfs 126G 0 126G 0% /tmp
tmpfs 256K 0 256K 0% /mnt/disks
overlayfs 126G 164K 126G 1% /etc
/dev/sda8 12M 28K 12M 1% /usr/share/oem
/dev/sda1 292G 66G 226G 23% /mnt/stateful_partition
tmpfs 1.0M 132K 892K 13% /var/lib/cloud
# guest
(snakemake) root@snakejob-b4255d2d-2121-5536-bd97-35589c54dd80:/workdir# df -h
Filesystem Size Used Avail Use% Mounted on
overlay 292G 68G 224G 24% /
tmpfs 64M 0 64M 0% /dev
tmpfs 126G 0 126G 0% /sys/fs/cgroup
tmpfs 126G 176K 126G 1% /source
/dev/sda1 292G 68G 224G 24% /etc/hosts
shm 64M 12K 64M 1% /dev/shm
tmpfs 126G 12K 126G 1% /run/secrets/kubernetes.io/serviceaccount
tmpfs 126G 0 126G 0% /proc/acpi
tmpfs 126G 0 126G 0% /proc/scsi
tmpfs 126G 0 126G 0% /sys/firmware
Disk size was apparently the issue. I don't know how ~60G of the allocated disk ends up used by kubernetes/docker, but the containers see only ~140G of free space on a 200G disk. I was never able to verify that snakemake duplicates the downloaded file; that remains the most likely explanation, though, and it might be worth digging through the code at some point to confirm it.
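If it's ever worth pinning down where that ~60G of overhead goes, something along these lines on the node might break it down (the paths are assumptions for a GKE/COS host, so treat this as a sketch):
docker system df                                                          # image/container/volume usage as seen by Docker
sudo du -sh /var/lib/docker /var/lib/kubelet /var/log/pods 2>/dev/null    # Docker storage, kubelet pod data, container logs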
Maybe it wasn't that simple... I saw this error again today in a job processing chromosome 4 (160.29 GiB) data on a machine with a 1TB disk. The larger chromosome 1 and 2 files (~185 GiB) ran on the same machine type without this issue. Again it was a 255 exit code with no logs after the download starts.
The bug documented here occurs only in GKE and only when the files snakemake is downloading are sufficiently large: https://github.com/related-sciences/ukb-gwas-pipeline-nealelab/issues/20#issuecomment-732383672