related-sciences / ukb-gwas-pipeline-nealelab

Pipeline for reproduction of NealeLab 2018 UKB GWAS

Debug snakemake file download failure #25

Closed eric-czech closed 3 years ago

eric-czech commented 3 years ago

The bug documented here occurs only in GKE and only when the files snakemake is downloading are sufficiently large: https://github.com/related-sciences/ukb-gwas-pipeline-nealelab/issues/20#issuecomment-732383672
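
A quick way to isolate this would be to pull the same object with gsutil, outside of snakemake, from a node in the cluster. A minimal check, assuming gsutil is installed and authenticated there (the destination path is arbitrary):

# download the ~50G bgen directly and check whether it completes and has the expected size
gsutil cp gs://rs-ukb/raw/gt-imputation/ukb_imp_chr19_v3.bgen /tmp/ukb_imp_chr19_v3.bgen
ls -lh /tmp/ukb_imp_chr19_v3.bgen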

eric-czech commented 3 years ago

Running the job within the same container on a host outside of GCP also completes without the error (much like running the job outside of Docker). Commands:

docker run --rm -it -v /home/eczech/repos/ukb-gwas-pipeline-nealelab:/tmp/ukb-gwas-pipeline-nealelab snakemake/snakemake:v5.22.1 /bin/bash

cd /tmp/ukb-gwas-pipeline-nealelab 
snakemake rs-ukb/prep/gt-imputation/ukb_chr19.ckpt --snakefile Snakefile --force -j --keep-target-files --keep-remote --latency-wait 5 --attempt 1 --force-use-threads --wrapper-prefix https://github.com/snakemake/snakemake-wrappers/raw/ --allowed-rules bgen_to_zarr --nocolor --notemp --no-hooks --nolock --use-conda --default-remote-provider GS --default-remote-prefix rs-ukb

Building DAG of jobs...
Creating conda environment envs/gwas.yaml...
Downloading and installing remote packages.
Environment for envs/gwas.yaml created (location: .snakemake/conda/891b36f5)
Using shell: /bin/bash
Provided cores: 4
Rules claiming more threads will be scaled down.
Job counts:
    count   jobs
    1   bgen_to_zarr
    1

[Mon Nov 30 19:08:01 2020]
rule bgen_to_zarr:
    input: rs-ukb/raw/gt-imputation/ukb_imp_chr19_v3.bgen, rs-ukb/raw/gt-imputation/ukb_mfi_chr19_v3.txt, rs-ukb/raw/gt-imputation/ukb59384_imp_chr19_v3_s487296.sample
    output: rs-ukb/prep/gt-imputation/ukb_chr19.ckpt
    jobid: 0
    wildcards: bgen_contig=19
    threads: 4
    resources: mem_mb=117964

Downloading from remote: rs-ukb/raw/gt-imputation/ukb59384_imp_chr19_v3_s487296.sample
Finished download.
Downloading from remote: rs-ukb/raw/gt-imputation/ukb_mfi_chr19_v3.txt
Finished download.
Downloading from remote: rs-ukb/raw/gt-imputation/ukb_imp_chr19_v3.bgen
Finished download.
Activating conda environment: /tmp/ukb-gwas-pipeline-nealelab/.snakemake/conda/891b36f5
2020-11-30 19:22:57,217 | __main__ | INFO | Loading BGEN dataset for contig Contig(name=19, index=18) from rs-ukb/raw/gt-imputation/ukb_imp_chr19_v3.bgen (chunks = (250, -1))
eric-czech commented 3 years ago

Oddly, I also didn't get the error today when running the job on a GKE cluster with a single 8 vCPU / 32 G node.

eric-czech commented 3 years ago

And it works on a 32 vCPU / 128 G single-node GKE cluster. That same setup failed in several different clusters before, so I have no idea what changed -- perhaps a minor version bump in an unpinned transitive dependency.
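
If an unpinned dependency really is the culprit, snapshotting the solved conda environment on a passing and a failing run would make the difference visible. A minimal sketch, assuming the environment prefix from the log above (the snapshot filenames are arbitrary):

# export the exact package versions resolved for envs/gwas.yaml
conda env export -p .snakemake/conda/891b36f5 > gwas-env-snapshot.yaml
# diff a snapshot from a passing cluster against one from a failing cluster
diff gwas-env-passing.yaml gwas-env-failing.yaml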

eric-czech commented 3 years ago

It failed on a 2-node 64 vCPU / 256 G cluster. I'm trying again on a single-node cluster, and I noticed a potentially important difference: on the multi-node clusters, the containers are launched differently:

eric@gke-ukb-io-1-default-pool-e9d05906-7hj4 ~ $ docker ps
CONTAINER ID        IMAGE                                                  COMMAND                  CREATED             STATUS              PORTS               NAMES
6d5aac8dc88d        08c495ee1a04                                           "/bin/sh -c 'cp -rf …"   5 minutes ago       Up 5 minutes                            k8s_snakejob-e8a0ff28-a3dd-5eee-9c74-b290dcbd2458_snakejob-e8a0ff28-a3dd-5eee-9c74-b290dcbd2458_default_6d545277-6381-4748-beff-138ca49e97b3_0
ae07f83ec045        k8s.gcr.io/pause:3.1                                   "/pause"                 5 minutes ago       Up 5 minutes                            k8s_POD_snakejob-e8a0ff28-a3dd-5eee-9c74-b290dcbd2458_default_6d545277-6381-4748-beff-138ca49e97b3_0

On the single node cluster, the snakemake image is run directly instead:

eric@gke-ukb-io-1-default-pool-850dc7e9-h7pn ~ $ docker ps
CONTAINER ID        IMAGE                                                  COMMAND                  CREATED             STATUS              PORTS               NAMES
598c384ae5b9        snakemake/snakemake                                    "/bin/sh -c 'cp -rf …"   2 minutes ago       Up 2 minutes                            k8s_snakejob-7102e7f0-e4ea-52a1-bfc2-b365203aa7ad_snakejob-7102e7f0-e4ea-52a1-bfc2-b365203aa7ad_default_58bedd5e-4482-45e6-ab96-a30e5e6f72c0_0
9efd6032ffdc        k8s.gcr.io/pause:3.1                                   "/pause"                 3 minutes ago       Up 3 minutes                            k8s_POD_snakejob-7102e7f0-e4ea-52a1-bfc2-b365203aa7ad_default_58bedd5e-4482-45e6-ab96-a30e5e6f72c0_0
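
To confirm that the multi-node clusters are actually running a different image (rather than just displaying an image ID where the single-node cluster shows a tag), the pod spec and the image can be checked directly. A rough sketch, assuming kubectl access to the cluster; the pod name is taken from the container name above:

# image requested in the pod spec
kubectl get pod snakejob-e8a0ff28-a3dd-5eee-9c74-b290dcbd2458 -o jsonpath='{.spec.containers[*].image}'
# resolve the bare image ID shown by docker ps back to its repo tags on the node
docker inspect --format '{{.RepoTags}}' 08c495ee1a04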
eric-czech commented 3 years ago

I saw a system message like this in the run that failed:

[Mon Nov 30 21:29:45 2020] systemd-journald[505]: File /var/log/journal/cc5fc13313e37fce36c17c5d140fb355/system.journal corrupted or uncleanly shut down, renaming and replacing.

I'm not sure what that means, but if it is somehow related to disk utilization, this is how volumes are mounted on the host and in the container, and how they fill as the file downloads on a single-node cluster:

on host:

eric@gke-ukb-io-1-default-pool-850dc7e9-h7pn ~ $ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/root       2.0G  943M 1018M  49% /
devtmpfs        126G     0  126G   0% /dev
tmpfs           126G     0  126G   0% /dev/shm
tmpfs           126G  2.0M  126G   1% /run
tmpfs           126G     0  126G   0% /sys/fs/cgroup
tmpfs           126G  164K  126G   1% /etc/machine-id
tmpfs           256K     0  256K   0% /mnt/disks
tmpfs           126G     0  126G   0% /tmp
overlayfs       126G  164K  126G   1% /etc
/dev/sda8        12M   28K   12M   1% /usr/share/oem
/dev/sda1       193G   50G  143G  26% /mnt/stateful_partition # --> this volume increases in usage as file downloads 
tmpfs           1.0M  132K  892K  13% /var/lib/cloud

within container:

(snakemake) root@snakejob-7102e7f0-e4ea-52a1-bfc2-b365203aa7ad:/workdir# df -h
Filesystem      Size  Used Avail Use% Mounted on
overlay         193G   57G  137G  30% /            # --> root volume increases in usage 
tmpfs            64M     0   64M   0% /dev
tmpfs           126G     0  126G   0% /sys/fs/cgroup
tmpfs           126G  176K  126G   1% /source
/dev/sda1       193G   57G  137G  30% /etc/hosts
shm              64M     0   64M   0% /dev/shm
tmpfs           126G   12K  126G   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs           126G     0  126G   0% /proc/acpi
tmpfs           126G     0  126G   0% /proc/scsi
tmpfs           126G     0  126G   0% /sys/firmware
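
To confirm that the stateful partition is what fills up during the download, a simple poll on the host should capture the trend even if GCP monitoring misses it. A minimal sketch (the interval and log path are arbitrary):

# poll host disk usage every 30s while the bgen downloads
while true; do
  echo "$(date '+%F %T') $(df -h /mnt/stateful_partition | tail -1)"
  sleep 30
done | tee -a /tmp/disk_usage.log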
eric-czech commented 3 years ago

It also failed on a single-node 64 vCPU cluster. Around the time the failure occurred, my ssh session terminated. I was able to log in again after a couple of minutes, though, and saw disk utilization like this:

eric@gke-ukb-io-1-default-pool-850dc7e9-h7pn ~ $ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/root       2.0G  943M 1018M  49% /
devtmpfs        126G     0  126G   0% /dev
tmpfs           126G     0  126G   0% /dev/shm
tmpfs           126G  2.0M  126G   1% /run
tmpfs           126G     0  126G   0% /sys/fs/cgroup
tmpfs           126G  164K  126G   1% /etc/machine-id
tmpfs           256K     0  256K   0% /mnt/disks
tmpfs           126G     0  126G   0% /tmp
overlayfs       126G  164K  126G   1% /etc
/dev/sda8        12M   28K   12M   1% /usr/share/oem
/dev/sda1       193G   57G  137G  30% /mnt/stateful_partition
tmpfs           1.0M  132K  892K  13% /var/lib/cloud

I also saw this message in the system logs again:

[Mon Nov 30 21:59:23 2020] systemd-journald[506]: File /var/log/journal/760db0e802d549c8fdaec891295a9bdc/system.journal corrupted or uncleanly shut down, renaming and replacing.

Between that and the ssh behavior, I'm becoming more certain that the disk is filling up, even though the utilization tracked by GCP monitoring looks like this:

[Screenshot: GCP disk utilization monitoring, 2020-11-30 5:08 PM]

The bgen file being downloaded is ~50G, so snakemake must be duplicating it, and I was severely underestimating how much disk space GKE and the OS themselves consume. I attached a 200G disk but will try something larger.
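
Back-of-envelope, though, two copies of the file should still fit. A rough sketch of the arithmetic, using approximate figures from the df output above:

# ~193G usable on the 200G disk, ~50G already used by the node before the job starts
echo $(( 193 - 50 ))        # ~143G free at job start
# two copies of the ~50G bgen (the remote download plus a duplicated/staged copy)
echo $(( 143 - 50 - 50 ))   # ~43G headroom left, so something else must also be consuming space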

eric-czech commented 3 years ago

That appears to be it. Increasing the disk size works, though I still don't see how 200G wasn't enough just for the download. Here is the utilization after successful downloads on a 2-node 64 vCPU cluster (GCP monitoring again shows no spike):

# host
eric@gke-ukb-io-1-default-pool-300b9124-j8mz ~ $ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/root       2.0G  943M 1018M  49% /
devtmpfs        126G     0  126G   0% /dev
tmpfs           126G     0  126G   0% /dev/shm
tmpfs           126G  1.5M  126G   1% /run
tmpfs           126G     0  126G   0% /sys/fs/cgroup
tmpfs           126G  164K  126G   1% /etc/machine-id
tmpfs           126G     0  126G   0% /tmp
tmpfs           256K     0  256K   0% /mnt/disks
overlayfs       126G  164K  126G   1% /etc
/dev/sda8        12M   28K   12M   1% /usr/share/oem
/dev/sda1       292G   66G  226G  23% /mnt/stateful_partition
tmpfs           1.0M  132K  892K  13% /var/lib/cloud

# guest 
(snakemake) root@snakejob-b4255d2d-2121-5536-bd97-35589c54dd80:/workdir# df -h
Filesystem      Size  Used Avail Use% Mounted on
overlay         292G   68G  224G  24% /
tmpfs            64M     0   64M   0% /dev
tmpfs           126G     0  126G   0% /sys/fs/cgroup
tmpfs           126G  176K  126G   1% /source
/dev/sda1       292G   68G  224G  24% /etc/hosts
shm              64M   12K   64M   1% /dev/shm
tmpfs           126G   12K  126G   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs           126G     0  126G   0% /proc/acpi
tmpfs           126G     0  126G   0% /proc/scsi
tmpfs           126G     0  126G   0% /sys/firmware
eric-czech commented 3 years ago

Disk size was apparently the issue. I don't know why ~60G of the allocated disk is consumed by kubernetes/docker, but the containers see only ~140G free on a 200G disk. I was never able to verify that snakemake duplicates the downloaded file; that is a likely explanation though, and it might be worth digging through the code at some point to confirm it.
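
To account for where that ~60G goes, the docker and kubelet layers on the node can be inspected directly. A rough sketch, assuming the default GKE/COS paths under the stateful partition:

# space attributed to images, containers, and volumes by docker itself
docker system df
# per-directory breakdown on the stateful partition
sudo du -sh /var/lib/docker /var/lib/kubelet 2>/dev/null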

eric-czech commented 3 years ago

Maybe it wasn't that simple... I saw this error again today in a job processing chromosome 4 data (160.29 GiB) on a machine with a 1TB disk. The larger chromosome 1 and 2 files (~185 GiB) ran on the same machine type without this issue. Here are the logs:

[Screenshot: job logs for the chromosome 4 failure, 2020-12-04 8:07 AM]

Again it was a 255 exit code, with no logs after the download started.
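
With a 255 exit code and nothing in the job log, the remaining traces would be on the kubernetes side. A sketch of where to look next, with illustrative pod/node names:

# was the pod evicted (e.g. for disk pressure) or OOM-killed?
kubectl get pods --field-selector=status.phase=Failed
kubectl describe pod snakejob-<id>       # check events for Evicted / DiskPressure / OOMKilled
kubectl logs snakejob-<id> --previous    # logs from the terminated container, if any were captured
# node-level conditions and events on the node that ran the job
kubectl describe node gke-ukb-io-1-default-pool-<id>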