related-sciences / ukb-gwas-pipeline-nealelab

Pipeline for reproduction of NealeLab 2018 UKB GWAS

Estimate costs for complete BGEN conversion #8

Closed eric-czech closed 3 years ago

eric-czech commented 4 years ago

Stats for conversion of chr 21 and 22:

Compute Cost

Approximate cost for conversion of chr21 and 22 = 16.5 hrs × $0.38/hr = $6.27. Extrapolating to all chromosomes: $6.27 × (97M / (1.26M + 1.25M)) = $242.
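The extrapolation above is a simple linear scaling of the pilot cost. Here is a minimal sketch of that arithmetic in Python. The function and constant names are illustrative only, and the 97M, 1.26M and 1.25M figures are assumed to be variant counts, which the comment does not state explicitly.

```python
# Figures quoted in the comment above; names are illustrative.
PILOT_HOURS = 16.5        # VM hours for chr21 + chr22
USD_PER_HOUR = 0.38       # hourly VM price implied by the comment
PILOT_UNITS = 1.26e6 + 1.25e6   # chr21 + chr22 (presumably variant counts)
TOTAL_UNITS = 97e6              # all chromosomes (presumably variant count)


def extrapolate_cost(pilot_cost, pilot_units, total_units):
    """Scale a pilot cost linearly with the amount of data to process."""
    return pilot_cost * total_units / pilot_units


pilot_cost = PILOT_HOURS * USD_PER_HOUR
total_cost = extrapolate_cost(pilot_cost, PILOT_UNITS, TOTAL_UNITS)
print(f"pilot: ${pilot_cost:.2f}, all chromosomes: ${total_cost:.0f}")
```

Linear scaling assumes conversion time is proportional to the amount of data and ignores retries after failures, so it is best read as a lower bound.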

Storage Cost

Estimated storage costs (see https://cloud.google.com/storage/pricing) = $0.026 per GB × 2.32 TiB raw bgen × 1.2 size inflation (https://github.com/related-sciences/ukb-gwas-pipeline-nealelab/issues/7) = 0.026 × 2550.867 × 1.2 = $79.58 monthly.
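The storage estimate mixes binary units (TiB) with a price quoted per decimal GB, so the conversion step is worth spelling out. This is a small sketch of it, using the $0.026 per GB rate and the 1.2 inflation factor quoted above. The helper name is hypothetical.

```python
TIB = 2**40   # bytes in one tebibyte
GB = 10**9    # bytes in one (decimal) gigabyte


def monthly_storage_cost(size_tib, usd_per_gb=0.026, inflation=1.0):
    """Monthly cost in USD for `size_tib` TiB, optionally inflated by a size factor."""
    size_gb = size_tib * TIB / GB
    return size_gb * inflation * usd_per_gb


raw_gb = 2.32 * TIB / GB                          # raw BGEN size in GB
cost = monthly_storage_cost(2.32, inflation=1.2)  # with 1.2x Zarr inflation
print(f"{raw_gb:.3f} GB raw, about ${cost:.2f} per month")
```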

eric-czech commented 4 years ago

I tried running conversions on preemptible VMs to see if we could save some money that way, but the experiment didn't go well. Here are some of the issues I ran into:

ValueError: path 'variant_id' contains an array

Environment for envs/io.yaml created (location: .snakemake/conda/a0f8b672)
Using shell: /bin/bash
Provided cores: 32
Rules claiming more threads will be scaled down.
Job counts:
    count   jobs
    1       bgen_to_zarr
    1

[Fri Oct 9 15:42:47 2020]
rule bgen_to_zarr:
    input: rs-ukb/raw-data/gt-imputation/ukb_imp_chr21_v3.bgen, rs-ukb/raw-data/gt-imputation/ukb_mfi_chr21_v3.txt, rs-ukb/raw-data/gt-imputation/ukb59384_imp_chr21_v3_s487296.sample
    output: rs-ukb/prep-data/gt-imputation/ukb_chr21.ckpt
    jobid: 0
    wildcards: bgen_contig=21
    threads: 30
    resources: mem_mb=124072

Downloading from remote: rs-ukb/raw-data/gt-imputation/ukb59384_imp_chr21_v3_s487296.sample
Finished download.
Downloading from remote: rs-ukb/raw-data/gt-imputation/ukb_imp_chr21_v3.bgen
Finished download.
Downloading from remote: rs-ukb/raw-data/gt-imputation/ukb_mfi_chr21_v3.txt
Finished download.
Activating conda environment: /workdir/.snakemake/conda/a0f8b672
2020-10-09 15:50:05,248 | __main__ | INFO | Loading BGEN dataset for contig Contig(name=21, index=20) from rs-ukb/raw-data/gt-imputation/ukb_imp_chr21_v3.bgen (chunks = (250, -1))
2020-10-09 15:50:30,683 | __main__ | INFO | Rechunking dataset for contig Contig(name=21, index=20) to rs-ukb/prep-data/gt-imputation/ukb_chr21.zarr (chunks = (5216, 5792))
Traceback (most recent call last):
  File "scripts/convert_genetic_data.py", line 309, in <module>
    fire.Fire()
  File "/workdir/.snakemake/conda/a0f8b672/lib/python3.8/site-packages/fire/core.py", line 138, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/workdir/.snakemake/conda/a0f8b672/lib/python3.8/site-packages/fire/core.py", line 463, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/workdir/.snakemake/conda/a0f8b672/lib/python3.8/site-packages/fire/core.py", line 672, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "scripts/convert_genetic_data.py", line 293, in bgen_to_zarr
    ds = rechunk_dataset(
  File "scripts/convert_genetic_data.py", line 215, in rechunk_dataset
    res = fn(
  File "/workdir/.snakemake/conda/a0f8b672/lib/python3.8/site-packages/sgkit/io/bgen/bgen_reader.py", line 510, in rechunk_bgen
    rechunked = rechunker_api.rechunk(
  File "/workdir/.snakemake/conda/a0f8b672/lib/python3.8/site-packages/rechunker/api.py", line 289, in rechunk
    copy_spec, intermediate, target = _setup_rechunk(
  File "/workdir/.snakemake/conda/a0f8b672/lib/python3.8/site-packages/rechunker/api.py", line 368, in _setup_rechunk
    copy_spec = _setup_array_rechunk(
  File "/workdir/.snakemake/conda/a0f8b672/lib/python3.8/site-packages/rechunker/api.py", line 488, in _setup_array_rechunk
    target_array = _zarr_empty(
  File "/workdir/.snakemake/conda/a0f8b672/lib/python3.8/site-packages/rechunker/api.py", line 151, in _zarr_empty
    return store_or_group.empty(
  File "/workdir/.snakemake/conda/a0f8b672/lib/python3.8/site-packages/zarr/hierarchy.py", line 891, in empty
    return self._write_op(self._empty_nosync, name, **kwargs)
  File "/workdir/.snakemake/conda/a0f8b672/lib/python3.8/site-packages/zarr/hierarchy.py", line 658, in _write_op
    return f(*args, **kwargs)
  File "/workdir/.snakemake/conda/a0f8b672/lib/python3.8/site-packages/zarr/hierarchy.py", line 897, in _empty_nosync
    return empty(store=self._store, path=path, chunk_store=self._chunk_store,
  File "/workdir/.snakemake/conda/a0f8b672/lib/python3.8/site-packages/zarr/creation.py", line 215, in empty
    return create(shape=shape, fill_value=None, **kwargs)
  File "/workdir/.snakemake/conda/a0f8b672/lib/python3.8/site-packages/zarr/creation.py", line 119, in create
    init_array(store, shape=shape, chunks=chunks, dtype=dtype, compressor=compressor,
  File "/workdir/.snakemake/conda/a0f8b672/lib/python3.8/site-packages/zarr/storage.py", line 329, in init_array
    _init_array_metadata(store, shape=shape, chunks=chunks, dtype=dtype,
  File "/workdir/.snakemake/conda/a0f8b672/lib/python3.8/site-packages/zarr/storage.py", line 347, in _init_array_metadata
    err_contains_array(path)
  File "/workdir/.snakemake/conda/a0f8b672/lib/python3.8/site-packages/zarr/errors.py", line 17, in err_contains_array
    raise ValueError('path %r contains an array' % path)
ValueError: path 'variant_id' contains an array
[Fri Oct 9 15:50:32 2020]
    shell:
        python scripts/convert_genetic_data.py bgen_to_zarr --input-path-bgen=rs-ukb/raw-data/gt-imputation/ukb_imp_chr21_v3.bgen --input-path-variants=rs-ukb/raw-data/gt-imputation/ukb_mfi_chr21_v3.txt --input-path-samples=rs-ukb/raw-data/gt-imputation/ukb59384_imp_chr21_v3_s487296.sample --output-path=rs-ukb/prep-data/gt-imputation/ukb_chr21.zarr --contig-name=21 --contig-index=20 --remote=True
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
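The traceback shows Zarr refusing to create `variant_id` because array metadata already exists at that path in the target store. The thread does not say how it got there. One plausible explanation, and it is only an assumption, is output left behind by an earlier attempt that was interrupted when the VM was preempted. Under that assumption, a retry would need to clear the stale array first. The sketch below illustrates the idea on a plain dict standing in for a Zarr v2 store, which is modeled as a key-value mapping with keys like `variant_id/.zarray`. The helper name and the toy keys are hypothetical, and this is not the pipeline's actual code.

```python
from collections.abc import MutableMapping


def remove_stale_array(store: MutableMapping, path: str) -> int:
    """Delete every key stored under `path/` and return how many were removed."""
    prefix = path.rstrip("/") + "/"
    stale = [key for key in store if key.startswith(prefix)]
    for key in stale:
        del store[key]
    return len(stale)


# Toy stand-in for a partially written store (keys and values are made up).
store = {
    ".zgroup": b"{}",
    "variant_id/.zarray": b"{}",
    "variant_id/0": b"chunk-bytes",
    "variant_position/.zarray": b"{}",
}

removed = remove_stale_array(store, "variant_id")
print(f"removed {removed} keys; remaining: {sorted(store)}")
```

Removing only the affected path keeps any arrays that were written completely, at the cost of having to know which ones are safe to keep. Deleting the whole target before a retry is the blunter alternative.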

google.auth.exceptions.RefreshError: Failed to retrieve http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/452178172058-compute@developer.gserviceaccount.com/?recursive=true from the Google Compute Enginemetadata service. Compute Engine Metadata server unavailable

Traceback (most recent call last):
  File "/opt/conda/envs/snakemake/lib/python3.8/site-packages/snakemake/__init__.py", line 490, in snakemake
    _default_remote_provider = rmt.RemoteProvider(
  File "/opt/conda/envs/snakemake/lib/python3.8/site-packages/snakemake/remote/GS.py", line 93, in __init__
    self.client = storage.Client(*args, **kwargs)
  File "/opt/conda/envs/snakemake/lib/python3.8/site-packages/google/cloud/storage/client.py", line 110, in __init__
    super(Client, self).__init__(
  File "/opt/conda/envs/snakemake/lib/python3.8/site-packages/google/cloud/client.py", line 249, in __init__
    _ClientProjectMixin.__init__(self, project=project)
  File "/opt/conda/envs/snakemake/lib/python3.8/site-packages/google/cloud/client.py", line 203, in __init__
    raise EnvironmentError(
OSError: Project was not passed and could not be determined from the environment.
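The second failure follows from the metadata server being unavailable at the moment the storage client was created. Whether retrying would have helped here is not established in the thread. As a generic illustration only, a retry with exponential backoff around a flaky constructor could look like the sketch below. `retry` and `flaky_client_factory` are hypothetical stand-ins, and no Google client library is called.

```python
import time


def retry(fn, attempts=5, base_delay=1.0, retry_on=(OSError,), sleep=time.sleep):
    """Call fn() until it succeeds, doubling the delay after each failure."""
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


calls = {"n": 0}


def flaky_client_factory():
    """Stand-in for client construction that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("Project was not passed and could not be determined from the environment.")
    return "client"


delays = []  # record the requested delays instead of actually sleeping
client = retry(flaky_client_factory, sleep=delays.append)
print(client, delays)
```

Passing `sleep` as a parameter keeps the example fast to run. In real use the default `time.sleep` would apply.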

I'm not going to experiment with preemptible VMs any further.

eric-czech commented 3 years ago

Zarr archive size after rechunking:

# All figures in GiB
gs://rs-ukb/prep/gt-imputation/ukb_chr1.zarr/ 199.7
gs://rs-ukb/prep/gt-imputation/ukb_chr10.zarr/ 123.57
gs://rs-ukb/prep/gt-imputation/ukb_chr11.zarr/ 118.78
gs://rs-ukb/prep/gt-imputation/ukb_chr12.zarr/ 119.63
gs://rs-ukb/prep/gt-imputation/ukb_chr13.zarr/ 88.75
gs://rs-ukb/prep/gt-imputation/ukb_chr14.zarr/ 84.64
gs://rs-ukb/prep/gt-imputation/ukb_chr15.zarr/ 79.83
gs://rs-ukb/prep/gt-imputation/ukb_chr16.zarr/ 86.69
gs://rs-ukb/prep/gt-imputation/ukb_chr17.zarr/ 76.74
gs://rs-ukb/prep/gt-imputation/ukb_chr18.zarr/ 71.63
gs://rs-ukb/prep/gt-imputation/ukb_chr19.zarr/ 65.49
gs://rs-ukb/prep/gt-imputation/ukb_chr2.zarr/ 207.6
gs://rs-ukb/prep/gt-imputation/ukb_chr20.zarr/ 57.18
gs://rs-ukb/prep/gt-imputation/ukb_chr21.zarr/ 40.12
gs://rs-ukb/prep/gt-imputation/ukb_chr22.zarr/ 40.65
gs://rs-ukb/prep/gt-imputation/ukb_chr3.zarr/ 170.49
gs://rs-ukb/prep/gt-imputation/ukb_chr4.zarr/ 172.79
gs://rs-ukb/prep/gt-imputation/ukb_chr5.zarr/ 154.65
gs://rs-ukb/prep/gt-imputation/ukb_chr6.zarr/ 153.04
gs://rs-ukb/prep/gt-imputation/ukb_chr7.zarr/ 147.12
gs://rs-ukb/prep/gt-imputation/ukb_chr8.zarr/ 133.72
gs://rs-ukb/prep/gt-imputation/ukb_chr9.zarr/ 114.27
gs://rs-ukb/prep/gt-imputation/ukb_chrXY.zarr/ 5.05

Command:

 # Print the total size of each Zarr archive under the prefix
 for f in $(gsutil ls gs://rs-ukb/prep/gt-imputation | grep zarr)
 do
   echo "$f"
   gsutil du -ch "$f" | grep -i total
 done

This gives a total of 2.51 TiB, so the original estimate was close: $0.026 per GB × 2.51 TiB = 0.026 × 2759.774 = ~$72.
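As a cross-check, summing the per-chromosome figures exactly as listed above gives about 2,512 GiB. That is roughly 2.45 TiB (about 2,697 GB), which suggests the 2.51 figure may be the GiB total divided by 1000. At $0.026 per GB the monthly cost then comes out nearer $70 than $72, and the conclusion does not change. The sketch below redoes the arithmetic. The variable names are illustrative.

```python
# Zarr archive sizes in GiB, copied from the listing above (chr1..chr22, then XY).
ZARR_GIB = [
    199.7, 207.6, 170.49, 172.79, 154.65, 153.04, 147.12, 133.72,
    114.27, 123.57, 118.78, 119.63, 88.75, 84.64, 79.83, 86.69,
    76.74, 71.63, 65.49, 57.18, 40.12, 40.65, 5.05,
]

GIB = 2**30
GB = 10**9
USD_PER_GB_MONTH = 0.026  # rate quoted earlier in the thread

total_gib = sum(ZARR_GIB)
total_tib = total_gib / 1024
total_gb = total_gib * GIB / GB
monthly_cost = total_gb * USD_PER_GB_MONTH

print(f"{total_gib:.2f} GiB = {total_tib:.2f} TiB = {total_gb:.0f} GB")
print(f"about ${monthly_cost:.2f} per month")
```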

The VM rental time needed came to roughly $1,500, though. That includes a lot of repetition after failures, and with at least some of them fixed I think this would be more like $1,000 in a second pass.

eric-czech commented 3 years ago

For comparison, here are the original bgen file sizes:

> gsutil du -ch gs://rs-ukb/raw/gt-imputation/*.bgen
113.02 GiB   gs://rs-ukb/raw/gt-imputation/ukb_imp_chr10_v3.bgen
108.66 GiB   gs://rs-ukb/raw/gt-imputation/ukb_imp_chr11_v3.bgen
108.11 GiB   gs://rs-ukb/raw/gt-imputation/ukb_imp_chr12_v3.bgen
80.63 GiB    gs://rs-ukb/raw/gt-imputation/ukb_imp_chr13_v3.bgen
76.89 GiB    gs://rs-ukb/raw/gt-imputation/ukb_imp_chr14_v3.bgen
71.31 GiB    gs://rs-ukb/raw/gt-imputation/ukb_imp_chr15_v3.bgen
76.54 GiB    gs://rs-ukb/raw/gt-imputation/ukb_imp_chr16_v3.bgen
68.1 GiB     gs://rs-ukb/raw/gt-imputation/ukb_imp_chr17_v3.bgen
64.49 GiB    gs://rs-ukb/raw/gt-imputation/ukb_imp_chr18_v3.bgen
58.72 GiB    gs://rs-ukb/raw/gt-imputation/ukb_imp_chr19_v3.bgen
181.15 GiB   gs://rs-ukb/raw/gt-imputation/ukb_imp_chr1_v3.bgen
51.31 GiB    gs://rs-ukb/raw/gt-imputation/ukb_imp_chr20_v3.bgen
36.48 GiB    gs://rs-ukb/raw/gt-imputation/ukb_imp_chr21_v3.bgen
36.61 GiB    gs://rs-ukb/raw/gt-imputation/ukb_imp_chr22_v3.bgen
188.03 GiB   gs://rs-ukb/raw/gt-imputation/ukb_imp_chr2_v3.bgen
156.05 GiB   gs://rs-ukb/raw/gt-imputation/ukb_imp_chr3_v3.bgen
160.29 GiB   gs://rs-ukb/raw/gt-imputation/ukb_imp_chr4_v3.bgen
141.17 GiB   gs://rs-ukb/raw/gt-imputation/ukb_imp_chr5_v3.bgen
143.64 GiB   gs://rs-ukb/raw/gt-imputation/ukb_imp_chr6_v3.bgen
133.91 GiB   gs://rs-ukb/raw/gt-imputation/ukb_imp_chr7_v3.bgen
122.15 GiB   gs://rs-ukb/raw/gt-imputation/ukb_imp_chr8_v3.bgen
103.27 GiB   gs://rs-ukb/raw/gt-imputation/ukb_imp_chr9_v3.bgen
4.58 GiB     gs://rs-ukb/raw/gt-imputation/ukb_imp_chrXY_v3.bgen
80.78 GiB    gs://rs-ukb/raw/gt-imputation/ukb_imp_chrX_v3.bgen
2.31 TiB     total
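With both listings available, the size inflation from BGEN to Zarr can be computed directly rather than assumed. The sketch below uses the figures as listed in this thread. Over the chromosomes that appear in both listings the ratio comes out near 1.10, somewhat below the 1.2 used in the original storage estimate. chrX has a BGEN file but no Zarr entry in the listing above, so it is left out of the comparison. The dictionary and variable names are illustrative.

```python
# Sizes in GiB, copied from the two listings in this thread.
ZARR_GIB = {
    "1": 199.7, "2": 207.6, "3": 170.49, "4": 172.79, "5": 154.65,
    "6": 153.04, "7": 147.12, "8": 133.72, "9": 114.27, "10": 123.57,
    "11": 118.78, "12": 119.63, "13": 88.75, "14": 84.64, "15": 79.83,
    "16": 86.69, "17": 76.74, "18": 71.63, "19": 65.49, "20": 57.18,
    "21": 40.12, "22": 40.65, "XY": 5.05,
}
BGEN_GIB = {
    "1": 181.15, "2": 188.03, "3": 156.05, "4": 160.29, "5": 141.17,
    "6": 143.64, "7": 133.91, "8": 122.15, "9": 103.27, "10": 113.02,
    "11": 108.66, "12": 108.11, "13": 80.63, "14": 76.89, "15": 71.31,
    "16": 76.54, "17": 68.1, "18": 64.49, "19": 58.72, "20": 51.31,
    "21": 36.48, "22": 36.61, "XY": 4.58, "X": 80.78,
}

# Compare only chromosomes present in both listings.
matched = [c for c in BGEN_GIB if c in ZARR_GIB]
unmatched = [c for c in BGEN_GIB if c not in ZARR_GIB]

per_chrom = {c: ZARR_GIB[c] / BGEN_GIB[c] for c in matched}
overall = sum(ZARR_GIB[c] for c in matched) / sum(BGEN_GIB[c] for c in matched)

print(f"overall inflation over {len(matched)} chromosomes: {overall:.3f}")
print(f"range: {min(per_chrom.values()):.3f} to {max(per_chrom.values()):.3f}")
print(f"no Zarr size listed for: {unmatched}")
```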