eric-czech closed 3 years ago
I tried running conversions on preemptible VMs to see if we can save some money using those, but this experiment didn't go well. Here are some of the issues I ran into:
google.auth.exceptions.RefreshError: Failed to retrieve http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/452178172058-compute@developer.gserviceaccount.com/?recursive=true from the Google Compute Enginemetadata service. Compute Engine Metadata server unavailable
Traceback (most recent call last):
  File "/opt/conda/envs/snakemake/lib/python3.8/site-packages/snakemake/__init__.py", line 490, in snakemake
    _default_remote_provider = rmt.RemoteProvider(
  File "/opt/conda/envs/snakemake/lib/python3.8/site-packages/snakemake/remote/GS.py", line 93, in __init__
    self.client = storage.Client(*args, **kwargs)
  File "/opt/conda/envs/snakemake/lib/python3.8/site-packages/google/cloud/storage/client.py", line 110, in __init__
    super(Client, self).__init__(
  File "/opt/conda/envs/snakemake/lib/python3.8/site-packages/google/cloud/client.py", line 249, in __init__
    _ClientProjectMixin.__init__(self, project=project)
  File "/opt/conda/envs/snakemake/lib/python3.8/site-packages/google/cloud/client.py", line 203, in __init__
    raise EnvironmentError(
OSError: Project was not passed and could not be determined from the environment.
I'm not going to experiment any more with these.
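Both failures appear to happen while the storage client is being created (credentials from the metadata server, then the project ID) rather than in the conversion itself. If anyone revisits this, one option is to retry client creation and pass the project explicitly. A minimal sketch, not something tested on these VMs: `retry_with_backoff` and its parameters are hypothetical names, and the commented usage assumes `google-cloud-storage` is installed and uses a placeholder project ID.

```python
import time


def retry_with_backoff(fn, retryable=(Exception,), attempts=5,
                       base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying on `retryable` errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except retryable:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Wait 1x, 2x, 4x, ... the base delay between attempts.
            sleep(base_delay * 2 ** attempt)


# Hypothetical usage ("my-project" is a placeholder):
# from google.cloud import storage
# from google.auth.exceptions import RefreshError
# client = retry_with_backoff(
#     lambda: storage.Client(project="my-project"),
#     retryable=(RefreshError, OSError),
# )
```

In this pipeline the client is built inside Snakemake's `GS.py`, so a wrapper like this only helps in code you control. Setting the `GOOGLE_CLOUD_PROJECT` environment variable is another way to supply the project in that case.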
Zarr archive size after rechunking:
# All figures in GiB
gs://rs-ukb/prep/gt-imputation/ukb_chr1.zarr/ 199.7
gs://rs-ukb/prep/gt-imputation/ukb_chr10.zarr/ 123.57
gs://rs-ukb/prep/gt-imputation/ukb_chr11.zarr/ 118.78
gs://rs-ukb/prep/gt-imputation/ukb_chr12.zarr/ 119.63
gs://rs-ukb/prep/gt-imputation/ukb_chr13.zarr/ 88.75
gs://rs-ukb/prep/gt-imputation/ukb_chr14.zarr/ 84.64
gs://rs-ukb/prep/gt-imputation/ukb_chr15.zarr/ 79.83
gs://rs-ukb/prep/gt-imputation/ukb_chr16.zarr/ 86.69
gs://rs-ukb/prep/gt-imputation/ukb_chr17.zarr/ 76.74
gs://rs-ukb/prep/gt-imputation/ukb_chr18.zarr/ 71.63
gs://rs-ukb/prep/gt-imputation/ukb_chr19.zarr/ 65.49
gs://rs-ukb/prep/gt-imputation/ukb_chr2.zarr/ 207.6
gs://rs-ukb/prep/gt-imputation/ukb_chr20.zarr/ 57.18
gs://rs-ukb/prep/gt-imputation/ukb_chr21.zarr/ 40.12
gs://rs-ukb/prep/gt-imputation/ukb_chr22.zarr/ 40.65
gs://rs-ukb/prep/gt-imputation/ukb_chr3.zarr/ 170.49
gs://rs-ukb/prep/gt-imputation/ukb_chr4.zarr/ 172.79
gs://rs-ukb/prep/gt-imputation/ukb_chr5.zarr/ 154.65
gs://rs-ukb/prep/gt-imputation/ukb_chr6.zarr/ 153.04
gs://rs-ukb/prep/gt-imputation/ukb_chr7.zarr/ 147.12
gs://rs-ukb/prep/gt-imputation/ukb_chr8.zarr/ 133.72
gs://rs-ukb/prep/gt-imputation/ukb_chr9.zarr/ 114.27
gs://rs-ukb/prep/gt-imputation/ukb_chrXY.zarr/ 5.05
Command:
for f in $(gsutil ls gs://rs-ukb/prep/gt-imputation | grep zarr)
do
  echo "$f"
  gsutil du -ch "$f" | grep -i total
done
This gives a total of 2.51 TiB, so the original estimate was close: $0.026 per GB * 2.51 TiB = 0.026 * 2759.774 GB = ~$72 per month.
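As a cross-check, the per-chromosome figures listed above can be summed and converted directly. A small sketch, assuming the listed sizes are GiB (as the header says) and that the $0.026 rate is per decimal GB per month; the variable names are mine. Note that the listing has no chrX archive.

```python
# Rechunked Zarr archive sizes in GiB, copied from the listing above.
zarr_gib = {
    "chr1": 199.7, "chr2": 207.6, "chr3": 170.49, "chr4": 172.79,
    "chr5": 154.65, "chr6": 153.04, "chr7": 147.12, "chr8": 133.72,
    "chr9": 114.27, "chr10": 123.57, "chr11": 118.78, "chr12": 119.63,
    "chr13": 88.75, "chr14": 84.64, "chr15": 79.83, "chr16": 86.69,
    "chr17": 76.74, "chr18": 71.63, "chr19": 65.49, "chr20": 57.18,
    "chr21": 40.12, "chr22": 40.65, "chrXY": 5.05,
}

PRICE_PER_GB_MONTH = 0.026  # USD, rate quoted in this thread

total_gib = sum(zarr_gib.values())
total_tib = total_gib / 1024          # binary units: 1 TiB = 1024 GiB
total_gb = total_gib * 2**30 / 10**9  # decimal GB, the unit the price uses
monthly_cost = PRICE_PER_GB_MONTH * total_gb

print(f"{total_gib:.2f} GiB = {total_tib:.2f} TiB = {total_gb:.1f} GB")
print(f"~${monthly_cost:.0f} per month")
```

By this arithmetic the listed sizes add up to about 2512 GiB, which is about 2.45 TiB when divided by 1024 (2.51 only when divided by 1000), so the monthly figure lands nearer $70 than $72. Either way the conclusion that the original estimate was close still holds.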
The VM rental time needed came to roughly $1,500, though that includes a lot of repetition after failures. With at least some of those fixed, I think a second pass would be more like $1,000.
For comparison, here are the original bgen file sizes:
> gsutil du -ch gs://rs-ukb/raw/gt-imputation/*.bgen
113.02 GiB gs://rs-ukb/raw/gt-imputation/ukb_imp_chr10_v3.bgen
108.66 GiB gs://rs-ukb/raw/gt-imputation/ukb_imp_chr11_v3.bgen
108.11 GiB gs://rs-ukb/raw/gt-imputation/ukb_imp_chr12_v3.bgen
80.63 GiB gs://rs-ukb/raw/gt-imputation/ukb_imp_chr13_v3.bgen
76.89 GiB gs://rs-ukb/raw/gt-imputation/ukb_imp_chr14_v3.bgen
71.31 GiB gs://rs-ukb/raw/gt-imputation/ukb_imp_chr15_v3.bgen
76.54 GiB gs://rs-ukb/raw/gt-imputation/ukb_imp_chr16_v3.bgen
68.1 GiB gs://rs-ukb/raw/gt-imputation/ukb_imp_chr17_v3.bgen
64.49 GiB gs://rs-ukb/raw/gt-imputation/ukb_imp_chr18_v3.bgen
58.72 GiB gs://rs-ukb/raw/gt-imputation/ukb_imp_chr19_v3.bgen
181.15 GiB gs://rs-ukb/raw/gt-imputation/ukb_imp_chr1_v3.bgen
51.31 GiB gs://rs-ukb/raw/gt-imputation/ukb_imp_chr20_v3.bgen
36.48 GiB gs://rs-ukb/raw/gt-imputation/ukb_imp_chr21_v3.bgen
36.61 GiB gs://rs-ukb/raw/gt-imputation/ukb_imp_chr22_v3.bgen
188.03 GiB gs://rs-ukb/raw/gt-imputation/ukb_imp_chr2_v3.bgen
156.05 GiB gs://rs-ukb/raw/gt-imputation/ukb_imp_chr3_v3.bgen
160.29 GiB gs://rs-ukb/raw/gt-imputation/ukb_imp_chr4_v3.bgen
141.17 GiB gs://rs-ukb/raw/gt-imputation/ukb_imp_chr5_v3.bgen
143.64 GiB gs://rs-ukb/raw/gt-imputation/ukb_imp_chr6_v3.bgen
133.91 GiB gs://rs-ukb/raw/gt-imputation/ukb_imp_chr7_v3.bgen
122.15 GiB gs://rs-ukb/raw/gt-imputation/ukb_imp_chr8_v3.bgen
103.27 GiB gs://rs-ukb/raw/gt-imputation/ukb_imp_chr9_v3.bgen
4.58 GiB gs://rs-ukb/raw/gt-imputation/ukb_imp_chrXY_v3.bgen
80.78 GiB gs://rs-ukb/raw/gt-imputation/ukb_imp_chrX_v3.bgen
2.31 TiB total
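Putting a few of these bgen sizes next to the rechunked Zarr sizes listed earlier gives a rough inflation factor. A small sketch using figures copied from the two listings; the choice of chromosomes is arbitrary, and chrX is left out because the Zarr listing has no entry for it.

```python
# (zarr GiB, bgen GiB) pairs copied from the two listings in this thread.
sizes_gib = {
    "chr1": (199.7, 181.15),
    "chr2": (207.6, 188.03),
    "chr21": (40.12, 36.48),
    "chr22": (40.65, 36.61),
    "chrXY": (5.05, 4.58),
}

# Size of the Zarr archive relative to the source bgen file.
inflation = {chrom: zarr / bgen for chrom, (zarr, bgen) in sizes_gib.items()}

for chrom, ratio in inflation.items():
    print(f"{chrom}: {ratio:.2f}x")
```

For these chromosomes the ratio comes out at roughly 1.10, a little under the 1.2 inflation factor assumed in the original storage estimate.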
Stats for conversion of chr 21 and 22:
Compute Cost
Approximate cost for conversion of chr21 and chr22 = 16.5 hrs * $0.38/hr = $6.27. Extrapolating to all chromosomes: $6.27 * (97M / (1.26M + 1.25M)) = $242.
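The extrapolation above is a straight proportional scaling, sketched here for reference. The variable names are mine, and I am assuming the 97M, 1.26M and 1.25M figures are counts of the same kind of item (presumably variants; the text does not say) and that $0.38 is an hourly VM rate.

```python
hours = 16.5          # run time for chr21 + chr22
rate_per_hour = 0.38  # USD per hour (assumed to be the VM's hourly rate)

converted = 1.26e6 + 1.25e6  # items covered by chr21 + chr22
total = 97e6                 # items across all chromosomes

cost_chr21_22 = hours * rate_per_hour
# Scale linearly by the share of the data already converted.
cost_all = cost_chr21_22 * total / converted

print(f"chr21 + chr22: ${cost_chr21_22:.2f}")
print(f"all chromosomes: ~${cost_all:.0f}")
```

This assumes cost grows linearly with the count and ignores reruns after failures.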
Storage Cost
Estimated storage costs (see https://cloud.google.com/storage/pricing) = $0.026 per GB * 2.32 TiB raw bgen * 1.2 size inflation (https://github.com/related-sciences/ukb-gwas-pipeline-nealelab/issues/7) = 0.026 * 2550.867 * 1.2 = $79.58 monthly.
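The same estimate written as a small function, mainly to make the TiB to GB conversion explicit. This is a sketch: `monthly_storage_cost` is a hypothetical helper, and the rate and inflation factor are the ones quoted above.

```python
GB_PER_TIB = 2**40 / 10**9  # one binary TiB expressed in decimal GB


def monthly_storage_cost(size_tib, price_per_gb=0.026, inflation=1.0):
    """Estimated monthly storage cost in USD for `size_tib` TiB of data."""
    return price_per_gb * size_tib * GB_PER_TIB * inflation


# 2.32 TiB of raw bgen with the assumed 1.2x size inflation.
estimate = monthly_storage_cost(2.32, inflation=1.2)
print(f"{2.32 * GB_PER_TIB:.3f} GB -> ${estimate:.2f} per month")
```

The result agrees with the $79.58 figure above to within a cent of rounding.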