related-sciences / ukb-gwas-pipeline-nealelab

Pipeline for reproduction of NealeLab 2018 UKB GWAS

Estimate cost of GWAS regression steps #32

Open eric-czech opened 3 years ago

eric-czech commented 3 years ago

This is an estimate of the VM rental time necessary to do the GWAS regressions (similar to https://github.com/related-sciences/ukb-gwas-pipeline-nealelab/issues/8).

Here are current figures:

A ballpark cost to keep a cluster of this size running that long is 60 nodes * (231 days * 24 hrs/day) * $0.946424/hr ≈ $314,818.
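For anyone who wants to re-run the arithmetic with different inputs, here is a minimal sketch of the same calculation. The function name `cluster_cost` is just illustrative; the node count, duration, and hourly rate are the figures quoted above.

```python
# Back-of-envelope VM rental cost, using the figures quoted in this issue.
NODES = 60                   # cluster size
DAYS = 231                   # estimated wall-clock duration
HOURLY_RATE_USD = 0.946424   # per node-hour, as quoted above


def cluster_cost(nodes: int, days: float, hourly_rate: float) -> float:
    """Return the total rental cost for `nodes` machines running for `days` days."""
    node_hours = nodes * days * 24
    return node_hours * hourly_rate


node_hours = NODES * DAYS * 24
total = cluster_cost(NODES, DAYS, HOURLY_RATE_USD)
print(f"{node_hours:,} node-hours -> ${total:,.0f}")
# prints: 332,640 node-hours -> $314,818
```

Changing any one input (fewer nodes, a cheaper instance type, a shorter run) scales the total linearly, which makes it easy to see which lever matters most.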

Clearly we have got to find some room to improve this.

eric-czech commented 3 years ago

On instance pricing:

If various dask memory issues could be solved and we could use preemptible standard instances, the total cost would be around $53k.
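As a rough sanity check, the $53k figure can be turned back into an implied per-node-hour rate. This sketch assumes the cluster size and runtime stay the same as in the estimate above (60 nodes for 231 days), which may not hold, since standard instances have less memory and could behave differently. The $53k input is only "around" that value, so the outputs are approximate.

```python
# Implied hourly rate behind the ~$53k preemptible estimate.
# Assumption: same 60 nodes and 231 days as the on-demand estimate.
NODE_HOURS = 60 * 231 * 24        # 332,640 node-hours
ON_DEMAND_TOTAL_USD = 314_818     # on-demand estimate from this issue
PREEMPTIBLE_TOTAL_USD = 53_000    # "around $53k", approximate

implied_rate = PREEMPTIBLE_TOTAL_USD / NODE_HOURS
savings_fraction = 1 - PREEMPTIBLE_TOTAL_USD / ON_DEMAND_TOTAL_USD

print(f"implied rate: ${implied_rate:.3f} per node-hour")
print(f"reduction vs. on-demand estimate: {savings_fraction:.0%}")
```

Under those assumptions this works out to roughly $0.16 per node-hour, a reduction of about 83% relative to the $314,818 estimate.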

ravwojdyla commented 3 years ago

@eric-czech I assume this stat:

Running phenotypes on chr21 (141,910 variants) for 11 hr 5 mins produced results for 265 phenotypes when using a cluster of 60 n1-highmem-16 instances

and this comment https://github.com/pystatgen/sgkit/issues/390#issuecomment-748205731 are for the same run, correct? If so, how does the 11 hr 5 mins here relate to the roughly 2 hrs it took to run the regressions for chr21?

eric-czech commented 3 years ago

and this comment pystatgen/sgkit#390 (comment) are for the same run, correct?

No, that caption is definitely misleading -- I was either wrong when I wrote it or trying to make it clear that the individual phenotypes can be seen as single spikes. Here is a full version of that readout that also includes the run of the 265 phenotypes:

[Image: Screen Shot 2021-01-21 at 4 22 06 PM]