opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Decide on storage class management for Gentropy Vault #3232

Closed: tskir closed this issue 3 months ago

tskir commented 4 months ago

As we discussed, we want to mirror certain datasets into a vault bucket in Google Cloud Storage. Because this involves substantial amounts of data, we need to decide how to manage storage classes to minimise costs.

Options

We really only have two:

1. Archive storage + manual retrieval

In this scenario, the vault data is stored in the Archive class from the beginning. If we want to use the data, we manually copy it into a staging bucket (let's say for a month) and perform the work we need. Storage is cheap, but we incur a large retrieval charge each time we access the data.

2. Autoclass storage

In this scenario, the entire vault bucket is configured with the Autoclass feature. All data starts in Standard storage and progressively sinks into colder storage classes if it's not accessed.

With Autoclass, there are no retrieval charges (even from cold storage) and no early deletion charges.
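As a rough illustration of the "sinking" behaviour, the sketch below maps the number of days since an object was last accessed to the class Autoclass would keep it in. The 30/90/365-day thresholds and the Archive terminal class are assumptions based on Google's Autoclass documentation, not something stated in this issue, and the function name is hypothetical. Check the current documentation and the bucket's configured terminal class before relying on them.

```python
# Assumed Autoclass transition thresholds (days without access -> class),
# for a bucket whose terminal storage class is ARCHIVE. These values are
# an assumption taken from Google's documentation, not from this issue.
AUTOCLASS_THRESHOLDS = [
    (365, "ARCHIVE"),
    (90, "COLDLINE"),
    (30, "NEARLINE"),
]


def autoclass_storage_class(days_since_last_access: int) -> str:
    """Return the class an object would be in after the given idle period.

    Accessing an object moves it back to STANDARD, which this simplified
    model represents as resetting the idle period to 0 days.
    """
    if days_since_last_access < 0:
        raise ValueError("days_since_last_access must be non-negative")
    # Thresholds are ordered from coldest to warmest, so the first match wins.
    for min_idle_days, storage_class in AUTOCLASS_THRESHOLDS:
        if days_since_last_access >= min_idle_days:
            return storage_class
    return "STANDARD"
```

For example, `autoclass_storage_class(45)` falls in the Nearline band under these assumed thresholds, while an object read yesterday is back in Standard.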

Cost comparison

The comparison assumes that the total size of the data is 50 TB.

Essentially, if the data is ever retrieved, Autoclass costs pretty much the same as Archive storage with manual retrieval. However, for data which is literally never (not once) retrieved during its lifetime, Archive storage is of course cheaper.

Spreadsheet with the calculations.
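For anyone who wants to play with the trade-off without opening the spreadsheet, here is a minimal sketch of the two cost models. It is not the spreadsheet's calculation. All prices are placeholder values (assumptions, in USD per GB), the month-based class thresholds approximate the assumed 30/90/365-day Autoclass transitions, and the Autoclass management fee and operation charges are ignored. The function and class names are hypothetical. The results depend entirely on the prices and time horizon plugged in, so they are not expected to match the spreadsheet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prices:
    # Placeholder prices in USD per GB; NOT taken from this issue or its
    # spreadsheet. Replace with current prices for the bucket's region.
    standard_gb_month: float = 0.020
    nearline_gb_month: float = 0.010
    coldline_gb_month: float = 0.004
    archive_gb_month: float = 0.0012
    archive_retrieval_gb: float = 0.05


def archive_manual_cost(size_gb, months, retrievals, staging_months=1,
                        prices=Prices()):
    """Option 1: keep everything in Archive, copy to staging on demand."""
    storage = size_gb * prices.archive_gb_month * months
    # Each retrieval pays the retrieval fee plus Standard storage for the
    # staging copy while the work is being done.
    per_retrieval = size_gb * (prices.archive_retrieval_gb
                               + prices.standard_gb_month * staging_months)
    return storage + retrievals * per_retrieval


def autoclass_cost(size_gb, months, access_months=(), prices=Prices()):
    """Option 2: Autoclass, with the whole dataset read in access_months.

    Simplification: idle time is tracked in whole months, with assumed
    transitions at 1 (Nearline), 3 (Coldline) and 12 (Archive) months.
    """
    accessed = set(access_months)
    last_access = 0  # the upload counts as the first access
    total = 0.0
    for month in range(months):
        if month in accessed:
            last_access = month  # access moves the data back to Standard
        idle = month - last_access
        if idle < 1:
            rate = prices.standard_gb_month
        elif idle < 3:
            rate = prices.nearline_gb_month
        elif idle < 12:
            rate = prices.coldline_gb_month
        else:
            rate = prices.archive_gb_month
        total += size_gb * rate
    return total


SIZE_GB = 50_000  # 50 TB, as assumed in the comparison above
never_read_archive = archive_manual_cost(SIZE_GB, months=24, retrievals=0)
never_read_autoclass = autoclass_cost(SIZE_GB, months=24)
```

With these placeholder prices, the never-retrieved case comes out cheaper under plain Archive, in line with the caveat above. The retrieved case is sensitive to the staging period, the number of retrievals and the horizon, so it is worth varying those inputs rather than trusting a single run.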

Recommendation

I am strongly leaning towards using Autoclass, because it takes a lot of the headache out of storage admin. There's no need to maintain a staging bucket, to manually copy data out, or to take special precautions to avoid accidentally incurring a big retrieval charge.

Furthermore, storage classes are maintained on a per-object basis, so if only specific subsets of the datasets are accessed (for example, specific ancestries), the rest of the data is unaffected and keeps moving to colder storage.

@d0choa I'd like to hear your opinion so that we can make the final decision on this.

d0choa commented 4 months ago

Go for it! Thanks for looking into it

d0choa commented 4 months ago

cc @prashantuniyal02

tskir commented 4 months ago

Thank you for reviewing this quickly @d0choa! In that case, I'll proceed with setting up the bucket with Autoclass.

tskir commented 3 months ago

The vault is set up. All details, including dataset ingestion instructions, will be stored in a private repository (I'll circulate the link in a Slack channel).