opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Archive old release data #1516

Closed: JarrodBaker closed this issue 2 years ago

JarrodBaker commented 3 years ago

As an administrator of Open Targets, I want to automatically move old release data to cold storage after a certain period of time, to reduce operating expenditure.

Background

Google Cloud Storage has several categories of 'at rest' data storage, each with a different cost per GB per month. The different categories are explained here. Currently, all of our data is in Standard storage.

We can reduce our costs by moving data to a lower-access storage class: at-rest storage costs are reduced by a factor of ~5 for Coldline storage and ~16 for Archive storage.

The data would still be available from the FTP server if required, or could be restored from the lower storage class.
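
For context, the storage class of an individual object can be changed in place (it is implemented as a rewrite). A minimal sketch using the google-cloud-storage Python client; the bucket and object names are hypothetical, not the confirmed release layout:

```python
# Minimal sketch: move one object to Coldline storage.
# Assumes the google-cloud-storage Python client; bucket/object names are hypothetical.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("open-targets-data-releases")    # hypothetical bucket name
blob = bucket.blob("20.06/output/etl/targets.parquet")  # hypothetical object path

# The rewrite is billed as an operation in the destination class (see the technical note below).
blob.update_storage_class("COLDLINE")
print(blob.storage_class)  # -> "COLDLINE"
```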

Technical note

The first time we archive a large amount of data we would expect to see a spike in expenses, as moving data to a new storage category is billed as an operation on the more expensive category. E.g. moving 1000 items from Standard storage to Archive storage would be billed as 1000 operations in the Archive category. (See Operations costs.)
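
To make the trade-off concrete, a back-of-the-envelope sketch; all prices are placeholders to be filled in from the GCS pricing pages, nothing here is a quoted figure:

```python
# Rough estimate of the one-off archival spike vs. the recurring at-rest saving.
# All prices are placeholders to be taken from the GCS pricing pages.
n_objects = 1000          # objects moved, as in the example above
total_gb = 500.0          # hypothetical total size of the archived releases, in GB
op_price = 0.0            # placeholder: per-operation price in the destination class
standard_gb_month = 0.0   # placeholder: Standard at-rest price per GB per month
archive_gb_month = 0.0    # placeholder: Archive at-rest price per GB per month

one_off_spike = n_objects * op_price
monthly_saving = total_gb * (standard_gb_month - archive_gb_month)
break_even_months = one_off_spike / monthly_saving if monthly_saving else float("inf")
print(f"spike ${one_off_spike:.2f}, saving ${monthly_saving:.2f}/month, "
      f"break-even after {break_even_months:.1f} months")
```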

Acceptance tests

  1. When I examine a recent release, all the artifacts are in Standard storage.
  2. When I examine an old release (how old?), the artifacts are in either Archive or Coldline storage (see the verification sketch after this list).
  3. When I examine an old release in 'archived mode', that release data is available via the EBI FTP server.
  4. When a new release is completed, the oldest non-archived release data is archived.
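
A hedged sketch of how tests 1 and 2 could be checked automatically, assuming the google-cloud-storage Python client and a hypothetical bucket and release-prefix layout:

```python
# Report the storage classes used by the artifacts of one release.
# Bucket name and "YY.MM/" prefix layout are assumptions, not the confirmed layout.
from google.cloud import storage

def storage_classes_for_release(bucket_name: str, release_prefix: str) -> set[str]:
    client = storage.Client()
    return {blob.storage_class for blob in client.list_blobs(bucket_name, prefix=release_prefix)}

# A recent release should report {"STANDARD"}; an archived one {"ARCHIVE"} or {"COLDLINE"}.
print(storage_classes_for_release("open-targets-data-releases", "21.11/"))
```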

d0choa commented 2 years ago

@JarrodBaker and @cmalangone, the savings in the Platform are probably not huge. If this comes easily to you, go for it, but the savings might not be worth your time.

cmalangone commented 2 years ago

I think we could move the old data to a lower-cost 'at rest' storage class. We need to check the granularity of the bucket, and either change the storage class of the existing bucket or create a new bucket with the lower-cost storage class.

But I think @JarrodBaker makes a good point about the cost and the rationale for moving towards lower-cost 'at rest' storage.
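
One option that sidesteps per-folder management is a bucket-level lifecycle rule that transitions objects by age. A sketch, assuming the google-cloud-storage Python client; the 365-day threshold and bucket name are illustrative:

```python
# Add a lifecycle rule: objects still in Standard storage move to Coldline after a year.
# Assumes the google-cloud-storage Python client; bucket name and age are illustrative.
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("open-targets-data-releases")  # hypothetical bucket name

bucket.add_lifecycle_set_storage_class_rule(
    "COLDLINE", age=365, matches_storage_class=["STANDARD"]
)
bucket.patch()  # push the updated lifecycle configuration to the bucket
```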

andrewhercules commented 2 years ago

I'll investigate what the costs would be, but for now it would be ideal to move all data from releases in 2020 and earlier to Archive storage. This would only impact the data on GCP, as all data would continue to be available via the EBI FTP service.
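
A sketch of that batch move, assuming the google-cloud-storage Python client and a hypothetical "YY.MM/" release-prefix layout; the prefix list below is illustrative, not the actual set of releases:

```python
# Move every artifact of the listed old releases to Archive storage.
# Prefixes and bucket name are illustrative; each rewrite is billed as an
# Archive-class operation, so expect the one-off spike noted earlier.
from google.cloud import storage

OLD_RELEASE_PREFIXES = ["19.11/", "20.02/", "20.04/", "20.06/", "20.11/"]  # illustrative

client = storage.Client()
bucket = client.bucket("open-targets-data-releases")  # hypothetical bucket name

for prefix in OLD_RELEASE_PREFIXES:
    for blob in client.list_blobs(bucket, prefix=prefix):
        if blob.storage_class != "ARCHIVE":
            blob.update_storage_class("ARCHIVE")
```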

andrewhercules commented 2 years ago

@ktsirigos, this is something you should discuss with @JarrodBaker and also with Google to see if we can optimise our storage and retrieval costs. This could also help prevent the increased costs seen in the open-targets-prod project in Q3 and Q4 2021.

d0choa commented 2 years ago

@mbdebian, do you see any pending actions after your most recent review?

mbdebian commented 2 years ago

Although our data release bucket is populated with all of our release data, we only offer the latest release on BigQuery, and all of the data releases are available through the EBI FTP service. Unless there is a use case for our community to access our data via the Google Cloud Storage API, I would suggest simply removing old data releases from our bucket, probably keeping just the last year's worth of data, while having all of them at the EBI FTP.

Note that the type of storage behind a bucket is set for the whole bucket; it cannot be fine-tuned on a per-folder basis.

@d0choa and @JarrodBaker, Coldline storage is cheaper than Standard storage; we just need to think about the cost peak of transferring the data from our current bucket to the Coldline / Archive bucket. But keep in mind that this only marginally improves our operational cost, at the expense of making it more expensive and slower for the community that might be consuming our data releases from Google Storage.
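
If the clean-up route is taken instead, a hedged sketch of what it could look like, assuming the google-cloud-storage Python client and a hypothetical "YY.MM/" prefix layout; the cut-off is illustrative, and the dry-run flag is there because this deletes data:

```python
# Delete releases older than a cut-off from the bucket (they remain on the EBI FTP).
# Bucket name, "YY.MM/" layout and cut-off are assumptions; dry_run guards against surprises.
from google.cloud import storage

BUCKET_NAME = "open-targets-data-releases"  # hypothetical
CUTOFF = "21.06"                            # keep releases from 21.06 onwards (illustrative)
dry_run = True

client = storage.Client()
for blob in client.list_blobs(BUCKET_NAME):
    release = blob.name.split("/", 1)[0]    # e.g. "20.11"
    if release < CUTOFF:                    # lexicographic compare works for zero-padded "YY.MM"
        if dry_run:
            print(f"would delete {blob.name}")
        else:
            blob.delete()
```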