alexiswl opened 1 week ago
@skanwal
Do these come in as Standard by default, or are they placed into Intelligent-Tiering with the default Frequent Access tier and then need to be moved into Archive Instant Access? It seems like a lot of bams are already in Intelligent-Tiering.
Please run this past Flo to check. Is this managed by the bucket lifecycle? @reisingerf
The bucket (byob prefix) is configured to push everything into IT (see here).
AFAIK we can't decide on the storage tier when it's under IT (this is then handled automatically)
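For context, a bucket-wide "push everything into IT" setup would look something like the sketch below. The rule ID and prefix are assumptions for illustration, not the actual bucket configuration:

```python
# Hypothetical sketch of a lifecycle rule that pushes a whole prefix into
# Intelligent-Tiering on day 0. The rule ID and "byob/" prefix are assumed,
# not taken from the real bucket config.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "byob-everything-to-intelligent-tiering",  # assumed name
            "Status": "Enabled",
            "Filter": {"Prefix": "byob/"},  # assumed prefix
            "Transitions": [
                {"Days": 0, "StorageClass": "INTELLIGENT_TIERING"}
            ],
        }
    ]
}

# Once objects are in INTELLIGENT_TIERING, AWS moves them between the
# Frequent / Infrequent / Archive Instant access tiers automatically; a
# lifecycle rule cannot pin an object to a specific access tier.
```

This is the shape accepted by `put_bucket_lifecycle_configuration`; the key point is that the transition target is the `INTELLIGENT_TIERING` storage class as a whole, not one of its access tiers.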
Hang on.
I recall from early discussions that we didn't want to put any objects in an operational-ready store (like the pipeline-cache bucket) into archive tier classes, but rather move them into a dedicated archive bucket. Are we changing that view now due to the complexity of the current situation? Let's catch up again, please.
> AFAIK we can't decide on the storage tier when it's under IT (this is then handled automatically)
Adding to this: I think we can specify Archive Access or Deep Archive Access, but then it's no longer instant-ish retrieval, and objects need to be restored before access. But yeah, it looks like the other tiers are handled automatically by AWS.
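Opting in to those two tiers is done per bucket via an Intelligent-Tiering archive configuration rather than a lifecycle rule. A minimal sketch (the `Id` and day counts are assumptions; 90 and 180 are the minimum allowed values for each tier):

```python
# Hypothetical Intelligent-Tiering archive configuration, as passed to
# boto3's put_bucket_intelligent_tiering_configuration. Id is assumed.
intelligent_tiering_config = {
    "Id": "archive-after-90-days",  # assumed name
    "Status": "Enabled",
    "Tierings": [
        # 90 days is the minimum for ARCHIVE_ACCESS,
        # 180 days the minimum for DEEP_ARCHIVE_ACCESS.
        {"Days": 90, "AccessTier": "ARCHIVE_ACCESS"},
        {"Days": 180, "AccessTier": "DEEP_ARCHIVE_ACCESS"},
    ],
}

# Objects moved into these two tiers need an explicit restore before they
# can be read, unlike the Archive Instant Access tier.
```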
What about 'S3 Glacier Instant Retrieval', which we could force .bam files into after, say, one week? The storage pricing is the same as Archive Instant Access ($5 per TB per month), but we don't need to wait out 30 days of standard storage ($25 per TB per month) plus another 60 days of the Infrequent Access tier ($13 per TB per month) to get to Archive Instant Access, which is also immediately reset to Frequent Access whenever the data is touched.
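One wrinkle with forcing only .bam files: lifecycle filters match on prefix, object tags, or object size, not on file extension, so a suffix like `.bam` can't be targeted directly. A tag-based sketch (tag key/value and rule ID are assumptions; the tag would need to be applied at upload time):

```python
# Hypothetical lifecycle rule moving tagged bam objects into Glacier
# Instant Retrieval after 7 days. S3 lifecycle filters cannot match a
# ".bam" suffix, so this sketch assumes an object tag set at upload.
bam_to_glacier_ir_rule = {
    "ID": "bam-to-glacier-instant-retrieval",  # assumed name
    "Status": "Enabled",
    "Filter": {"Tag": {"Key": "file-type", "Value": "bam"}},  # assumed tag
    "Transitions": [
        {"Days": 7, "StorageClass": "GLACIER_IR"}
    ],
}
```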
The API / retrieval pricing of S3 Glacier Instant Retrieval is $0.03 per GB, so a 100 GB bam would cost $3 to retrieve.
The same bam would cost $5.10 in the first 90 days of storage on Intelligent Tiering.
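The $5.10 figure follows from the quoted per-TB prices: 30 days in the Frequent tier plus 60 days in the Infrequent tier. Reproducing the arithmetic:

```python
# Cost comparison for a 100 GB bam, using the prices quoted above:
# $25/TB-month Frequent Access, $13/TB-month Infrequent Access,
# $0.03/GB Glacier Instant Retrieval one-off retrieval charge.
size_gb = 100

# Glacier Instant Retrieval: one-off retrieval charge (~$3).
glacier_ir_retrieval = size_gb * 0.03

# Intelligent-Tiering, first 90 days:
# 1 month Frequent Access + 2 months Infrequent Access (~$5.10).
frequent_per_gb_month = 25 / 1000    # $0.025
infrequent_per_gb_month = 13 / 1000  # $0.013
it_first_90_days = size_gb * (
    1 * frequent_per_gb_month + 2 * infrequent_per_gb_month
)
```

So retrieving the bam once from Glacier Instant Retrieval ($3) is cheaper than just holding it in Intelligent-Tiering for its first 90 days ($5.10), before the Archive Instant tier kicks in.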
All good points, but these are optimisations in my view...
Ultimately, I'd like to get to a point where we have different storage back-ends, with different retention / tiering options, and can choose between them based on use case (project, research, clinical, ... ) and potentially cost attribution. I think the OrcaBus system can handle that, but it will take some time to get set up and automated.
Having said that: Yes, for well known use cases / projects, we could start by changing the lifecycle configuration and manage it on a per cohort/project prefix rather than for the whole BYOB share. Note: we need to change and split the current setup (instead of just "overwriting" with more specific configurations). See: https://docs.aws.amazon.com/AmazonS3/latest/userguide/lifecycle-conflicts.html
Example: we have a set of cohort data split into projects, with the following requirement: any bam files in the `cohort-*` directories should be sent to the Intelligent-Tiering 'Archive Instant Access' tier.
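Per the lifecycle-conflicts doc linked above, splitting the current bucket-wide rule into per-prefix rules might look like the sketch below. All rule IDs and prefixes are assumptions (and the real layout would need disjoint prefixes so the rules don't conflict); note a lifecycle rule cannot target the Archive Instant Access tier directly, so the cohort rule uses Glacier Instant Retrieval as the closest forceable equivalent:

```python
# Hypothetical split of one bucket-wide rule into two per-prefix rules.
# Prefixes are assumed to be disjoint to avoid the rule conflicts the
# AWS lifecycle-conflicts doc warns about.
split_rules = [
    {
        "ID": "cohort-bams-early-archive",  # assumed name
        "Status": "Enabled",
        "Filter": {"Prefix": "byob/cohorts/"},  # assumed layout
        "Transitions": [{"Days": 7, "StorageClass": "GLACIER_IR"}],
    },
    {
        "ID": "default-intelligent-tiering",  # assumed name
        "Status": "Enabled",
        "Filter": {"Prefix": "byob/projects/"},  # assumed layout
        "Transitions": [{"Days": 0, "StorageClass": "INTELLIGENT_TIERING"}],
    },
]
```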