umccr / orcabus

 🐋 UMCCR Pipeline & Workflow Orchestration

All large files in cohort folders should be placed on archiveInstant tiering #719

Open alexiswl opened 1 week ago

alexiswl commented 1 week ago

We have a set of cohort data split into project prefixes:

aws s3 ls s3://pipeline-prod-cache-503977275616-ap-southeast-2/byob-icav2/
                           PRE cohort-apgi-prod/
                           PRE cohort-brca-atlas-prod/
                           PRE cohort-column-pi-prod/
                           PRE cohort-hmf-pdac-prod/
                           PRE cohort-pdac-prod/
                           PRE cohort-super-prod/
                           PRE ctdna-tso-v2-6-validation-prod/
                           PRE external-agrf-prod/
                           PRE production/
                           PRE reference-data/
                           PRE validation-data/
                           PRE wgs-accreditation-prod/

Any .bam files under the cohort-* prefixes should be moved to the Intelligent-Tiering 'Archive Instant Access' tier.
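A hedged sketch of how the affected objects could be enumerated (bucket and prefix are taken from the listing above; the loop and JMESPath query are illustrative, not an agreed procedure — this also surfaces each object's current StorageClass, which is relevant to the question below):

```shell
# Sketch: list every .bam under each cohort-* prefix, with its storage class and size.
BUCKET=pipeline-prod-cache-503977275616-ap-southeast-2
for prefix in $(aws s3api list-objects-v2 --bucket "$BUCKET" \
                  --prefix byob-icav2/cohort- --delimiter / \
                  --query 'CommonPrefixes[].Prefix' --output text); do
  aws s3api list-objects-v2 --bucket "$BUCKET" --prefix "$prefix" \
    --query "Contents[?ends_with(Key, '.bam')].[Key, StorageClass, Size]" \
    --output text
done
```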

alexiswl commented 1 week ago

@skanwal

mmalenic commented 1 week ago

Do these come in as Standard by default, or are they placed into Intelligent Tiering with the default frequent access, and then need to be moved into Archive Instant? Seems like a lot of bams are already in Intelligent Tiering.

victorskl commented 1 week ago

Pls run this past Flo first. Is this managed by the bucket lifecycle? @reisingerf

reisingerf commented 1 week ago

The bucket (byob prefix) is configured to push everything into IT (see here).

AFAIK we can't decide on the storage tier when it's under IT (this is then handled automatically)
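For reference, a lifecycle rule of roughly this shape would produce that behaviour (a sketch only — the rule ID and exact filter are assumptions, not the actual deployed configuration):

```shell
# Hypothetical shape of the current rule: transition everything under the
# byob prefix into Intelligent-Tiering immediately after upload.
aws s3api put-bucket-lifecycle-configuration \
  --bucket pipeline-prod-cache-503977275616-ap-southeast-2 \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "byob-to-intelligent-tiering",
      "Status": "Enabled",
      "Filter": {"Prefix": "byob-icav2/"},
      "Transitions": [{"Days": 0, "StorageClass": "INTELLIGENT_TIERING"}]
    }]
  }'
```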

victorskl commented 6 days ago

Hang on.

I recall from early discussions that we did not want to put any objects in an operational-ready store (like the pipeline-cache bucket) into archive tier classes, but rather move them into a dedicated archive bucket.

Do we change this view now, due to complexity / the current situation? Let us catch up again, pls.

mmalenic commented 6 days ago

AFAIK we can't decide on the storage tier when it's under IT (this is then handled automatically)

Adding to this, I think we can specify Archive Access or Deep Archive Access, but then it's no longer instant-ish retrieval, and it needs to be restored. But yeah, it looks like the other tiers are handled automatically by AWS.
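Those opt-in asynchronous tiers are configured per bucket rather than per object; a sketch of what that opt-in looks like (the configuration ID, prefix, and 90-day threshold are illustrative — and note this is exactly the non-instant retrieval caveat above):

```shell
# Hypothetical opt-in to the IT Archive Access tier: objects not accessed for
# 90+ days move to Archive Access and then require a restore before reading.
aws s3api put-bucket-intelligent-tiering-configuration \
  --bucket pipeline-prod-cache-503977275616-ap-southeast-2 \
  --id cohort-archive-access \
  --intelligent-tiering-configuration '{
    "Id": "cohort-archive-access",
    "Status": "Enabled",
    "Filter": {"Prefix": "byob-icav2/cohort-"},
    "Tierings": [{"Days": 90, "AccessTier": "ARCHIVE_ACCESS"}]
  }'
```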

alexiswl commented 5 days ago

What about 'S3 Glacier Instant Retrieval', which we could force .bam files into after, say, one week? It has the same storage pricing as Archive Instant ($5 per TB per month), but we wouldn't need to wait through 30 days of Standard storage ($25 per TB per month) plus another 60 days of the Infrequent Access tier ($13 per TB) to reach Archive Instant Retrieval, which is immediately reset to Frequent Access whenever the data is touched.

The API / retrieval pricing of S3 Glacier Instant Retrieval is $0.03 per GB, so a 100 GB bam would cost $3 to retrieve.

The same bam would cost $5.10 for its first 90 days of storage in Intelligent Tiering.
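The arithmetic behind those two figures, using the prices quoted above ($0.03/GB Glacier IR retrieval; $25/TB/month for the first 30 days, $13/TB/month for the next 60):

```shell
# Reproduce the cost comparison for a 100 GB bam from the quoted prices.
awk 'BEGIN {
  gb = 100
  retrieval  = 0.03 * gb            # one-off Glacier IR retrieval fee
  frequent   = 25 / 1000 * gb * 1   # first 30 days (Frequent Access / Standard rate)
  infrequent = 13 / 1000 * gb * 2   # next 60 days (Infrequent Access rate)
  printf "glacier_ir_retrieval=%.2f\n", retrieval
  printf "it_first_90_days=%.2f\n", frequent + infrequent
}'
# glacier_ir_retrieval=3.00
# it_first_90_days=5.10
```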

reisingerf commented 5 days ago

All good points, but optimisations in my view...

Ultimately, I'd like to get to a point where we have different storage back-ends, with different retention / tiering options, and can choose between them based on use case (project, research, clinical, ... ) and potentially cost attribution. I think the OrcaBus system can handle that, but it will take some time to get set up and automated.

Having said that: Yes, for well known use cases / projects, we could start by changing the lifecycle configuration and manage it on a per cohort/project prefix rather than for the whole BYOB share. Note: we need to change and split the current setup (instead of just "overwriting" with more specific configurations). See: https://docs.aws.amazon.com/AmazonS3/latest/userguide/lifecycle-conflicts.html