populationgenomics / automated-interpretation-pipeline

Rare Disease variant prioritisation MVP
MIT License

Autoclass is changing all the file modified dates #272

Closed MattWellie closed 1 year ago

MattWellie commented 1 year ago

Report Index page generation currently uses the Last Modified date on GCP files to decide which files to present as 'latest'. The AutoClass migration seems to be breaking this approach... This issue outlines a working hypothesis...

Every time an old file is downgraded to a different storage tier, it is copied to a new location (??). This creates a new edit date (??). This results in older files being flagged as latest on the report.

Still need to nail down the cause here.

Might transition to using the dates in file paths instead of using created/modified dates.
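A minimal sketch of what selecting by path date could look like, assuming report paths contain a `YYYY-MM-DD` folder as in `reanalysis/2023-02-20/summary_output.html`. The function names and the regex are hypothetical and are not taken from the AIP codebase.

```python
import re
from datetime import date
from typing import Iterable, Optional

# Assumption: the run date appears as a YYYY-MM-DD folder in the object path.
DATE_IN_PATH = re.compile(r"(?:^|/)(\d{4})-(\d{2})-(\d{2})(?:/|$)")


def path_date(path: str) -> Optional[date]:
    """Return the date embedded in an object path, or None if absent or invalid."""
    match = DATE_IN_PATH.search(path)
    if match is None:
        return None
    try:
        return date(*(int(part) for part in match.groups()))
    except ValueError:
        # Matches the pattern but is not a real calendar date, e.g. 2023-13-40
        return None


def latest_by_path_date(paths: Iterable[str]) -> Optional[str]:
    """Pick the path with the most recent embedded date, ignoring undated paths."""
    dated = [(path_date(p), p) for p in paths]
    dated = [(d, p) for d, p in dated if d is not None]
    if not dated:
        return None
    return max(dated)[1]
```

Because the date comes from the object name, a change to the object's modified timestamp would not alter the result.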

cassimons commented 1 year ago

Ugh, what a pain, and what bad semantics from GCP.

lgruen commented 1 year ago

The original migration incurred a change of the "modified time", as we had to make a full copy of all bucket contents to change the bucket's class to Autoclass. However, as far as I understand this should have been a one-time thing -- i.e. that should not happen for class migrations now that Autoclass is active...

MattWellie commented 1 year ago

Dropping in an example here - I've tried to tiptoe around this as I believe loading the HTML page may re-locate the file into standard storage and reset all the dates...

https://console.cloud.google.com/storage/browser/cpg-ravenscroft-arch-main-web/reanalysis/2023-02-20/summary_output.html

This file has a created date of 17 Feb 2023, 10:04, matching the time of the AutoClass migration, and a modified date of 19 Mar 2023, 10:54. The storage class is Nearline, so it has not been accessed since the AutoClass migration. That last modified change came ~30 days after the migration took place, and the modified time was around Saturday at midnight, so it must have been an automated change.

The file I linked above was superseded by a new run in a separate folder on March 13th, which was picked up in the report index at the time. Due to the modified timestamp change the older file has now retaken position as 'latest' in the AIP index.
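The failure mode can be reproduced with a small sketch, assuming the index picks the object with the greatest modified time. The `updated` attribute mirrors the name used on `google.cloud.storage` blobs, but plain stand-in objects are used here so nothing touches a real bucket. The path and timestamp of the 13 March run are hypothetical, as is the assumption that the older file's modified time initially equalled its created time.

```python
from datetime import datetime
from types import SimpleNamespace


def latest_by_modified(blobs):
    """Pick the object with the most recent modified timestamp."""
    return max(blobs, key=lambda blob: blob.updated)


# Older report: timestamps taken from the example in this thread.
old = SimpleNamespace(
    name="reanalysis/2023-02-20/summary_output.html",
    updated=datetime(2023, 2, 17, 10, 4),  # assumed equal to the created time
)
# Newer report: hypothetical path and time for the 13 March run.
new = SimpleNamespace(
    name="reanalysis/2023-03-13/summary_output.html",
    updated=datetime(2023, 3, 13, 12, 0),
)

before = latest_by_modified([old, new]).name

# The unexplained timestamp change roughly 30 days after the migration.
old.updated = datetime(2023, 3, 19, 10, 54)

after = latest_by_modified([old, new]).name
```

Here `before` is the newer report and `after` is the older one, which matches the regression seen in the index.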

If this is right and the modified date is being tweaked by AutoClass, that's a bit irritating, but a clean solution is for AIP to log analysis outputs into Metamist and read the latest reports the same way. I probably should have been doing that from the start, so it's not a huge problem.

MattWellie commented 1 year ago

I'm closing this as mitigated by the move to Metamist result logging

lgruen commented 1 year ago

I'll ask Google tech support about this anyway.