monarch-initiative / monarch-app

Monarch Initiative website and API
https://monarchinitiative.org/
BSD 3-Clause "New" or "Revised" License

Documenting Monarch Data Archiving and PURL strategy #358

Open sagehrke opened 1 year ago

sagehrke commented 1 year ago

From Monarch Data call on 2023-Sept-28, the following was identified as our Monarch Data Archiving and PURL strategy moving forward:

  1. data.mi.org is being equipped with a fairly stable file structure (great), but: better safe than sorry, so use .htaccess redirects to complement this for stability. Kevin also comments that this is a great way to document which files are important to the outside world.
  2. No one argued against the stability of data.mi.org. There is a positive feeling that it can be maintained on a reasonable time horizon (5-10 years), and no one seems to be against managing our PURLs on data.mi.org.
  3. Archiving is a very different issue; while Zenodo isn't great for access to individual files, it is a great place for "permanence".
  4. TL;DR: we will manage Monarch PURLs on data.mi.org with an .htaccess file, and when the project dies, we redirect the PURLs to the Zenodo mega dumps.
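The redirect layer described above might look something like the following .htaccess sketch. All paths, release dates, and the Zenodo record ID are hypothetical, invented purely for illustration; the real data.mi.org layout and PURL scheme may differ:

```apache
# Hypothetical .htaccess sketch (illustrative paths only).
# Each PURL is pinned to a concrete file location via a redirect,
# so files can move on disk without breaking published URLs.
RewriteEngine On

# "latest" PURL always resolves to the most recent release;
# 302 (temporary) because the target changes with every release.
RewriteRule ^monarch-kg/latest/(.*)$ /releases/2023-09-28/$1 [R=302,L]

# Versioned PURLs are stable; 301 (permanent) is appropriate.
RewriteRule ^monarch-kg/2023-09-28/(.*)$ /releases/2023-09-28/$1 [R=301,L]

# End-of-life scenario: redirect all PURLs to the Zenodo archive record
# (record ID is a placeholder, not a real deposit).
# RewriteRule ^monarch-kg/.*$ https://zenodo.org/record/0000000 [R=301,L]
```

A nice side effect, as noted in point 1, is that the rule file itself becomes documentation of which URLs are guaranteed to the outside world.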
sagehrke commented 1 year ago

@kltm responded to the above:

@matentzn responded:

matentzn commented 1 year ago

Just as a quick summary of what is needed:

  1. We need a way to access individual release assets (tables, ingests, mapping sets, semantic similarity tables) via some form of reasonably stable URL. We need access to (1) every versioned release (version URL) and (2) the "latest" release.
  2. We need a way to archive our data for posterity persistently. There is no way around Zenodo (or some similar system) here, due to the size of our data.
  3. Because of Zenodo's limitations, we have to support a "file server" that is independent of our archive. In particular: (1) Zenodo is not a file server, (2) you cannot access the latest individual release assets, and (3) (less relevant than 1 and 2) the DOIs Zenodo dishes out are not mnemonic (e.g. some semi-random combination of characters), which diminishes the marketing effect of our URLs. There is a bit of conceptual overlap between an archive and a file server that supports versions (they are not the same), but I am not too bothered about it tbh.

IMO, the current strategy should be:

  1. Use data.mi.org as a fileserver while the project lasts, for both the latest and versioned files.
  2. While we do so, we archive our data for posterity on Zenodo for persistence.
  3. To increase stability of URLs on data.mi.org (even if this does not reach the conceptual stability of PURLs) we decouple file location from URL using a redirect system using htaccess.
kltm commented 1 year ago

I'd agree with @matentzn completely on 1 and 2, with the addition that there should be a priority on 2 to also use Zenodo for failsafe/recovery or figure out another system to do so. For 3, I think it might be a bit much to specify mechanism so early when only the effect is desired. URL stability can also be accomplished by really considering a robust layout scheme and only making additions to it moving forward. There are many forwarding, mirroring, and mapping mechanisms out there, so no need to overspecify on htaccess. As well, mapping, remapping, changing tech, changing sites: it can quickly become a hard-to-maintain mess. Better to get it right the first time, if that option is on the table.

matentzn commented 1 year ago

> As well, mapping, remapping, changing tech, changing sites: it can quickly become a hard-to-maintain mess. Better to get it right the first time, if that option is on the table.

Hm. I see we value two important things differently: you value simplicity of the solution, which makes everything more maintainable in the future, while I value early flexibility (at the expense of simplicity), decoupling location from path so that changes to the file structure are always possible, and creating a sort of "contract" for the data engineering team that shows which files they must absolutely guarantee access to (which may be only a fraction of the files served). It's not that I am diametrically opposed to your position @kltm, but it's an 80-20 sort of stance. I have much less trust in the idea that "we can get it right the first time" than you do :)