monarch-initiative / monarch-app

Monarch Initiative website and API
https://monarchinitiative.org/
BSD 3-Clause "New" or "Revised" License

Documenting Monarch Data Archiving and PURL strategy #358

Open sagehrke opened 1 year ago

sagehrke commented 1 year ago

From Monarch Data call on 2023-Sept-28, the following was identified as our Monarch Data Archiving and PURL strategy moving forward:

  1. data.mi.org is being equipped with a fairly stable file structure (great), but: better safe than sorry, so use .htaccess redirects to complement this for stability. Kevin also comments that this is a great way to document which files are important to the outside world.
  2. No one argued against the stability of data.mi.org. There is a positive feeling that it can be maintained on a reasonable time horizon (5-10 years), and no one seems to be against managing our PURLs on data.mi.org.
  3. Archiving is a very different issue; while Zenodo isn't great for access to individual files, it is a great place for "permanence".
  4. TL;DR: we will manage Monarch PURLs on data.mi.org with an .htaccess file, and when the project dies, we redirect the PURLs to the Zenodo mega dumps.
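The redirect layer described above might look something like the following .htaccess sketch. All paths, release dates, and the Zenodo record ID are hypothetical, invented purely for illustration; the real data.mi.org layout and PURL scheme may differ:

```apache
# Hypothetical .htaccess sketch (illustrative paths only).
# Each PURL is pinned to a concrete file location via a redirect,
# so files can move on disk without breaking published URLs.
RewriteEngine On

# "latest" PURL always resolves to the most recent release;
# 302 (temporary) because the target changes with every release.
RewriteRule ^monarch-kg/latest/(.*)$ /releases/2023-09-28/$1 [R=302,L]

# Versioned PURLs are stable; 301 (permanent) is appropriate.
RewriteRule ^monarch-kg/2023-09-28/(.*)$ /releases/2023-09-28/$1 [R=301,L]

# End-of-life scenario: redirect all PURLs to the Zenodo archive record
# (record ID is a placeholder, not a real deposit).
# RewriteRule ^monarch-kg/.*$ https://zenodo.org/record/0000000 [R=301,L]
```

A nice side effect, as noted in point 1, is that the rule file itself becomes documentation of which URLs are guaranteed to the outside world.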
sagehrke commented 1 year ago

@kltm responded to the above:

@matentzn responded:

matentzn commented 1 year ago

Just as a quick summary of what is needed:

  1. We need a way to access individual release assets (tables, ingests, mapping sets, semantic similarity tables) via some form of reasonably stable URL. We need access to (1) every versioned release (version URL) and (2) the "latest" release.
  2. We need a way to archive our data for posterity persistently. There is no way around Zenodo (or some similar system) here, due to the size of our data.
  3. Because of Zenodo's limitations, we have to support a "file server" that is independent of our archive. In particular: (1) Zenodo is not a file server, (2) you cannot access the latest individual release assets, and (3) (less relevant than 1 and 2) the DOIs Zenodo dishes out are not mnemonic (e.g. some semi-random combination of characters), which diminishes the marketing effect of our URLs. There is a bit of conceptual overlap between an archive and a file server that supports versions (they are not the same), but I am not too bothered about it tbh.

IMO, the current strategy should be:

  1. Use data.mi.org as a fileserver while the project lasts, for both the latest and versioned files.
  2. While we do so, we archive our data for posterity on Zenodo for persistence.
  3. To increase stability of URLs on data.mi.org (even if this does not reach the conceptual stability of PURLs) we decouple file location from URL using a redirect system using htaccess.
kltm commented 1 year ago

I'd agree with @matentzn completely on 1 and 2, with the addition that there should be a priority on 2 to also use Zenodo for failsafe/recovery or figure out another system to do so. For 3, I think it might be a bit much to specify mechanism so early when only the effect is desired. URL stability can also be accomplished by really considering a robust layout scheme and only making additions to it moving forward. There are many forwarding, mirroring, and mapping mechanisms out there, so no need to overspecify on htaccess. As well, mapping, remapping, changing tech, changing sites: it can quickly become a hard-to-maintain mess. Better to get it right the first time, if that option is on the table.

matentzn commented 1 year ago

> As well, mapping, remapping, changing tech, changing sites: it can quickly become a hard-to-maintain mess. Better to get it right the first time, if that option is on the table.

Hm. I see we value two important things differently: you value simplicity of the solution, which makes everything more maintainable in the future, while I value early flexibility (at the expense of simplicity), decoupling location from path so that changes to the file structure are always possible, and creating a sort of "contract" for the data engineering team that shows which files they must absolutely guarantee access to (which may be only a fraction of the files served). It's not that I am diametrically opposed to your position @kltm, but it's an 80-20 sort of stance. I have much less trust in the idea that "we can get it right the first time" than you do :)