jhu-idc / iDC-general

Contains non-code-base specific tickets relating to the Islandora8 for Digital Collection project

Investigate to fully understand when new ids/uris are minted by Islandora for its resources #52

Closed htpvu closed 4 years ago

htpvu commented 4 years ago

This is in the context of deciding whether system-generated IDs can be used as persistent and citable IDs. To make this decision, we need to firmly understand how persistent the Islandora-generated IDs are, and how, when, and for which resources they are minted.

In case of a complete and total system crash/destruction, in which the objects in the system have to be regenerated, we need to know whether previously assigned system-generated IDs can be reinstated to ensure that the persistent IDs are still... persistent.

htpvu commented 4 years ago

This is related to the iDC19 use case, which is still to be updated.

bseeger commented 4 years ago

Drupal's MySQL DB has a table called "node" which contains info on all nodes in the system (objects created from content types like Repository Item). The table has a column called "nid", the node ID, which is auto-incremented on node creation (it uses AUTO_INCREMENT in MySQL to come up with the next ID). This "nid" is then used in the repository item URL: http://i8p.cloud.library.jhu.edu:8000/node/{{nid}}.

So the nids are specific to a particular install of Drupal. It would be extremely hard to rebuild a system from scratch and get each nid to match the same object it did before. You'd have to know the exact ingest order and whether any objects had been deleted (and therefore which IDs were used up and skipped).

Within a running system, the node ID, and therefore the URL, should stay the same for an object unless the object is deleted.
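To illustrate the ingest-order problem, here is a minimal sketch. It uses SQLite's AUTOINCREMENT as a stand-in for MySQL's AUTO_INCREMENT, and the table layout and titles are made up for the example; it is not Drupal's actual schema.

```python
import sqlite3


def ingest(titles, delete_after=None):
    """Insert titles in order into a fresh 'node' table and return {title: nid}.

    delete_after is an optional set of titles that are deleted right after
    being inserted, simulating objects that were removed from the repository.
    """
    conn = sqlite3.connect(":memory:")
    # Simplified stand-in for Drupal's node table (assumption, not the real schema).
    conn.execute(
        "CREATE TABLE node (nid INTEGER PRIMARY KEY AUTOINCREMENT, title TEXT)"
    )
    for title in titles:
        conn.execute("INSERT INTO node (title) VALUES (?)", (title,))
        if delete_after and title in delete_after:
            # The deleted row's nid is used up and not handed out again.
            conn.execute("DELETE FROM node WHERE title = ?", (title,))
    rows = conn.execute("SELECT title, nid FROM node").fetchall()
    conn.close()
    return dict(rows)


# Original system: "draft" was ingested second and later deleted.
original = ingest(["map", "draft", "letter"], delete_after={"draft"})

# Rebuild from scratch with only the surviving objects.
rebuilt = ingest(["map", "letter"])

print(original)
print(rebuilt)
```

In this sketch "letter" ends up with a different nid after the rebuild, so its /node/{{nid}} URL would change unless the rebuild replays the deleted object as well.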

Drupal offers a few ways to alias these URLs so they don't show the nid (the URL with the nid will still resolve with these options, btw):

  1. One is via a Drupal core module named Path that allows an admin to assign a different path to an object. This is available out of the box. (example: http://i8p.cloud.library.jhu.edu:8000/handle/1234 is really: http://i8p.cloud.library.jhu.edu:8000/node/76/)
  2. There is also a module called Pathauto that automatically creates paths for certain nodes based on system tokens (tokens are based on fields). It's incredibly configurable, and once configured the path is just there with no extra work on our part. We could have it automatically create paths based on node title, or any field on a node. How to construct paths from Drupal fields would have to be thought through a bit, but it is possibly a sound approach.

Path is a Drupal Core module - meaning it comes with Drupal.

Pathauto is a Drupal module supported by the Drupal community. It seems to be well supported.

bseeger commented 4 years ago

What I was thinking about was coming up with a way, based on the metadata, to create a unique URL for an item. For example, if you had a certain piece of metadata, you could generate the URL. But that requires unique pieces of metadata and I'm not sure we will have that for all data, unless we use something like handles for all items.

Because of this, Pathauto might not be a good fit, unless we have one distinct and required piece of metadata on every object. Otherwise we might get a duplicate URL.
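A rough sketch of the duplicate-URL concern, assuming aliases were derived from the node title alone. The `slug` function, the `/items/` prefix, and the sample titles are all hypothetical; this is not Pathauto's token logic, only an illustration of why a non-unique field makes a poor alias source.

```python
import re
from collections import defaultdict


def slug(title):
    """Lowercase the title and collapse runs of non-alphanumerics into hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def find_collisions(items):
    """Return {alias: [nids]} for every alias that more than one node would claim."""
    by_alias = defaultdict(list)
    for nid, title in items.items():
        by_alias["/items/" + slug(title)].append(nid)
    return {alias: nids for alias, nids in by_alias.items() if len(nids) > 1}


# Hypothetical nodes: two different objects with near-identical titles.
items = {
    76: "Letter to the Editor",
    77: "Annual Report 1901",
    90: "Letter to the editor!",
}

collisions = find_collisions(items)
print(collisions)
```

Running a check like this over the planned alias field before committing to it would show how often the metadata fails to be unique.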

We might want to talk about minting our own URLs for items and setting them in the Drupal Path Alias so that Drupal will redirect these paths to the proper node. I would base it on something that we store in the data, so we can go back and recreate the objects if the system were to totally melt down.

Everything I think of seems to need some unique piece of metadata - be it a handle or identifier that won't change.

One option is to keep a mapping of Object -> URL outside of Islandora. Then we can use that map to create webserver redirects if we ever need them. But what piece of metadata do we use to uniquely identify the Object, if we're not guaranteed to have a handle?
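As a sketch of what such an external map could look like, the snippet below reads a small CSV mapping and emits redirect rules in the style of Apache's mod_alias `Redirect` directive. The CSV column names and the second row are assumptions for illustration; only the /handle/1234 to /node/76 pair comes from the example earlier in this thread.

```python
import csv
import io

# Hypothetical mapping kept outside Islandora, one row per object.
MAPPING_CSV = """persistent_path,node_path
/handle/1234,/node/76
/handle/1235,/node/77
"""


def redirect_rules(mapping_text):
    """Turn the external mapping into one redirect rule per object."""
    reader = csv.DictReader(io.StringIO(mapping_text))
    return [
        f"Redirect permanent {row['persistent_path']} {row['node_path']}"
        for row in reader
    ]


for rule in redirect_rules(MAPPING_CSV):
    print(rule)
```

The open question from above still applies: after a rebuild the `node_path` column has to be regenerated, which needs some stable identifier on each object to match rows against.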

emetsger commented 4 years ago

In prep for the iDC meeting today, just some comments/questions, as I understand the issue.

Under what circumstances will the URL for an entity in Drupal/Islandora change? It sounds like the answer is a worst-case scenario: corruption or failure of the database from which there is no recovery, where the only option is to re-ingest the content. (If the database is corrupted or fails but we follow best practices, then it can be restored. In that scenario, which we'd need to test, IDs/primary keys won't change and URLs remain the same.)

If the catastrophic scenario occurs (i.e. restoration requires re-ingesting the content), there are a couple of things to consider. There's no metadata; it has been lost. And any identifiers that we have minted externally cannot be linked to the original records in the database, because their primary keys will be re-generated when the repository is re-populated on ingest. The only way externally minted URIs can be linked to records would be to 1) have a backup of the content of the repository, including item metadata, and 2) ensure that each item contains metadata that can be used as an input to a function to generate a predictable key.
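For (2), a minimal sketch of a function that derives a predictable key from item metadata, using a name-based UUID (version 5) from Python's standard library. The namespace URL and the field names `collection` and `local_identifier` are hypothetical; the thread has not settled on which fields, if any, are guaranteed to be present and unique.

```python
import uuid

# Hypothetical namespace for this repository (assumption, not an agreed value).
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/idc/items")


def predictable_key(metadata, fields=("collection", "local_identifier")):
    """Derive a stable key from selected metadata fields.

    The same field values always yield the same key, so the key can be
    regenerated from a metadata backup after a full re-ingest.
    """
    missing = [f for f in fields if not metadata.get(f)]
    if missing:
        raise ValueError(f"cannot derive key, missing fields: {missing}")
    # Join with a separator unlikely to appear in the values themselves.
    canonical = "\x1f".join(str(metadata[f]).strip() for f in fields)
    return str(uuid.uuid5(NAMESPACE, canonical))


item = {"collection": "sheet-music", "local_identifier": "box12-folder3"}
print(predictable_key(item))
```

The key is only as stable as the fields it is derived from: if a curator edits one of them, the key changes, so the chosen fields would need to be treated as immutable.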

birkland commented 4 years ago

I agree with Elliot's (1) and (2). We'll have (1). (2) definitely has benefits (i.e. if there's an ID decoupled from the URL, from which the URL is derived). Is there such an ID? Have we definitely decided to move away from handles and their ilk (like ARKs)? It's unclear whether we'd have logical/intentional IDs for records that are different from Drupal's internal identifiers.

emetsger commented 4 years ago

I take the position that if we follow best practices, there is no need for externally managed URIs for Islandora resources. The default identifier based on the node ID is sufficient, and practical.

If we wish to manage external URIs for Islandora resources, we will be committing to maintaining another set of infrastructure for their support, and another set of backup and restore practices that we don't have right now. Consider that we have attempted this already with JHIR URIs, and we have expressed reluctance and doubt about their utility and cost. Why maintain multiple URIs for a resource when it is highly unlikely that the "native" URI for a resource will change, and further that it will be costly to maintain external URI mappings? The cost to benefit ratio seems high.

birkland commented 4 years ago

I see, so your standpoint is that even if we had identifiers (it is still not clear to me, from a curation/library standpoint, whether records will be assigned intentional identifiers), we wouldn't want to do (2) anyway? That leaves (1), relying on backups, which is fine.

bseeger commented 4 years ago

Thanks for thinking about this some more. We landed on (1) in the iDC meeting; just making note of that here.

Originally I had been thinking about how to implement (2) from above, and whether we could derive a key from some metadata field. I'm fine with where this landed though, as it sounds like that's not needed.