ucd-library / ezid

Bash client Script for ezid identifier management
MIT License

UC Davis DAMS local data #1

Open qjhart opened 5 years ago

qjhart commented 5 years ago

Currently, we don't have much of a standard for metadata creation when we mint an ARK. To date, the basic plan is that we 1) add the minimum erc.who, erc.what, and erc.when to the ARK, and 2) add a resolution service to the ARK that points to the DAMS entry.

For example, if we look at this item from the Sherry Lehmann files, ark:/87287/d75k5d, we see that we've added:

success: ark:/87287/d75k5d
erc.who: Sherry Wine & Spirits Co.,Inc.
erc.when: 1957
erc.what: Annual Money Saving Sale 1957

The problem is that this is a pretty paltry way for us to know what that identifier is for. Now, you will see that we also give a target back to the DAMS:

_target: https://digital.ucdavis.edu/ark:/87287/d75k5d

So currently, there is a way to see more of the information there, but I wonder if we want a more standard method of connecting this back to additional metadata. For example, if you look at the Linked Data, you will see we have the identifier schema:identifier "Roy Brady Collection; D202, Box 5, Folder 10". So, we could include that in our ARK as well. There are a few methods for that. We could create our own metadata field for that:

ucd.identifier: Roy Brady Collection; D202, Box 5, Folder 10

This would be carried along with the ARK, but you wouldn't see it in the UI from the browser.

Or we could use a different metadata scheme for our ARKs. They are explained in detail in the API. It's not clear to me that one is better than the other.

Schema.org

Alternatively, we could include even more schema.org metadata in the ARK. In this case, we could maintain our original erc. metadata for consistency, and then add a viable subset of our complete metadata, enough to properly identify the issue.

I can see three methods for this: 1) add a schema. profile, 2) embed a schema.org text record into one field, or 3) embed a data URL into a single field. Each has its own advantages and disadvantages, which we'll discuss below. In every example below, we are looking to mark up the item with the following:

@prefix schema: <http://schema.org/> .
<>  a                schema:PublicationIssue;
      schema:creator          "Sherry Wine & Spirits Co.,Inc.";
      schema:material         "paper"^^<http://www.w3.org/2001/XMLSchema#string> ;
      schema:publisher        "Sherry Wine & Spirits Co.,Inc.";
      schema:datePublished    "1957"^^<http://www.w3.org/2001/XMLSchema#gYear> ;
      schema:name             "Annual Money Saving Sale 1957";
      schema:identifier       "ark:/87287/d75k5d";
      schema:identifier       "Roy Brady Collection; D202, Box 5, Folder 10";
      schema:about            <http://id.worldcat.org/fast/1175887> ;
      schema:about            <http://id.worldcat.org/fast/1008232> ;
      schema:isPartOf <https://digital.ucdavis.edu/fcrepo/rest/collection/sherry-lehmann>;
      schema:isPartOf <https://oac.cdlib.org/findaid/ark:/13030/c8xk8hg8/>;
      schema:license          <http://rightsstatements.org/vocab/InC-NC/1.0/> ;
      schema:sdDatePublished    "2019"^^<http://www.w3.org/2001/XMLSchema#gYear>; 
      schema:sdPublisher <http://id.loc.gov/authorities/names/no2008108707>;
      .

If you have the Jena tools, you can validate the above like:

turtle --base=foo: --syntax=turtle --output=turtle <<<"$record"

Metadata profile ark:/99999/fk4698d060

One idea might be to replicate a subset of schema.org, for example the predicates used to identify an item. You could presumably format these by using the ANVL key as the predicate and the ANVL value as the object. You could choose a format to differentiate between links and text, as well as a format to differentiate multiple values for a single key. Below is an example of this where we've used the text/turtle format to specify text/link and multiples.

This is how we'd create records like this (using our ezid script):

ezid mint --proxy=https://digital.ucdavis.edu/  \
erc.who:'Sherry Wine & Spirits Co.,Inc.'  \
erc.when:1957 \
erc.what:'Annual Money Saving Sale 1957' \
rdf.a:'<http://schema.org/PublicationIssue>' \
schema.creator:'"Sherry Wine & Spirits Co.,Inc."' \
schema.material:'"paper"' \
schema.publisher:'"Sherry Wine & Spirits Co.,Inc."' \
schema.datePublished:'"1957"'  \
schema.name:'"Annual Money Saving Sale 1957"' \
schema.identifier:'"ark:/87287/d75k5d","Roy Brady Collection; D202, Box 5, Folder 10"' \
schema.isPartOf:"<https://digital.ucdavis.edu/fcrepo/rest/collection/sherry-lehmann>,<https://oac.cdlib.org/findaid/ark:/13030/c8xk8hg8/>" \
schema.license:"<http://rightsstatements.org/vocab/InC-NC/1.0/>" \
schema.sdDatePublished:'"2019"' \
schema.sdPublisher:'<http://id.loc.gov/authorities/names/no2008108707>'

This works okay. The major problem is that the format of the values within these fields is complicated and different from that of the other items, e.g. the erc. entries. You've now created a set of fields in the record that need to be formatted differently, and it's hard to decide when to do that.

How this could be rendered on the ezid site is complicated by those same issues.
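To make the value conventions in the mint command above concrete, here is a minimal sketch of helpers that build such values. The function names `ttl_text`, `ttl_link`, and `ttl_join` are hypothetical and not part of the ezid script, and the comma-separated, Turtle-style quoting is only the convention used in this example, not something EZID requires.

```shell
# Hypothetical helpers (not part of the ezid script).

# Quote a text value as a Turtle-style string literal,
# escaping backslashes and double quotes.
ttl_text() {
  printf '"%s"' "$(printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g')"
}

# Wrap a URL in angle brackets to mark it as a link.
ttl_link() {
  printf '<%s>' "$1"
}

# Join already-formatted values with commas.
# The body runs in a subshell so the IFS change does not leak.
ttl_join() (
  IFS=,
  printf '%s' "$*"
)

# Build the two-valued schema.identifier value from the example.
ttl_join "$(ttl_text 'ark:/87287/d75k5d')" \
         "$(ttl_text 'Roy Brady Collection; D202, Box 5, Folder 10')"
```

Even with helpers like these, whoever reads the record back still has to know which fields follow this convention, which is the decision problem described above.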

Embedded Schema.org ark:/99999/fk4p85hx94

This example aligns more closely with one method of DOI identification, where the datacite field holds a complete XML record. For our case, we'd use a format like text/turtle to format the record. For example, we could store a more complete record in a single schema.org field.

Here's a more complete example of minting a new record, with our ezid tool.

ezid mint --proxy=https://digital.ucdavis.edu/ \
erc.who:'Sherry Wine & Spirits Co.,Inc.' \
erc.when:1957 \
erc.what:'Annual Money Saving Sale 1957' \
schema.org:'@prefix schema: <http://schema.org/> . <> a schema:PublicationIssue; schema:creator "Sherry Wine & Spirits Co.,Inc."; schema:material "paper"^^<http://www.w3.org/2001/XMLSchema#string> ; schema:publisher "Sherry Wine & Spirits Co.,Inc."; schema:datePublished "1957"^^<http://www.w3.org/2001/XMLSchema#gYear> ; schema:name "Annual Money Saving Sale 1957"; schema:identifier "ark:/87287/d75k5d"; schema:identifier "Roy Brady Collection; D202, Box 5, Folder 10"; schema:about <http://id.worldcat.org/fast/1175887> ; schema:about <http://id.worldcat.org/fast/1008232> ; schema:isPartOf <https://digital.ucdavis.edu/fcrepo/rest/collection/sherry-lehmann>; schema:isPartOf <https://oac.cdlib.org/findaid/ark:/13030/c8xk8hg8/>; schema:license <http://rightsstatements.org/vocab/InC-NC/1.0/> ; schema:sdDatePublished "2019"^^<http://www.w3.org/2001/XMLSchema#gYear>; schema:sdPublisher <http://id.loc.gov/authorities/names/no2008108707>; .'

This has the disadvantage that we can't query on a particular field of the schema.org record, but we have limited the formatting issue to a single field. If we didn't compel all users to use a standard format, we'd still have UI issues in general.
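The single-line value in the command above can be produced from a multi-line Turtle file by collapsing whitespace. The following is a minimal sketch under stated assumptions: `flatten_ttl` is a hypothetical helper name, `record.ttl` is a hypothetical file, and the approach is only safe when no string literal in the record contains line breaks or runs of spaces that need to be preserved.

```shell
# Hypothetical helper: read Turtle on stdin and print it on one line.
# Every run of whitespace (spaces, tabs, newlines) becomes one space,
# then a leading or trailing space is trimmed.
flatten_ttl() {
  tr -s '[:space:]' ' ' | sed 's/^ //; s/ $//'
}

# Possible use with the ezid script (record.ttl is an assumed file name):
#   ezid mint ... schema.org:"$(flatten_ttl < record.ttl)"
```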

Embedded schema.org metadata as a data.url ark:/99999/fk4cv5rt32

A final potential idea to eliminate the formatting question is to still provide a single ANVL key but use a data URL as the value. This would allow for multiple forms of data for the schema.org key. In this case, you create a data URL from your Turtle file and add that as the value.

ezid mint --proxy=https://digital.ucdavis.edu/ \
erc.who:'Sherry Wine & Spirits Co.,Inc.' \
erc.when:1957 \
erc.what:'Annual Money Saving Sale 1957' \
schema.org:'data:text/turtle;charset=utf-8;base64,QHByZWZpeCBzY2hlbWE6IDxodHRwOi8vc2hlbWEub3JnPiAuDQo8PiAgYSAgICAgICAgICAgICAgICBzY2hlbWE6UHVibGljYXRpb25Jc3N1ZTsNCiAgICAgIHNjaGVtYTpjcmVhdG9yICAgICAgICAgICJTaGVycnkgV2luZSAmIFNwaXJpdHMgQ28uLEluYy4iOw0KICAgICAgc2NoZW1hOm1hdGVyaWFsICAgICAgICAgInBhcGVyIl5ePGh0dHA6Ly93d3cudzMub3JnLzIwMDEvWE1MU2NoZW1hI3N0cmluZz4gOw0KICAgICAgc2NoZW1hOnB1Ymxpc2hlciAgICAgICAgIlNoZXJyeSBXaW5lICYgU3Bpcml0cyBDby4sSW5jLiI7DQogICAgICBzY2hlbWE6ZGF0ZVB1Ymxpc2hlZCAgICAiMTk1NyJeXjxodHRwOi8vd3d3LnczLm9yZy8yMDAxL1hNTFNjaGVtYSNnWWVhcj4gOw0KICAgICAgc2NoZW1hOm5hbWUgICAgICAgICAgICAgIkFubnVhbCBNb25leSBTYXZpbmcgU2FsZSAxOTU3IjsNCiAgICAgIHNjaGVtYTppZGVudGlmaWVyICAgICAgICJhcms6Lzg3Mjg3L2Q3NWs1ZCI7DQogICAgICBzY2hlbWE6aWRlbnRpZmllciAgICAgICAiUm95IEJyYWR5IENvbGxlY3Rpb247IEQyMDIsIEJveCA1LCBGb2xkZXIgMTAiOw0KICAgICAgc2NoZW1hOmFib3V0ICAgICAgICAgICAgPGh0dHA6Ly9pZC53b3JsZGNhdC5vcmcvZmFzdC8xMTc1ODg3PiA7DQogICAgICBzY2hlbWE6YWJvdXQgICAgICAgICAgICA8aHR0cDovL2lkLndvcmxkY2F0Lm9yZy9mYXN0LzEwMDgyMzI+IDsNCiAgICAgIHNjaGVtYTppc1BhcnRPZiA8aHR0cHM6Ly9kaWdpdGFsLnVjZGF2aXMuZWR1L2ZjcmVwby9yZXN0L2NvbGxlY3Rpb24vc2hlcnJ5LWxlaG1hbm4+Ow0KICAgICAgc2NoZW1hOmlzUGFydE9mIDxodHRwczovL29hYy5jZGxpYi5vcmcvZmluZGFpZC9hcms6LzEzMDMwL2M4eGs4aGc4Lz47DQogICAgICBzY2hlbWE6bGljZW5zZSAgICAgICAgICA8aHR0cDovL3JpZ2h0c3N0YXRlbWVudHMub3JnL3ZvY2FiL0luQy1OQy8xLjAvPiA7DQogICAgICBzY2hlbWE6c2REYXRlUHVibGlzaGVkICAgICIyMDE5Il5ePGh0dHA6Ly93d3cudzMub3JnLzIwMDEvWE1MU2NoZW1hI2dZZWFyPjsgDQogICAgICBzY2hlbWE6c2RQdWJsaXNoZXIgPGh0dHA6Ly9pZC5sb2MuZ292L2F1dGhvcml0aWVzL25hbWVzL25vMjAwODEwODcwNz47DQogICAgICAu'

This is now completely opaque to users, but it does have the advantage of providing the best out-of-the-box UI experience, since following that URL would resolve to a properly formatted file.
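For reference, a data URL like the one above can be built with standard tools. This is a minimal sketch: `ttl_to_data_url` is a hypothetical helper name, `record.ttl` is a hypothetical file, and the `charset=utf-8` parameter simply mirrors the example value.

```shell
# Hypothetical helper: read Turtle on stdin and print a base64 data URL.
ttl_to_data_url() {
  # base64 wraps its output on some systems, so strip the newlines.
  printf 'data:text/turtle;charset=utf-8;base64,%s' "$(base64 | tr -d '\n')"
}

# Possible use with the ezid script (record.ttl is an assumed file name):
#   ezid mint ... schema.org:"$(ttl_to_data_url < record.ttl)"
```

Stripping the line breaks matters because the whole URL has to travel as one value.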

qjhart commented 5 years ago

After review from our Special Collection and metadata specialists, the general consensus is that a small schema.org record would be a desirable addition to the records, with at least enough information to discover the original source of the information. That would include schema: identifier(s), license, creator, material, name, publisher and datePublished. It would probably not include any descriptive fields like :about.

Two other questions have been brought up. The first is the question of the identifier: is ucd.ttl a good name? 1) Is ucd. a good prefix, and is there a best practice for that? 2) The key .ttl is really more about the format and not the item. Maybe we should instead use something like ucd.linked_data?

Now, in terms of overthinking it: since this isn't just text, and there are multiple potential formats for linked data, do we standardize on one, either something human readable like ttl or something more machine friendly like json-ld? And if we didn't specify, and allowed multiple versions, where do we specify the format? This has been solved with data URLs, but that adds a level of obfuscation to the identifier. For example, this record:

@prefix : <http://schema.org/>. 
<>  a         :CreativeWork ; 
:creator "Sherry Wine & Spirits Co.,Inc."; 
:material "paper"; 
:publisher "Sherry Wine & Spirits Co.,Inc."; 
:name "Annual Money Saving Sale 1957"; 
:identifier "ark:/87287/d75k5d", "Roy Brady Collection; D202, Box 5, Folder 10"; 
:license <http://rightsstatements.org/vocab/InC-NC/1.0/> ; 
:datePublished    "1957" .

becomes this data URL:

data.url:data:text/turtle;base64,QHByZWZpeCA6IDxodHRwOi8vc2NoZW1hLm9yZy8+LiAKPD4gIGEgICAgICAgICA6Q3JlYXRpdmVX
b3JrIDsgCjpjcmVhdG9yICJTaGVycnkgV2luZSAmIFNwaXJpdHMgQ28uLEluYy4iOyAKOm1hdGVy
aWFsICJwYXBlciI7IAo6cHVibGlzaGVyICJTaGVycnkgV2luZSAmIFNwaXJpdHMgQ28uLEluYy4i
OyAKOm5hbWUgIkFubnVhbCBNb25leSBTYXZpbmcgU2FsZSAxOTU3IjsgCjppZGVudGlmaWVyICJh
cms6Lzg3Mjg3L2Q3NWs1ZCIsICJSb3kgQnJhZHkgQ29sbGVjdGlvbjsgRDIwMiwgQm94IDUsIEZv
bGRlciAxMCI7IAo6bGljZW5zZSA8aHR0cDovL3JpZ2h0c3N0YXRlbWVudHMub3JnL3ZvY2FiL0lu
Qy1OQy8xLjAvPiA7IAo6ZGF0ZVB1Ymxpc2hlZCAgICAiMTk1NyIgLgo=
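Because the media type travels inside the data URL, a consumer can recover both the format and the record without a separate format field. Here is a minimal sketch of that: the helper names `data_url_type` and `data_url_decode` are hypothetical, and the `-d` flag assumes a GNU, BusyBox, or recent macOS `base64`.

```shell
# Hypothetical helpers for reading a base64 data URL back.

# Print the media type, e.g. "text/turtle".
# The body runs in a subshell so the temporary variable does not leak.
data_url_type() (
  rest=${1#data:}                  # drop the "data:" scheme
  printf '%s\n' "${rest%%[;,]*}"   # keep what precedes the first ";" or ","
)

# Print the decoded payload. Spaces and line breaks are removed first,
# since the value may have been wrapped for display, as above.
data_url_decode() {
  printf '%s' "${1#*;base64,}" | tr -d ' \n' | base64 -d
}
```

A reader would pass the value after the `data.url:` key to either function, then hand the decoded text to a parser chosen by media type.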
qjhart commented 5 years ago

Notes from John Kunze: If I understood correctly, one piece of insight could come from knowing that EZID will take and record whatever arbitrary metadata you give it, and you can see it all by querying the N2T.net resolver using the '??' inflection. (One '?' gives you a brief record, and two '??' gives you the full record.) Not much of a UI, but it might tide things over for a bit?

For example, I massaged your metadata a bit and attached it to a throwaway identifier, which you can query with

https://n2t.net/ark:/99999/fk9876??

which returns

erc:  
who: Sherry Wine & Spirits Co.,Inc.; Sherry Wine & Spirits Co.,Inc.
what: Annual Money Saving Sale 1957; Annual Money Saving Sale 1957
when: 1957
where: ark:/99999/fk9876 (currently https://www.cdlib.org);
    ark:/87287/d75k5d
how: schema:CreativeWork
dc.creator: Sherry Wine & Spirits Co.,Inc.
dc.publisher: Sherry Wine & Spirits Co.,Inc.
dc.title: Annual Money Saving Sale 1957
id created: 2019.04.18_11:42:03
location: Roy Brady Collection; D202, Box 5, Folder 10
persistence: (:unav)
xx.about: http://id.worldcat.org/fast/1008232
xx.datePublished: 1957
xx.license: http://rightsstatements.org/vocab/InC-NC/1.0/
xx.material: paper
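The lookup described above is easy to script. This is a minimal sketch: `n2t_url` and `n2t_full_record` are hypothetical helper names, and the second one assumes `curl` is installed and the resolver is reachable.

```shell
# Hypothetical helper: build an N2T lookup URL for an identifier.
# $1 is the identifier; $2 is the inflection, "?" for the brief record
# or "??" for the full record (the default here).
n2t_url() {
  printf 'https://n2t.net/%s%s\n' "$1" "${2:-??}"
}

# Fetch the full record for an identifier (needs curl and network access).
n2t_full_record() {
  curl -fsSL "$(n2t_url "$1" '??')"
}

# Print the URL used in the example above.
n2t_url 'ark:/99999/fk9876'
```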

This is one of the ideas we have, most specifically a ucd:identifier to at least get to the physical object. The problem we ran into, however, was with items like xx:about. How do you specify two of them? And how do you differentiate between a link and text? Once we try to figure that out, we have to define a syntax, and our thinking is that we already have a linked data syntax, and we don't have (or really need) search capability on xx:about, so why not just use the syntax we have and put those all in one item?