netwerk-digitaal-erfgoed / cm-implementation-guidelines

Implementation guidelines for NDE alignment of cultural heritage data management and publication infrastructure
https://netwerk-digitaal-erfgoed.github.io/cm-implementation-guidelines/

Design considerations - specific or generic? #9

Closed RolfBly closed 4 years ago

RolfBly commented 4 years ago

(This is intended as a discussion piece. It's not really an 'issue')

Heritage institutes hold, in their collection management software, data about creators, for example. They want to link that data to verified external sources ("authority sources"), so that the creators of the objects in their collection can be identified unambiguously. A record for Gerrit Rietveld might have this:

Source          Number
======          ======
VIAF            https://viaf.org/viaf/15562113/ 
RKD Artists     https://rkd.nl/nl/explore/artists/66880

There are many different authority sources (Wikidata probably being the most versatile and comprehensive, but let's not digress).

The collection data itself is to be used as a source in its own right; it wants to be part of LOD. Assume the collection database has an API that can be queried by URL, and can respond with data in XML or JSON. What would be the preferred way to do so?

The specific way could look like this. Let's assume for the sake of the example that RKD requires a license statement if one re-uses their data, and VIAF doesn't. (Also, disregard human-readable vs machine-readable for now)

<person>
  <name>Rietveld, Gerrit</name>
  <role>creator</role>
  <VIAF_URI>https://viaf.org/viaf/15562113/</VIAF_URI>
  <RKD_URL>https://rkd.nl/nl/explore/artists/66880</RKD_URL>
  <RKD_URL_license>CC-BY</RKD_URL_license>
</person>

The main advantage of the specific way is that it carries the same information in fewer bytes than the generic way (see below).

(Aside: that's why I chose this method for an interface that lets users of Adlib software ingest data from Getty's AAT and the WOII thesaurus from within Adlib. Adlib shows a list of at most 20 hits for a term query; the smaller the reply, the better the responsiveness in Adlib.)

The generic way, I guess, would be preferable if one wants to use that data for some LOD application. That would look like this:

<person>
  <name>Rietveld, Gerrit</name>
  <role>creator</role>
  <source>VIAF</source>
  <source>RKD Artists</source>
  <number>https://viaf.org/viaf/15562113/</number>
  <number>https://rkd.nl/nl/explore/artists/66880</number>
  <license/>
  <license>CC-BY</license>
</person>

These field names are taken from Adlib Model Application version 4.5 and up (including Axiell Collections Model Application version 5).
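A consequence of the generic layout is that a consumer has to pair the repeated fields by position: the first source goes with the first number and the first license, and so on. That is also why the empty <license/> for VIAF cannot be left out. A minimal Python sketch of that pairing, assuming the fields always repeat in lockstep (the function name pair_authority_links is made up for this illustration and is not part of any Adlib/Axiell API):

```python
import xml.etree.ElementTree as ET

# The generic example from this thread, unchanged.
GENERIC_XML = """
<person>
  <name>Rietveld, Gerrit</name>
  <role>creator</role>
  <source>VIAF</source>
  <source>RKD Artists</source>
  <number>https://viaf.org/viaf/15562113/</number>
  <number>https://rkd.nl/nl/explore/artists/66880</number>
  <license/>
  <license>CC-BY</license>
</person>
"""


def pair_authority_links(person_xml):
    """Pair the n-th <source>, <number> and <license> of a person record.

    Hypothetical helper: it assumes the three fields repeat in lockstep.
    """
    person = ET.fromstring(person_xml)
    sources = [e.text for e in person.findall("source")]
    numbers = [e.text for e in person.findall("number")]
    licenses = [e.text for e in person.findall("license")]  # None for <license/>

    # Without equal counts the positional pairing would silently shift.
    if not (len(sources) == len(numbers) == len(licenses)):
        raise ValueError("source/number/license occurrences are out of step")

    return [
        {"source": s, "link": n, "license": lic}
        for s, n, lic in zip(sources, numbers, licenses)
    ]


if __name__ == "__main__":
    for entry in pair_authority_links(GENERIC_XML):
        print(entry)
```

If one of the repeated fields is dropped or reordered, the pairing is lost, which is the main fragility of this layout.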

I would appreciate any comments on this. That's my first question.

Furthermore, if the above is what you'd get from an Adlib/Axiell API, you can transform it server-side using XSLT. You could configure the API so that it responds with something compliant with schema.org, or with Turtle. That being the case, which format(s) would best suit LOD usability? That's my second question.

coret commented 4 years ago

To provide an alternative for the generic variant (in XML to stay in line with the given examples), I'd suggest something like:

<person>
  <name>Rietveld, Gerrit</name>
  <role>creator</role>
  <authority_sources>
    <source>
      <name>VIAF</name>
      <link>https://viaf.org/viaf/15562113/</link>
    </source>
    <source>
      <name>RKD Artists</name>
      <link>https://rkd.nl/nl/explore/artists/66880</link>
      <license>CC-BY</license>
    </source>
  </authority_sources>
</person>

In this variant it is clearer which data 'belongs together' and what each value is.
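To illustrate the point, here is a minimal Python sketch that reads the nested variant. Each <source> element is self-contained, so nothing has to be paired by position and a missing <license> simply comes back as None. The function name read_authority_sources is made up for this illustration.

```python
import xml.etree.ElementTree as ET

# The nested example from this comment.
NESTED_XML = """
<person>
  <name>Rietveld, Gerrit</name>
  <role>creator</role>
  <authority_sources>
    <source>
      <name>VIAF</name>
      <link>https://viaf.org/viaf/15562113/</link>
    </source>
    <source>
      <name>RKD Artists</name>
      <link>https://rkd.nl/nl/explore/artists/66880</link>
      <license>CC-BY</license>
    </source>
  </authority_sources>
</person>
"""


def read_authority_sources(person_xml):
    """Return one dict per <source>; absent child elements become None."""
    person = ET.fromstring(person_xml)
    return [
        {
            "name": source.findtext("name"),
            "link": source.findtext("link"),
            "license": source.findtext("license"),
        }
        for source in person.findall("authority_sources/source")
    ]


if __name__ == "__main__":
    for entry in read_authority_sources(NESTED_XML):
        print(entry)
```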

ivozandhuis commented 4 years ago

@RolfBly I'm not sure to what extent this answers your question, but the first requirement for data to be usable as Linked Data is that it is published in an RDF format. An RDF serialization encodes the triples that 'link' the cultural heritage object to its creator. None of the XML examples above, whether generic or specific, can be parsed as RDF. Does that answer your question?

RolfBly commented 4 years ago

@ivozandhuis Yes, that answers my question, but it turns out I actually was asking 'what would be the easiest basis to build the interface to LOD on?' And that's where @coret's answer comes in.

This is what wants to be part of LOD:

database -> API -> XML

As you say, that's not enough. But we can have this:

database -> API -> XML -> XSLT -> RDF

We'd have to write the XSLT that turns the XML into RDF. Would it help if the XML is as self-explanatory as possible? And what would that look like? Or doesn't it really matter? Most of the time, the end result is this:

database -> API -> XML -> interface -> datastore -> RDF

where interface and datastore are usually from a third party. The easier the XML, the sooner they'll have that interface working. Plus, ideally, interface and datastore are user-configurable.
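The XML -> RDF step in the pipelines above can be sketched in a few lines. The sketch below uses Python rather than XSLT, purely to keep it short and testable. It takes the nested XML variant suggested earlier in this thread as input and emits N-Triples. The choice of schema.org terms (schema:Person, schema:name, schema:sameAs) and the subject URI under example.org are assumptions made for this illustration, not something prescribed in this thread. Role and license are left out to keep it small.

```python
import xml.etree.ElementTree as ET

SCHEMA = "https://schema.org/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

PERSON_XML = """
<person>
  <name>Rietveld, Gerrit</name>
  <role>creator</role>
  <authority_sources>
    <source>
      <name>VIAF</name>
      <link>https://viaf.org/viaf/15562113/</link>
    </source>
    <source>
      <name>RKD Artists</name>
      <link>https://rkd.nl/nl/explore/artists/66880</link>
      <license>CC-BY</license>
    </source>
  </authority_sources>
</person>
"""


def person_to_ntriples(person_xml, subject_uri):
    """Turn one <person> record into N-Triples lines (illustrative mapping)."""
    person = ET.fromstring(person_xml)
    lines = [f"<{subject_uri}> <{RDF_TYPE}> <{SCHEMA}Person> ."]

    name = person.findtext("name")
    if name:
        # Escape backslashes and double quotes for an N-Triples string literal.
        escaped = name.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'<{subject_uri}> <{SCHEMA}name> "{escaped}" .')

    # Each authority link becomes a sameAs statement about the same person.
    for link in person.findall("authority_sources/source/link"):
        if link.text:
            lines.append(f"<{subject_uri}> <{SCHEMA}sameAs> <{link.text.strip()}> .")

    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical subject URI; a real one would be minted by the institution.
    print(person_to_ntriples(PERSON_XML, "https://example.org/person/rietveld"))
```

With the nested input, the mapping is one loop over <source> elements. With the flat generic layout, the transformation would first have to re-pair the fields by position, which suggests that self-explanatory XML does make this step easier.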

Any more thoughts?

ivozandhuis commented 4 years ago

The internal technical workflow from database to RDF differs per collection management system architecture. We consider this the responsibility of the collection management system and its supplier. In this document we abstract from it and hope the system will comply with the principles of the RDF publication paradigm (resolvable, persistent URIs, content negotiation, a SPARQL endpoint).
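Of the principles listed, content negotiation is the easiest to show in a few lines: the same URI returns a different RDF serialization depending on the client's Accept header. Below is a minimal, simplified Python sketch of the server-side choice. The list of supported media types and the function name choose_media_type are assumptions for this illustration, and partial wildcards such as application/* are deliberately not handled.

```python
# Serializations a hypothetical endpoint offers, in order of preference.
SUPPORTED = ["text/turtle", "application/ld+json", "application/rdf+xml"]


def choose_media_type(accept_header, supported=SUPPORTED):
    """Pick the best supported media type for an Accept header, or None.

    Simplified: honours q-values and "*/*", ignores partial wildcards.
    """
    candidates = []
    for position, part in enumerate(accept_header.split(",")):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        quality = 1.0  # default when no q parameter is given
        for param in fields[1:]:
            if param.startswith("q="):
                try:
                    quality = float(param[2:])
                except ValueError:
                    quality = 0.0  # treat a malformed q-value as not acceptable
        # Sort by descending quality, then by order of appearance.
        candidates.append((-quality, position, media_type))

    for negative_quality, _, media_type in sorted(candidates):
        if negative_quality == 0:
            continue  # q=0 means the client explicitly rejects this type
        if media_type in supported:
            return media_type
        if media_type == "*/*":
            return supported[0]
    return None


if __name__ == "__main__":
    print(choose_media_type("text/html;q=0.9, text/turtle"))
    print(choose_media_type("application/ld+json;q=0.8, application/rdf+xml;q=0.9"))
```

A server that gets None back would typically answer with 406 Not Acceptable or fall back to a default representation.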