Overall strategy

richardofsussex commented 1 year ago

This is the current position overall. We have scripts for:

- concept; target JSKOS; processor Jolt
- place; target JSKOS; processor Jolt
- agent; target Linked Art; processor Jolt
- package; target Linked Art; processor Jolt
- object; target Linked Art; processor XSLT 3.0

We also have sample outputs for each of these entity types - now on GitHub. Thus far there has been little review of, or comment on, these outputs. They need to be checked and 'signed off' as being sufficiently complete and adequately formatted.

Packages need updating: they lack the entity type and a _label which are expected by Linked Art. Agents don't get dates right (Jolt inadequacy) and appear in a weird order (Jolt 'feature').

We need to agree on our URL strategy, bearing in mind Linked Art's convention that the entity type should form part of the URI. We still don't have UIDs to use in the output JSON.

We should discuss XSLT vs. Jolt. My own opinion is that it is not possible to achieve a usable Linked Art result for object records using Jolt. (I have tried!) Clearly a mixed Jolt/XSLT approach isn't ideal from a maintenance point of view. It wouldn't take me long to rewrite the Jolt scripts as XSLT.

We have yet to address:

- media; example media-41006
- event; examples event-7, event-22
- publication; examples publication-20, publication-21

(Please suggest better/more official examples - I can only access part of the uber-spreadsheet, and refuse to fiddle with it. Are there any other entity types I have missed?)

We need to agree on the target format for the remaining entity types, and then I need to implement scripts for them.

richardofsussex commented 1 year ago

Packages Jolt script and example output now updated.

RGShepherd commented 1 year ago

Some notes

From the meeting on 2023-04-20

Output review

@jpadfield to review @richardofsussex's current outputs in
- https://github.com/national-gallery/NG-CIIM/tree/main/Rosetta/JOLT/outputs
- https://github.com/national-gallery/NG-CIIM/tree/main/Rosetta/XSLT/outputs

URL strategy

- Linked Art entity types: https://linked.art/api/1.0/endpoint/
- In the CIIM, add the @datatype block to @link on publication - this will enable entity types to be determined if the CIIM entity doesn't map directly to the Linked Art ones (e.g. agents)
- Set up URL rewrites to remove the Linked Art entity type from requested URIs

Target formats

events:
- Use Linked Art if it can incorporate dates, places, agents
media:
- Within the object record, deliver a Linked Art IIIF stub based on the Image API URL
- Media records in their own right: agreed that IIIF, whilst not triple-based, does meet all the requirements for Linked Data
places:
- Considered using GeoJSON, https://geojson.org/, for places in their own right; agreed to stick with JSKOS for now
publications:
- Within the object record, deliver a Linked Art publication stub record
- Publications in their own right: use BIBFRAME, https://www.loc.gov/bibframe/

Test data

Rob to provide a new data dump incorporating UIDs and @datatypes on @link
agents
- agent-36
- agent-1007
- agent-1141
- agent-1184
- agent-1647
- agent-2945
- agent-7983 (test data only)
- agent-8025
concepts
- concept-39977
events
- event-7
- event-22
- event-25
multimedia
- media-21659
- media-2791
- media-30642
- media-39830
- media-41007
- media-1812114026
- n-6611-00-000029.tif
objects
- object-1540
- object-1550
- object-1686
- object-1755
- object-3002
- object-3520
- object-3616
- object-3798
- object-4023
- object-4047
- object-4668
- object-5201
- object-5392
- object-5959
- object-6352
- object-7693
- object-7694
- object-10152
- object-10608
- object-11419
- object-11671
- object-15198
- object-15225
- object-16573
- object-17030
- object-17057
- object-18513
- object-18535
- object-18788
packages
- package-878
- place-39582
places
- place-39582
publications
- OCoLC-923565846
- OCoLC-ocm04760487
- OCoLC-ocn1135893742
- OCoLC-ocn222491363
- publication-1184
- publication-182
- publication-184
- publication-21
- publication-34
- UKLoNG-3036
- UKLoNG-9926645550001551
- UKLoNG-9927675840001551

richardofsussex commented 1 year ago

Today I've been through the object instances listed above which I hadn't looked at previously. In my test data, object-15198, object-15225 and object-16573 are not found.

RGShepherd commented 1 year ago

> Today I've been through the object instances listed above which I hadn't looked at previously. In my test data, object-15198, object-15225 and object-16573 are not found.

Expected; they're history collection / loan in objects that we don't (currently) publish

richardofsussex commented 1 year ago

I've just regenerated all the object records listed above (apart from the ones I didn't find) - they are all in the folder http://richardofsussex.me.uk/ng/ciim7-output/, e.g. http://richardofsussex.me.uk/ng/ciim7-output/object-4023.json. All the NG URLs they contain should now be in the required format, e.g. https://data.ng.ac.uk/0EFC-0008-0000-0000.

richardofsussex commented 1 year ago

Just thinking about the mechanics of this URL approach. I had assumed (wrongly) that the entity type would form part of the URL, and so its value could be parsed out and used within the transform. In particular, I have written different scripts for each entity type, so without a knowledge of that, we won't know which script to apply. (In fact, it's more extreme than that: some of the scripts are JOLT, while others are XSLT, so I don't even know which processing environment to call upon.)

Is this URL design decision irrevocable?

richardofsussex commented 1 year ago

Of course, we could parse the source JSON (which the XSLT already does), and branch the processing logic based on the value of hits.hits[0]._source.@datatype.base. However, that would require us to have a single 'script' which deals with all entity types. It would have to be written either in XSLT or in K-Int's new transformation framework.
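
The branching idea above can be sketched as follows. This is a minimal illustration in Python rather than XSLT or K-Int's framework; the entity type strings ("object", "agent", and so on) are assumed values for `@datatype.base`, and the target formats are simply those listed in the first comment of this thread.

```python
import json

def entity_type(search_response: dict) -> str:
    """Return the value found at hits.hits[0]._source.@datatype.base."""
    return search_response["hits"]["hits"][0]["_source"]["@datatype"]["base"]

# Assumed entity type strings, mapped to the targets listed at the top of this issue.
TARGET_FORMAT = {
    "concept": "JSKOS",
    "place": "JSKOS",
    "agent": "Linked Art",
    "package": "Linked Art",
    "object": "Linked Art",
}

def choose_target(search_response: dict) -> str:
    """Pick the output format for a record, failing loudly on unknown types."""
    base = entity_type(search_response)
    try:
        return TARGET_FORMAT[base]
    except KeyError:
        raise ValueError(f"no transform registered for entity type {base!r}")

# A cut-down, made-up search response containing only the path we care about.
sample = json.loads(
    '{"hits": {"hits": [{"_source": {"@datatype": {"base": "object"}}}]}}'
)
print(choose_target(sample))
```

The point is only that the entity type can be read from the record itself, so the URL does not have to carry it.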

richardofsussex commented 1 year ago

> We need to agree on our URL strategy, bearing in mind Linked Art's convention that the entity type should form part of the URI.

I had forgotten that the entity type forming part of the URI is a Linked Art thing, not something I just invented for convenience. Does that change our view on the form our URL patterns should take? As planned, they won't be Linked Art-compatible, which seems like something of an own goal.

RGShepherd commented 1 year ago

> Of course, we could parse the source JSON (which the XSLT already does), and branch the processing logic based on the value of hits.hits[0]._source.@datatype.base. However, that would require us to have a single 'script' which deals with all entity types. It would have to be written either in XSLT or in K-Int's new transformation framework.

That seems like a sensible option, and I'm fairly sure it'll be how the output transformations work in practice.

> I had forgotten that the entity type forming part of the URI is a Linked Art thing, not something I just invented for convenience. Does that change our view on the form our URL patterns should take? As planned, they won't be Linked Art-compatible, which seems like something of an own goal.

On the other hand, embedding meaning in identifiers rapidly leads to problems as entities' properties change (and strikes me as not being very 'cool'). We did, originally, start with a code (rather than a string) that embodied an entity type, but decided for various reasons that it was unsustainable. Frankly, I think Linked Art have a bit of a cheek imposing an identifier structure on contributing institutions.

That said, if it really is required: perhaps we could write an entity type into the PID path on output when we're working in a Linked Art context, and draw up URL re-writes that would quietly drop them again from all incoming requests? This is something we'll need to discuss with Rob; I think we're coming to a point where the four of us (and possibly Joe) will need to sit down together.
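
A rough sketch of that round trip, in Python purely for illustration (in practice this would be web server rewrite rules). The path layout, with the entity type as the first segment before the PID, and the set of segment names are assumptions; the real names would need to be taken from the Linked Art endpoint documentation.

```python
from urllib.parse import urlsplit, urlunsplit

# Illustrative segment names only; not an authoritative Linked Art list.
ENTITY_SEGMENTS = {"object", "person", "group", "place", "concept"}

def add_entity_type(pid_url: str, entity_segment: str) -> str:
    """On output: insert the entity type as the first path segment."""
    parts = urlsplit(pid_url)
    path = "/" + entity_segment + "/" + parts.path.lstrip("/")
    return urlunsplit(parts._replace(path=path))

def drop_entity_type(requested_url: str) -> str:
    """On input: quietly remove a leading entity type segment, if present."""
    parts = urlsplit(requested_url)
    segments = parts.path.lstrip("/").split("/")
    if len(segments) > 1 and segments[0] in ENTITY_SEGMENTS:
        segments = segments[1:]
    return urlunsplit(parts._replace(path="/" + "/".join(segments)))

pid = "https://data.ng.ac.uk/0EFC-0008-0000-0000"
linked_art_url = add_entity_type(pid, "object")
print(linked_art_url)
print(drop_entity_type(linked_art_url))
```

Because the rewrite only strips a known leading segment, requests for the plain PID form pass through unchanged.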

richardofsussex commented 1 year ago

I've created the beginnings of a unified XSLT transform, which will have three 'modes' (in the XSLT sense) of operation: Linked Art, BIBFRAME and JSKOS. I've got it working for objects, packages and agents (the latter two were originally done in Jolt). This should give us a more consistent set of outputs. I was able to re-use existing code in the combined setup, so it was quick to do.

RGShepherd commented 1 year ago

@jpadfield, I'm starting to go through Richard's test outputs one-by-one, but as our relevant Linked Data expert, your input is pretty vital here. Please can you do likewise? We should have an issue per entity type for feedback.

RGShepherd commented 1 year ago

I think packages are done - many thanks!

jpadfield commented 1 year ago

I am back now; time allowing, I am working through: https://github.com/national-gallery/NG-CIIM/tree/main/Rosetta/JOLT/outputs

richardofsussex commented 1 year ago

Can I suggest that you look instead at https://github.com/national-gallery/NG-CIIM/tree/main/Rosetta/XSLT/outputs? This is where all the most recent work has been done. It's the only place you will find object records, and I have re-worked the JSKOS outputs so that we can generate all the results using a single XSLT transform. (This may be necessary if we don't include the entity type in the NG Linked Data URLs.) I've given up on JOLT as a practical mechanism for generating our outputs, though I remain willing to look at alternatives if they are sufficiently capable.

I've just hit an issue when re-generating the agent records, which I'll look into now.

jpadfield commented 1 year ago

Several of the XSLT outputs seem to be Elasticsearch results rather than Linked Art - which ones do you want me to check?

richardofsussex commented 1 year ago

Sorry: my mistake. I was busy uploading from the 'source' folder not the 'results' one. Hopefully I have now replaced them all with actual outputs - please let me know of any I have missed.

richardofsussex commented 1 year ago

Just to update on the overall approach: I now have a single batch file which runs all the test records listed above through the generic CIIM-7 XSLT transform. This means I can quickly cross-check for side-effects and reversions as the XSLT itself is developed.
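
For anyone wanting to reproduce this kind of cross-check, here is a minimal sketch in Python. It is not the batch file itself: the transform is a stand-in function, and the 'source' and 'results' folder names and the sample records are made up for the demonstration.

```python
import json
import tempfile
from pathlib import Path

def check_batch(source_dir, baseline_dir, transform):
    """Return the file names whose fresh output differs from the saved baseline."""
    changed = []
    for source in sorted(Path(source_dir).glob("*.json")):
        fresh = transform(json.loads(source.read_text(encoding="utf-8")))
        baseline = Path(baseline_dir) / source.name
        if (not baseline.exists()
                or json.loads(baseline.read_text(encoding="utf-8")) != fresh):
            changed.append(source.name)
    return changed

def stand_in_transform(record: dict) -> dict:
    """Placeholder for the real transform: keeps only the record id."""
    return {"id": record["id"]}

with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp, "source")
    res = Path(tmp, "results")
    src.mkdir()
    res.mkdir()
    # One record whose baseline matches, one whose baseline is stale.
    (src / "object-4023.json").write_text('{"id": "object-4023", "x": 1}', encoding="utf-8")
    (res / "object-4023.json").write_text('{"id": "object-4023"}', encoding="utf-8")
    (src / "object-1540.json").write_text('{"id": "object-1540"}', encoding="utf-8")
    (res / "object-1540.json").write_text('{"id": "stale"}', encoding="utf-8")
    changed = check_batch(src, res, stand_in_transform)

print(changed)
```

Comparing parsed JSON rather than raw text means that differences in whitespace or key order alone are not reported as changes.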

richardofsussex commented 1 year ago

Luke's comment about including XSLT as a potential input mechanism for Rosetta made me realise that we need to be clear about the distinction between input to and output from the CIIM. On reflection, I don't think that an XSLT option adds much on the ingest side, especially if Proteus lives up to its promise. One issue is that of scalability: XSLT uses an in-memory model, so importing massive files becomes a problem, since they have to be parsed and become a DOM (as do the stylesheet and the output). From memory, Jolt is an in-memory process too. I've pointed out to Luke that a streaming model for Proteus would be a good idea, given that ingests will typically be a set of sibling records which can each be treated independently of the others.
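
To illustrate what a streaming model means for a file of sibling records, here is a small sketch using Python's standard library. It says nothing about how Proteus works; the `record` element name and the sample data are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

def iter_records(source, record_tag="record"):
    """Yield each record element in turn, then discard it to keep memory flat."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            yield elem
            root.clear()  # drop siblings that have already been processed

# A tiny made-up ingest file: three independent sibling records.
sample = io.BytesIO(
    b"<records><record id='a'/><record id='b'/><record id='c'/></records>"
)
ids = [rec.get("id") for rec in iter_records(sample)]
print(ids)
```

Only one record is held at a time, so memory use depends on the largest single record rather than on the size of the whole file.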

On the output side, we have agreed that our transform will only ever process one record at a time, so the scalability issue doesn't arise, whatever transform approach we adopt.

richardofsussex commented 1 year ago

As we seem to have reached a point where further data improvement and advice (e.g. on BIBFRAME) is needed, I have updated my active XSLT script files in this archive. I've done it by editing 'in place', which hopefully means that we will have a history of their evolution.

RGShepherd commented 1 year ago

Notes re Proteus following meeting with K-Int 30/10/2023:

- There is no XSL handler built into Rosetta just yet
- Proteus can call out to linked records to retrieve data, without the need to rewrite the indexes
- Both XML and JSON records are held in memory for processing, but an XML record is significantly larger than the equivalent JSON
- Proteus templates can be changed to work in sequence

K-Int could take one of Richard's XSL templates and convert it to Proteus, and then take a view on how best to convert the remainder

richardofsussex commented 1 year ago

I don't think the memory footprint issue is a material concern on output, since only single records are being output. No-one is using, or proposing, the use of XSLT for ingest.

It should be possible to plug an existing XSLT 3.0 processor into the Rosetta framework - no need to develop one.

My XSLT now has a single entry point - ciim7-generic.xsl - which includes a generic library file and three format-specific files. It deduces the target type from the input itself, and then applies appropriate [XSLT] 'modes' to achieve the desired result. I'm happy to explain in more detail to anyone from K-Int who is willing to have a go at rendering it as a Proteus script.

national-gallery / NG-CIIM

Overall strategy #17
