national-gallery / NG-CIIM

Development of Gallery-instigated CIIM configurations and plugins; not the Gallery's CIIM itself.

Overall strategy #17

Open richardofsussex opened 1 year ago

richardofsussex commented 1 year ago

This is the current position overall. We have scripts for:

We also have sample outputs for each of these entity types - now on GitHub. Thus far there has been little review of, or comment on, these outputs. They need to be checked and 'signed off' as being sufficiently complete and adequately formatted.

Packages need updating: they lack the entity type and a _label which are expected by Linked Art. Agents don't get dates right (Jolt inadequacy) and appear in a weird order (Jolt 'feature').
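As a rough illustration of the check involved, the sketch below flags a record that lacks the two properties mentioned. The function name, the sample records and the `Set` type value are all hypothetical; Linked Art's own documentation remains the authority on what is actually required.

```python
import json

# Top-level keys this sketch looks for. This is an assumption drawn from the
# comment above, not a full statement of Linked Art's requirements.
EXPECTED_KEYS = ("type", "_label")


def missing_keys(record: dict) -> list[str]:
    """Return the expected top-level keys that are absent or empty."""
    return [key for key in EXPECTED_KEYS if not record.get(key)]


# Hypothetical package output lacking both properties.
before = json.loads('{"id": "https://example.org/package/1"}')
# Hypothetical output once the two properties have been added.
after = dict(before, type="Set", _label="Example package")

print(missing_keys(before))
print(missing_keys(after))
```

A check like this could be run over every sample output before it is put forward for sign-off.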

We need to agree on our URL strategy, bearing in mind Linked Art's convention that the entity type should form part of the URI. We still don't have UIDs to use in the output JSON.

We should discuss XSLT vs. Jolt. My own opinion is that it is not possible to achieve a usable Linked Art result for object records using Jolt. (I have tried!) Clearly a mixed Jolt/XSLT approach isn't ideal from a maintenance point of view. It wouldn't take me long to rewrite the Jolt scripts as XSLT.

We have yet to address:

(Please suggest better/more official examples - I can only access part of the uber-spreadsheet, and refuse to fiddle with it. Are there any other entity types I have missed?)

We need to agree on the target format for the remaining entity types, and then I need to implement scripts for them.

richardofsussex commented 1 year ago

Packages Jolt script and example output now updated.

RGShepherd commented 1 year ago

Some notes

From the meeting on 2023-04-20

Output review

URL strategy

Target formats

Test data

richardofsussex commented 1 year ago

Today I've been through the object instances listed above which I hadn't looked at previously. In my test data, object-15198, object-15225 and object-16573 are not found.

RGShepherd commented 12 months ago

> Today I've been through the object instances listed above which I hadn't looked at previously. In my test data, object-15198, object-15225 and object-16573 are not found.

Expected; they're history collection / loan in objects that we don't (currently) publish

richardofsussex commented 12 months ago

I've just regenerated all the object records listed above (apart from the ones I didn't find) - they are all in the folder http://richardofsussex.me.uk/ng/ciim7-output/, e.g. http://richardofsussex.me.uk/ng/ciim7-output/object-4023.json. All the NG URLs they contain should now be in the required format, e.g. https://data.ng.ac.uk/0EFC-0008-0000-0000.

richardofsussex commented 12 months ago

Just thinking about the mechanics of this URL approach. I had assumed (wrongly) that the entity type would form part of the URL, and so its value could be parsed out and used within the transform. In particular, I have written different scripts for each entity type, so without a knowledge of that, we won't know which script to apply. (In fact, it's more extreme than that: some of the scripts are JOLT, while others are XSLT, so I don't even know which processing environment to call upon.)

Is this URL design decision irrevocable?

richardofsussex commented 12 months ago

Of course, we could parse the source JSON (which the XSLT already does), and branch the processing logic based on the value of hits.hits[0]._source.@datatype.base. However, that would require us to have a single 'script' which deals with all entity types. It would have to be written either in XSLT or in K-Int's new transformation framework.
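A minimal sketch of that branching, in Python purely for illustration. It assumes `@datatype` is an object with a nested `base` key, and that `base` takes values such as "object", "agent" and "package"; the script file names are hypothetical.

```python
# Hypothetical mapping from entity type to the script that handles it.
SCRIPTS = {
    "object": "object.xsl",
    "agent": "agent.xsl",
    "package": "package.xsl",
}


def entity_type(search_result: dict) -> str:
    """Read hits.hits[0]._source.@datatype.base from the source JSON.

    Assumes "@datatype" is a nested object containing "base".
    """
    return search_result["hits"]["hits"][0]["_source"]["@datatype"]["base"]


def choose_script(search_result: dict) -> str:
    """Pick the script to apply, failing loudly on an unknown entity type."""
    base = entity_type(search_result)
    if base not in SCRIPTS:
        raise ValueError(f"no script registered for entity type {base!r}")
    return SCRIPTS[base]


# Hypothetical, heavily trimmed source record.
sample = {"hits": {"hits": [{"_source": {"@datatype": {"base": "agent"}}}]}}
print(choose_script(sample))
```

In a single combined transform the same lookup would select a template or mode rather than a separate file.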

richardofsussex commented 12 months ago

> We need to agree on our URL strategy, bearing in mind Linked Art's convention that the entity type should form part of the URI.

I had forgotten that the entity type forming part of the URI is a Linked Art thing, not something I just invented for convenience. Does that change our view on the form our URL patterns should take? As planned, they won't be Linked Art-compatible, which seems like something of an own goal.

RGShepherd commented 12 months ago

> Of course, we could parse the source JSON (which the XSLT already does), and branch the processing logic based on the value of hits.hits[0]._source.@datatype.base. However, that would require us to have a single 'script' which deals with all entity types. It would have to be written either in XSLT or in K-Int's new transformation framework.

That seems like a sensible option, and I'm fairly sure it'll be how the output transformations work in practice.

> I had forgotten that the entity type forming part of the URI is a Linked Art thing, not something I just invented for convenience. Does that change our view on the form our URL patterns should take? As planned, they won't be Linked Art-compatible, which seems like something of an own goal.

On the other hand, embedding meaning in identifiers rapidly leads to problems as entities' properties change (and strikes me as not being very 'cool'). We did, originally, start with a code (rather than a string) that embodied an entity type, but decided for various reasons that it was unsustainable. Frankly, I think Linked Art have a bit of a cheek imposing an identifier structure on contributing institutions.

That said, if it really is required: perhaps we could write an entity type into the PID path on output when we're working in a Linked Art context, and draw up URL re-writes that would quietly drop it again from all incoming requests? This is something we'll need to discuss with Rob; I think we're coming to a point where the four of us (and possibly Joe) will need to sit down together.
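To make the idea concrete, here is a minimal sketch in Python. The PID form follows the example URL quoted earlier in this thread; the path segment names ("object", "agent", "package") and both function names are assumptions for illustration only, and a real deployment would more likely express the incoming rule as a web server rewrite.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical entity-type path segments.
ENTITY_TYPES = {"object", "agent", "package"}


def add_type(pid_url: str, entity_type: str) -> str:
    """Insert the entity type before the PID for Linked Art output."""
    parts = urlsplit(pid_url)
    pid = parts.path.lstrip("/")
    return urlunsplit(parts._replace(path=f"/{entity_type}/{pid}"))


def strip_type(url: str) -> str:
    """Drop a leading entity-type segment from an incoming request URL."""
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    if len(segments) == 2 and segments[0] in ENTITY_TYPES:
        segments = segments[1:]
    return urlunsplit(parts._replace(path="/" + "/".join(segments)))


pid = "https://data.ng.ac.uk/0EFC-0008-0000-0000"
typed = add_type(pid, "object")
print(typed)
print(strip_type(typed))
```

Because `strip_type` leaves an untyped URL unchanged, both forms would resolve to the same PID.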

richardofsussex commented 12 months ago

I've created the beginnings of a unified XSLT transform, which will have three 'modes' (in the XSLT sense) of operation: Linked Art, BIBFRAME and JSKOS. I've got it working for objects, packages and agents (the latter two were originally done in Jolt). This should give us a more consistent set of outputs. I was able to re-use existing code in the combined setup, so it was quick to do.

RGShepherd commented 12 months ago

@jpadfield , I'm starting to go through Richard's test outputs one-by-one, but as our relevant Linked Data expert, your input is pretty vital here. Please can you do likewise? We should have an issue per entity type for feedback.

RGShepherd commented 11 months ago

I think packages are done - many thanks!

jpadfield commented 11 months ago

I am back now; time allowing, I am working through: https://github.com/national-gallery/NG-CIIM/tree/main/Rosetta/JOLT/outputs

richardofsussex commented 11 months ago

Can I suggest that you look instead at https://github.com/national-gallery/NG-CIIM/tree/main/Rosetta/XSLT/outputs? This is where all the most recent work has been done. It's the only place you will find object records, and I have re-worked the JSKOS outputs so that we can generate all the results using a single XSLT transform. (This may be necessary if we don't include the entity type in the NG Linked Data URLs.) I've given up on JOLT as a practical mechanism for generating our outputs, though I remain willing to look at alternatives if they are sufficiently capable.

I've just hit an issue when re-generating the agent records, which I'll look into now.

jpadfield commented 11 months ago

Several of the XSLT outputs seem to be Elasticsearch results rather than Linked Art - which ones do you want me to check?

richardofsussex commented 11 months ago

Sorry: my mistake. I was busy uploading from the 'source' folder not the 'results' one. Hopefully I have now replaced them all with actual outputs - please let me know of any I have missed.

richardofsussex commented 11 months ago

Just to update on the overall approach: I now have a single batch file which runs all the test records listed above through the generic CIIM-7 XSLT transform. This means I can quickly cross-check for side-effects and reversions as the XSLT itself is developed.

richardofsussex commented 11 months ago

Luke's comment about including XSLT as a potential input mechanism for Rosetta made me realise that we need to be clear about the distinction between input to and output from the CIIM. On reflection, I don't think that an XSLT option adds much on the ingest side, especially if Proteus lives up to its promise. One issue is that of scalability: XSLT uses an in-memory model, so importing massive files becomes a problem, since they have to be parsed and become a DOM (as do the stylesheet and the output). From memory, Jolt is an in-memory process too. I've pointed out to Luke that a streaming model for Proteus would be a good idea, given that ingests will typically be a set of sibling records which can each be treated independently of the others.
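For illustration only, this is roughly what a streaming model over sibling records looks like, sketched in Python with the standard library's `iterparse`. It says nothing about how Proteus works or will work; the element names and the assumption that records are direct children of the root are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(stream, tag="record"):
    """Yield each complete record element in turn, then release it.

    Assumes the records are direct children of the root element.
    """
    context = ET.iterparse(stream, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield elem
            # Detach processed records from the root so memory use stays flat.
            root.clear()


# Hypothetical ingest file containing three sibling records.
sample = b"<records><record id='a'/><record id='b'/><record id='c'/></records>"
for record in iter_records(io.BytesIO(sample)):
    print(record.get("id"))
```

Each record is handled and discarded before the next is read, so the size of the file as a whole stops mattering.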

On the output side, we have agreed that our transform will only ever process one record at a time, so the scalability issue doesn't arise, whatever transform approach we adopt.

richardofsussex commented 9 months ago

As we seem to have reached a point where further data improvement and advice (e.g. on BIBFRAME) is needed, I have updated my active XSLT script files in this archive. I've done it by editing 'in place', which hopefully means that we will have a history of their evolution.

RGShepherd commented 9 months ago

Notes re Proteus following meeting with K-Int 30/10/2023:

K-Int could take one of Richard's XSL templates and convert it to Proteus, and then take a view on how best to convert the remainder

richardofsussex commented 9 months ago

I don't think the memory footprint issue is a material concern on output, since only single records are being output. No-one is using, or proposing, the use of XSLT for ingest.

It should be possible to plug an existing XSLT 3.0 processor into the Rosetta framework - no need to develop one.

My XSLT now has a single entry point - ciim7-generic.xsl - which includes a generic library file and three format-specific files. It deduces the target type from the input itself, and then applies appropriate [XSLT] 'modes' to achieve the desired result. I'm happy to explain in more detail to anyone from K-Int who is willing to have a go at rendering it as a Proteus script.