Munged array contents - Githubissues

richardofsussex commented 1 year ago

Where two or more arrays in the source are mapped to the same Linked Art key, the contents of each member of these arrays are merged by default, creating nonsense entries. Here is an example, now that I am mapping both bibliography and description to referred_to_by:

Here the two type="LinguisticObject" members are the result of a typo in the CIIM output; the third one is the type from the first description entry. This first member of referred_to_by contains all the citations (which suggests a possible way forward), plus the first description's value. The second member of referred_to_by contains a single description entry.

The 'way forward' might be to ensure that we have a two-level structure for each of these arrays, and then assign an explicit array offset to each of them. (The two-level structure design is already the case for the citations.)
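As a rough illustration of the collision and of the explicit-offset idea, here is a minimal Python sketch. It is a simplified model only, not Jolt's actual implementation: the function names `merge_by_index` and `place_with_offsets` and the sample records are hypothetical, and the real CIIM data has a deeper structure.

```python
from itertools import zip_longest


def merge_by_index(*arrays):
    """Model the problem: entries at the same index share one output slot."""
    merged = []
    for slot in zip_longest(*arrays):
        entry = {}
        for item in slot:
            if item is None:  # this source array had no entry at this index
                continue
            for key, value in item.items():
                # values from different sources pile up under the same key
                entry.setdefault(key, []).append(value)
        merged.append(entry)
    return merged


def place_with_offsets(*arrays):
    """Model the proposed fix: each source array starts at its own offset."""
    output = []
    for array in arrays:
        offset = len(output)  # first free index for this source array
        for i, item in enumerate(array):
            output.insert(offset + i, dict(item))
    return output


# Hypothetical sample data standing in for the bibliography and description arrays.
citations = [
    {"type": "LinguisticObject", "content": "Citation A"},
    {"type": "LinguisticObject", "content": "Citation B"},
]
descriptions = [{"type": "LinguisticObject", "content": "Draft description"}]

munged = merge_by_index(citations, descriptions)
clean = place_with_offsets(citations, descriptions)
```

In this model the first member of `munged` mixes a citation with a description, while `clean` keeps three separate members.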

jpadfield commented 1 year ago

Your solution is not clear to me; I will need to look at the citation example.

I think one of the issues here is that types such as "cic text draft" do not really mean anything without a clear description. Both strings are LinguisticObjects; they just have different sub-classes or purposes.

Linked.art does use nested types:

"referred_to_by": [
    {
      "type": "LinguisticObject",
      "classified_as": [
        {
          "id": "http://vocab.getty.edu/aat/300435416",
          "type": "Type",
          "_label": "Description",
          "classified_as": [
            {
              "id": "http://vocab.getty.edu/aat/300418049",
              "type": "Type",
              "_label": "Brief Text"
            }
          ]
        }
      ],
      "content": "The Example Painting is a great example of exampleness."
    }
]

This comes back to some form of vocabulary question, or at least a SKOS description of the terms being used ...

jpadfield commented 1 year ago

From a user's point of view, if we include a potential series of "named" descriptions, how is one supposed to know which one to use? Some descriptions may actually have the same "types" and might just be different versions of the standard description; this would need to be clear.

We could do this with a "current" flag, but it might also be useful to add a date created for each description? So then it would be easy to just use the most recent one or even discuss a specific previous one, without us having to worry about versioning texts ...
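A minimal sketch of that selection rule in Python, assuming each description carries an optional `current` flag and an optional `created` date. Both field names, and the function name `pick_description`, are hypothetical and not part of the CIIM output or of Linked Art.

```python
from datetime import date


def pick_description(descriptions):
    """Prefer a description flagged current; otherwise take the most recent dated one."""
    flagged = [d for d in descriptions if d.get("current")]
    if flagged:
        return flagged[0]
    dated = [d for d in descriptions if d.get("created") is not None]
    if dated:
        return max(dated, key=lambda d: d["created"])
    # Nothing flagged or dated: fall back to the first entry, if there is one.
    return descriptions[0] if descriptions else None


# Hypothetical sample data: an undated draft and two dated versions.
samples = [
    {"content": "Draft text"},
    {"content": "Older text", "created": date(2019, 5, 1)},
    {"content": "Newer text", "created": date(2022, 3, 14)},
]
chosen = pick_description(samples)
```

With dates present, a consumer can also refer to a specific earlier version by its date without any separate versioning scheme.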

Linked.Art is an evolving project, so if things do not fit we can explore options.

richardofsussex commented 1 year ago

There is author and date info associated with all the descriptions apart from the first, draft, one. I can include these in the Linked Art output.

jpadfield commented 1 year ago

@RGShepherd what are your thoughts here, are you happy with using dates to version descriptions?

Also, should we link them to a purpose rather than just a type - so general description written for the CIC etc ... rather than the "cic text"

RGShepherd commented 1 year ago

There are a finite number of possible text types, so these should be mappable to an agreed vocabulary of types / identifiers. Does Jolt let us process differently / add values based on ....type?

Possible values are:

long text
short text
cic text draft
cic original text
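Because the list is finite, the mapping could be a simple lookup table. The Python sketch below assumes each source entry has a `type` key and a `value` key; the `example.org` identifiers are placeholders only, since no external ids have been agreed yet, and `to_referred_to_by` is a hypothetical helper name.

```python
# Placeholder identifiers: real ones would come from an agreed vocabulary.
TEXT_TYPE_VOCAB = {
    "long text": "https://example.org/text-type/long-text",
    "short text": "https://example.org/text-type/short-text",
    "cic text draft": "https://example.org/text-type/cic-text-draft",
    "cic original text": "https://example.org/text-type/cic-original-text",
}


def to_referred_to_by(entry):
    """Build one referred_to_by member, classified according to the source text type."""
    text_type = entry["type"]
    if text_type not in TEXT_TYPE_VOCAB:
        # Fail loudly so unmapped types can be collected and reviewed in bulk.
        raise KeyError(f"No vocabulary entry for text type: {text_type!r}")
    return {
        "type": "LinguisticObject",
        "classified_as": [
            {"id": TEXT_TYPE_VOCAB[text_type], "type": "Type", "_label": text_type}
        ],
        "content": entry["value"],
    }


member = to_referred_to_by({"type": "short text", "value": "A brief description."})
```

Raising an error on an unknown type is one design choice among several; logging and skipping would also work if unmapped types are expected during the transition.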

I'm less sure about versioning texts: we'd probably aim to have only a current text in the published indexes. This will be one for @rob-tice and me to resolve.

@richardofsussex when you say 'Linked Data-ready', do you mean that things like text types have external ids assigned to them? If so - no; that's part of this job. Can I suggest we keep a list of places where these will be needed, then review them in bulk?

richardofsussex commented 1 year ago

Does Jolt let us process differently / add values based on ....type?

If you mean "based on the value of the type key", then IIUC the answer is a firm "no". Jolt only lets you test and act on the value of keys, not data. As it happens, you also have a key e.g. "cic_text_draft" containing a (redundant?) copy of the text, so one could key (as it were) off that. But maybe that's just a lucky chance - I retain doubts as to Jolt's ability to deliver the complete Linked Art output we want to produce (once we have decided what that is). Are you on the linked-art list? If so, you will have seen Martynas Jusevicius' instant response when I asked about anyone's Jolt experience.

richardofsussex commented 1 year ago

@richardofsussex when you say 'Linked Data-ready', do you mean that things like text types have external ids assigned to them? If so - no; that's part of this job. Can I suggest we keep a list of places where these will be needed, then review them in bulk?

I was thinking as much about the format of text (e.g. of dates) in relation to Linked Art requirements. For example, one thing I've already given up on trying to do is converting year-only dates to y-m-d format for the begin-of-the-begin and end-of-the-end fields. Jolt simply isn't up to it.
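For comparison, the year-only conversion is short in a general-purpose language. This Python sketch assumes the source value is a plain four-digit year string and that full ISO 8601 timestamps are wanted for the two boundary fields; `year_to_timespan` is a hypothetical helper, not part of any existing pipeline.

```python
import re


def year_to_timespan(text):
    """Expand a four-digit year into begin/end boundaries, or return None if it is not one."""
    year = text.strip()
    if not re.fullmatch(r"\d{4}", year):
        return None  # anything other than a bare year is left for separate handling
    return {
        "type": "TimeSpan",
        "begin_of_the_begin": f"{year}-01-01T00:00:00Z",
        "end_of_the_end": f"{year}-12-31T23:59:59Z",
    }


span = year_to_timespan("1889")
```

Values such as "c. 1889" or date ranges return None here, so they would need their own rules.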

However, linking to home-grown authority lists and pulling in URLs would also be a sensible thing to want to be able to do. Again, Jolt offers nothing along these lines.

jpadfield commented 1 year ago

Can the date transformation occur before the data reaches Jolt?

richardofsussex commented 1 year ago

This is the sort of thing I'm trying to get a handle on - what the complete processing chain might/could be. Thus far the firm suggestion is that a single short sharp Jolt will sort everything out, and as you will gather I'm increasingly sceptical about this. (Conversely, I think a single XSLT 3.0 transform could do the job. I've just done a "Hello world!" test where I read object-4668 into XSLT using the json-to-xml function, and immediately wrote it out using xml-to-json. This exercise wasn't the stunning success I was hoping for, but it does suggest that the basic approach is viable.)

If we're going to need pre-processing and/or post-processing, why not do the whole job in XSLT in one go?

jpadfield commented 1 year ago

I think this relates to #5 - I think it might be tricky if we put too much into this transform - things like dates should be consistent in all forms of the data we publish.

This Jolt process is meant to express our data in the form of Linked.art. If the data we are passing to it is not ready for this type of transform, and is not specific or well understood enough, might it be that the actual transformation or updating of values should happen in the root dataset and not in a particular flavour or data presentation?

RGShepherd commented 1 year ago

I think we need a steer from @rob-tice here.

RGShepherd commented 1 year ago

@richardofsussex, are your processing concerns now met by using XSL rather than Jolt?

The question of areas where we lack external identifiers is down to shortcomings in the TMS architecture, but can be addressed - see https://github.com/national-gallery/NG-CIIM/issues/13#issuecomment-1745011697

richardofsussex commented 1 year ago

Yes, thank you. I'm perfectly happy that I can do whatever you need me to do with the data, using XSLT 3.0.

RGShepherd commented 1 year ago

So I think, in the light of Richard's comments, we can take this issue as resolved.

national-gallery / NG-CIIM

Munged array contents #3