sul-dlss / libsys-airflow

Airflow DAGS for migrating and managing ILS data into FOLIO along with other LibSys workflows
Apache License 2.0

Get digital bookplate metadata JSON documents #1177

Closed shelleydoljack closed 1 month ago

shelleydoljack commented 2 months ago

Blocked by #1175.

For each member druid of the digital bookplate collection, look up the public JSON document at purl.stanford.edu and parse out the fields we will store in our bookplates table. The Bookplates class from our old process should be used as a guide to the fields we need from the JSON. The task should send the parsed data to a downstream task that stores it in the bookplates table. If the public JSON document does not contain all the fields we need to create 979 fields (a missing image filename, for instance), pass this along via XCom so it can be reported in an email.
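A minimal sketch of the fetch step, not the actual task. It assumes the public JSON is served at `https://purl.stanford.edu/<druid>.json`; the function name `fetch_public_json`, the injectable `opener` parameter, and the shape of the failure entry are all hypothetical. The demo swaps in a fake opener so it runs without network access.

```python
import io
import json
import urllib.request

# Assumed URL pattern for the public JSON document of a druid.
PURL_URL = "https://purl.stanford.edu/{druid}.json"


def fetch_public_json(druid, opener=urllib.request.urlopen):
    """Return (document, None) on success or (None, failure_entry) on error."""
    bare = druid.removeprefix("druid:")  # accept "druid:ws066yy0421" or "ws066yy0421"
    try:
        with opener(PURL_URL.format(druid=bare), timeout=30) as response:
            return json.load(response), None
    except (OSError, ValueError):
        # OSError covers URLError/HTTPError/timeouts; ValueError covers bad JSON.
        return None, {bare: "cannot fetch"}


def fake_opener(url, timeout):
    # Stand-in for urlopen so the example does not touch the network.
    return io.BytesIO(b'{"externalIdentifier": "druid:ws066yy0421"}')


document, failure = fetch_public_json("druid:ws066yy0421", opener=fake_opener)
```

Returning the failure entry instead of raising lets the calling task collect every druid that could not be fetched and pass the list along via XCom.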

Example JSON for the ABBOTT bookplate. The fields we need:

fund name: description.identifier.value where displayLabel="Symphony Fund Name"

"identifier": [
      {
        "structuredValue": [],
        "parallelValue": [],
        "groupedValue": [],
        "value": "1065089-106-KAUNB",
        "identifier": [],
        "displayLabel": "Symphony Fund ID",
        "note": [],
        "appliesTo": []
      },
      {
        "structuredValue": [],
        "parallelValue": [],
        "groupedValue": [],
        "value": "ABBOTT",
        "identifier": [],
        "displayLabel": "Symphony Fund Name",
        "note": [],
        "appliesTo": []
      },
      {
        "structuredValue": [],
        "parallelValue": [],
        "groupedValue": [],
        "value": "KAUNB",
        "identifier": [],
        "displayLabel": "Fund Number",
        "note": [],
        "appliesTo": []
      },
      {
        "structuredValue": [],
        "parallelValue": [],
        "groupedValue": [],
        "value": "ws066yy0421_00_0001.jp2",
        "identifier": [],
        "displayLabel": "Image filename",
        "note": [],
        "appliesTo": []
      }
    ]
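A short sketch of looking up a value by its displayLabel in the description.identifier list. The helper name `identifier_value` is hypothetical, and the sample document is a trimmed copy of the ABBOTT example with the empty fields left out.

```python
from typing import Optional


def identifier_value(cocina: dict, display_label: str) -> Optional[str]:
    """Return the value of the first description.identifier entry with this displayLabel."""
    identifiers = cocina.get("description", {}).get("identifier", [])
    for entry in identifiers:
        if entry.get("displayLabel") == display_label:
            return entry.get("value")
    return None  # label not present; the caller can report it as a missing field


# Trimmed version of the ABBOTT example.
abbott = {
    "description": {
        "identifier": [
            {"value": "1065089-106-KAUNB", "displayLabel": "Symphony Fund ID"},
            {"value": "ABBOTT", "displayLabel": "Symphony Fund Name"},
            {"value": "KAUNB", "displayLabel": "Fund Number"},
            {"value": "ws066yy0421_00_0001.jp2", "displayLabel": "Image filename"},
        ]
    }
}

fund_name = identifier_value(abbott, "Symphony Fund Name")
```

Here `fund_name` is "ABBOTT"; a label that is not in the list returns None.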

druid: the externalIdentifier field, "externalIdentifier": "druid:ws066yy0421"

image filename: This appears in both the description.identifier list and the structural.contains list. Per Andrew:

In the JSON, the description will have all the fields that map to MODS and the structural will have the fields that are in contentMetadata.

In our previous process, we used the <contentMetadata> XML to get the image filename. We should pull it from the JSON equivalent, the structural.contains.structural.contains list:

"structural": {
    "contains": [
      {
        "type": "https://cocina.sul.stanford.edu/models/resources/image",
        "externalIdentifier": "https://cocina.sul.stanford.edu/fileSet/ws066yy0421-ws066yy0421_1",
        "label": "Image 1",
        "version": 5,
        "structural": {
          "contains": [
            {
              "type": "https://cocina.sul.stanford.edu/models/file",
              "externalIdentifier": "https://cocina.sul.stanford.edu/file/ws066yy0421-ws066yy0421_1/ws066yy0421_00_0001.jp2",
              "label": "ws066yy0421_00_0001.jp2",
              "filename": "ws066yy0421_00_0001.jp2",
              "size": 297491,
              "version": 5,
              "hasMimeType": "image/jp2",
              "sdrGeneratedText": false,
              "correctedForAccessibility": false,
              "hasMessageDigests": [
                {
                  "type": "sha1",
                  "digest": "ff56bbe8c4de850b7d1111cafe60b527999f5fea"
                },
                {
                  "type": "md5",
                  "digest": "fc30918e85556b0a2b6ee2a77d46c8d1"
                }
              ],
              "access": {
                "view": "world",
                "download": "world",
                "controlledDigitalLending": false
              },
              "administrative": {
                "publish": true,
                "sdrPreserve": false,
                "shelve": true
              },
              "presentation": {
                "height": 1392,
                "width": 1081
              }
            }
          ]
        }
      }
    ],
    "hasMemberOrders": [],
    "isMemberOf": [
      "druid:nh525xs4538"
    ]
  }

I think we will always take the first item in each of those lists. I will confirm.

shelleydoljack commented 2 months ago

If getting the JSON data from purl fails or writing to the table fails, add an entry to a failures list for retry: "failures": [{"druid": "cannot fetch"}, {"druid": "cannot insert into table (missing req'd field)"}]. Successful writes to the table should report new and updated data, added to XCom with something like this:

{
    "successes": {
        "new": [
            {
                "fund_name": "ABBOTT",
                "druid": "ab123cd4567",
                "filename": "image_ab123cd4567.jp2",
                "title": "Title"
            }
        ],
        "updated": [
            {
                "fund_name": "ABBOTT",
                "druid": "ab123cd4567",
                "filename": "image_ab123cd4567.jp2",
                "title": "Changed Title",
                "reason": "title changed"
            }
        ]
    }
}
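
One possible way to build that report, as a sketch only. It assumes the parsed rows are compared against the rows already in the table, keyed by druid. The function name `report_writes`, the `COMPARED_FIELDS` tuple, and the second druid in the demo are made up for illustration. Rows that have not changed are left out of the report.

```python
# Assumed column names, taken from the example payload.
COMPARED_FIELDS = ("fund_name", "filename", "title")


def report_writes(rows: list, existing: dict) -> dict:
    """Classify parsed rows as new or updated relative to the stored rows."""
    new, updated = [], []
    for row in rows:
        stored = existing.get(row["druid"])
        if stored is None:
            new.append(dict(row))
            continue
        changed = [f for f in COMPARED_FIELDS if row.get(f) != stored.get(f)]
        if changed:
            reason = ", ".join(f"{field} changed" for field in changed)
            updated.append({**row, "reason": reason})
    return {"successes": {"new": new, "updated": updated}}


existing = {
    "ab123cd4567": {
        "fund_name": "ABBOTT",
        "druid": "ab123cd4567",
        "filename": "image_ab123cd4567.jp2",
        "title": "Title",
    }
}
rows = [
    {
        "fund_name": "ABBOTT",
        "druid": "ab123cd4567",
        "filename": "image_ab123cd4567.jp2",
        "title": "Changed Title",
    },
    {
        # Made-up druid and fund for the "new" case.
        "fund_name": "EXAMPLE",
        "druid": "xy987zw6543",
        "filename": "image_xy987zw6543.jp2",
        "title": "Example Title",
    },
]

report = report_writes(rows, existing)
```

In the demo, the first row lands under "updated" with the reason "title changed" and the second under "new".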
shelleydoljack commented 2 months ago

I asked Andrew if the first item in the structural.contains.structural.contains list in the public Cocina JSON was equivalent to the XML path contentMetadata/resource[@sequence='1']/file/@id (where we used to get the image filename) and his response was:

Yes, that’s correct. The JSON doesn’t store the sequence number as a specific field but the order in the JSON is equivalent to the sequence.

So we should get the image filename from that part of the JSON.
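
A sketch of taking the first file of the first file set, which per Andrew corresponds to the old sequence 1 lookup. The helper name `first_image_filename` is hypothetical, and the sample document is a trimmed copy of the structural example from the issue.

```python
from typing import Optional


def first_image_filename(cocina: dict) -> Optional[str]:
    """Return the filename of the first file in the first file set, or None."""
    file_sets = cocina.get("structural", {}).get("contains", [])
    if not file_sets:
        return None
    files = file_sets[0].get("structural", {}).get("contains", [])
    if not files:
        return None
    return files[0].get("filename")


# Trimmed version of the structural example.
cocina = {
    "externalIdentifier": "druid:ws066yy0421",
    "structural": {
        "contains": [
            {
                "label": "Image 1",
                "structural": {
                    "contains": [
                        {
                            "label": "ws066yy0421_00_0001.jp2",
                            "filename": "ws066yy0421_00_0001.jp2",
                        }
                    ]
                },
            }
        ]
    },
}

image_filename = first_image_filename(cocina)
```

A None result would mean the document has no file to point at, which is the kind of missing field that should be reported in the email.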