isamplesorg / isamples_inabox

Provides functionality intermediate to a collection and central

How to reconcile the two different OpenContext record formats #304

Open dannymandel opened 10 months ago

dannymandel commented 10 months ago

The OpenContext JSON format has changed, but not all of the records we have are in the new format. For example, here's one in the old format:

{
"uri": "http://opencontext.org/subjects/f423496d-c695-46da-97c5-aacc89553f69", 
"label": "8110546", 
"Creator": [
{"id": "http://opencontext.org/persons/265a123e-3941-41b2-b309-5c4dd8208250", "label": "Meredith S. Chesson"}, 
{"id": "http://opencontext.org/persons/35876789-332c-4e26-89e4-1c30c9b6a0c2", "label": "R. Thomas Schaub"}, 
{"id": "http://opencontext.org/persons/f276c41d-f3c6-456c-b1c1-63325f52f37f", "label": " Walter E. Rast"}],
"updated": "2021-06-28T02:11:48Z",
"latitude": 31.13142,
"longitude": 35.52837,
"published": "2019-10-05T00:00:00Z",
"context uri": "http://opencontext.org/subjects/7c767465-af87-49ef-97a2-3d77e2885eea", 
"late bce/ce": -2550.0, 
"project uri": "http://opencontext.org/projects/685a86a5-ed68-4f92-8c50-d2dfb8f20995", 
"citation uri": "https://n2t.net/ark:/28722/k28d0b21r", 
"early bce/ce": -2850.0, 
"context label": "Jordan/Numayra/Unit SE 9-1/Locus 22", 
"item category": "Pottery", 
"project label": "Early Bronze Age Numayra"
}

and one in the new:

 {
 "uri": "http://opencontext.org/subjects/a5c171f9-9403-4e55-8f64-adf1047b703d", 
 "href": "https://opencontext.org/subjects/a5c171f9-9403-4e55-8f64-adf1047b703d",
 "icon": "https://opencontext.org/static/oc/icons-v2/object-icon-draft-2.svg", 
 "label": "Reg. 697", 
 "Creator": ["Joanna Smith"], 
 "License": ["Attribution 4.0 International (CC BY 4.0)"], 
 "updated": "2022-10-23T07:15:31Z", 
 "latitude": 35.034889, 
 "longitude": 32.421841, 
 "published": "2017-01-30T22:57:28Z", 
 "Consists of": ["glass (material)"], 
 "late bce/ce": 1000.0, 
 "citation uri": "https://n2t.net/ark:/28722/k26h4xk1f", 
 "context href": "https://opencontext.org/subjects/9d20d284-1cc2-4381-8940-0de1bfc10d87", 
 "early bce/ce": -800.0, 
 "project href": "https://opencontext.org/projects/766d9fd5-2175-41e3-b7c9-7eba6777f1f0", 
 "Creator [URI]": ["http://opencontext.org/persons/6c34c167-1a30-4820-956d-474c73c07085"], 
 "License [URI]": ["https://creativecommons.org/licenses/by/4.0"], 
 "context label": "Europe/Cyprus/Polis Chrysochous/E.F2:R09", 
 "item category": "Object", 
 "project label": "Excavations at Polis", 
 "Consists of [URI]": ["https://vocab.getty.edu/aat/300010797"], 
 "inorganic material": ["glass (material)"], 
 "inorganic material [URI]": ["https://vocab.getty.edu/aat/300010797"], 
 "inorganic material [getty-aat-300010360]": ["glass (material)"], 
 "inorganic material [getty-aat-300010360] [URI]": ["https://vocab.getty.edu/aat/300010797"]
 }

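Notice the difference between the Creator values: the old format uses a list of dictionaries with "id" and "label" keys, while the new format uses a list of plain strings, with the URIs moved to a separate "Creator [URI]" key. There is this method: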

    def _get_oc_str_or_dict_item_label(self, str_or_dict):
        """A utility method to get a dictionary label or if a string, return the string"""
        # This is a bit messy, but it should be a bit forgiving if the OC API returns
        # dict or string items for certain record attributes.
        if isinstance(str_or_dict, dict):
            # this item is a dictionary.
            return str_or_dict.get("label")
        elif isinstance(str_or_dict, str):
            return str_or_dict
        return str_or_dict

which should probably help. However, some fields, such as keywords, rely on the new Getty metadata in the OC API, which just isn't present in the older records. If we refetch everything, will it be available? Should we just no-op on the older records? Should we keep two copies of the Transformer around?
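To make the question concrete, here is a minimal sketch of a single code path that tolerates both record shapes. It is not the existing Transformer code: `creator_labels` and `material_uris` are hypothetical helper names, and treating a missing "Consists of [URI]" key as "no keywords" is just one of the options raised above (the no-op on older records), not a decision.

```python
from typing import Any, Dict, List


def creator_labels(record: Dict[str, Any]) -> List[str]:
    """Return creator names from either OpenContext record shape.

    Old format: "Creator" is a list of {"id": ..., "label": ...} dicts.
    New format: "Creator" is a list of plain strings.
    """
    labels = []
    for item in record.get("Creator") or []:
        label = item.get("label") if isinstance(item, dict) else item
        # Skip anything that is not a non-empty string; strip stray whitespace.
        if isinstance(label, str) and label.strip():
            labels.append(label.strip())
    return labels


def material_uris(record: Dict[str, Any]) -> List[str]:
    """Return Getty vocabulary URIs when present.

    Assumption: older records simply lack "Consists of [URI]", so they
    yield an empty list instead of raising an error.
    """
    return list(record.get("Consists of [URI]") or [])
```

With helpers like these, an old-format record still transforms, just without the Getty-derived keywords, so a second copy of the Transformer would not be needed for these two fields.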

@ekansa @datadavev need some guidance here.

datadavev commented 10 months ago

Generally we should use whatever is exposed by the repository when requesting a record by its identifier.

I suspect the mix of formats may be an oversight on OC's side (perhaps a cache issue?), but I think if there's agreement on a new format then that should be the one we use. Any records not conforming to the new format should be treated as invalid by iSamples (if they have not already been harvested).

So basically:

for pid in get_pids_from_oc:
  if pid in isamples:
    if record has not changed:
      continue
  get record
  if record is valid:
    insert or update record in isamples
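
The pseudocode above could be sketched in Python roughly as follows. Everything here is illustrative: `harvest`, `is_new_format`, and the `fetch` and `upsert` callables are hypothetical names, not existing isamples_inabox functions. The sketch assumes the PID listing carries an "updated" timestamp so that unchanged records can be skipped without fetching them, and it uses a heuristic validity check (an "href" key plus string-valued Creator entries) based only on the two examples in this issue.

```python
from typing import Any, Callable, Dict, Iterable, List, Optional, Tuple

Record = Dict[str, Any]


def is_new_format(record: Record) -> bool:
    """Heuristic (assumption): new-format records have an "href" key and
    list Creator values as plain strings rather than dicts."""
    creators = record.get("Creator") or []
    return "href" in record and all(isinstance(c, str) for c in creators)


def harvest(
    listing: Iterable[Tuple[str, str]],
    stored: Dict[str, str],
    fetch: Callable[[str], Optional[Record]],
    upsert: Callable[[str, Record], None],
) -> List[str]:
    """Insert or update every changed, valid record; return the PIDs written.

    listing: (pid, updated) pairs reported by OpenContext.
    stored:  pid -> updated timestamp already held by iSamples.
    """
    written = []
    for pid, updated in listing:
        if stored.get(pid) == updated:
            continue  # already in iSamples and unchanged
        record = fetch(pid)
        if record is not None and is_new_format(record):
            upsert(pid, record)
            written.append(pid)
    return written
```

Old-format records are skipped rather than raising an error, so a later run picks them up once OpenContext serves them in the new format.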

dannymandel commented 10 months ago

We should refetch all the records, preferably after Eric makes a new iSamples-specific API.

dannymandel commented 9 months ago

Keeping this open as a task: refetch all the OpenContext records and reindex them on the iSamples side.