sul-dlss / cocina-models

Cocina repository data model (implemented in Ruby)
https://sul-dlss.github.io/cocina-models/

Remediate 144K records in the Catalhoyuk collection #496

Closed arcadiafalcone closed 2 years ago

arcadiafalcone commented 2 years ago

In order to turn on date validation, all 144,426 records in the Catalhoyuk Image Collection (cj699jr4513) need to have the date encoding removed so that they pass validation. This requires figuring out how best to do a batch update this large.

Before (from bb002mx0527):

    "event": [
      {
        "date": [
          {
            "value": "2013-08-06T09:14:41",
            "type": "creation",
            "status": "primary",
            "encoding": {
              "code": "w3cdtf"
            }
          }
        ]
      }
    ],

After:

    "event": [
      {
        "date": [
          {
            "value": "2013-08-06T09:14:41",
            "type": "creation",
            "status": "primary"
        ]
      }
    ],
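
For illustration, the before/after change above could be expressed as a small Ruby function. This is only a minimal sketch that works on a plain hash parsed from JSON. `strip_event_date_encoding` is a hypothetical name, not something provided by cocina-models or the SDR API, and fetching and saving the object are left out entirely.

```ruby
require 'json'

# Hypothetical helper (not part of cocina-models): returns a deep copy of a
# description hash with "encoding" removed from every event date whose
# encoding code matches `code`. The input hash is left untouched.
def strip_event_date_encoding(description, code: 'w3cdtf')
  remediated = JSON.parse(JSON.generate(description)) # deep copy via round-trip
  Array(remediated['event']).each do |event|
    Array(event['date']).each do |date|
      date.delete('encoding') if date.dig('encoding', 'code') == code
    end
  end
  remediated
end

before = JSON.parse(<<~COCINA)
  {
    "event": [
      {
        "date": [
          {
            "value": "2013-08-06T09:14:41",
            "type": "creation",
            "status": "primary",
            "encoding": { "code": "w3cdtf" }
          }
        ]
      }
    ]
  }
COCINA

after = strip_event_date_encoding(before)
puts JSON.pretty_generate(after)
```

Only the `encoding` key is dropped; `value`, `type`, and `status` are carried over unchanged.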

mjgiarlo commented 2 years ago

@ndushay @andrewjbtw I moved this to the top of the SDR Imp. Backlog. Analysis/discussion is needed to figure out how to support this. It'd be great to get this moving this week if we can.

andrewjbtw commented 2 years ago

Pasting in the general options as I see them, from my earlier comments on Slack. I've adapted them to reflect my conclusion that I would rather not be the one to implement this if it goes the Argo route. If it is done via bulk action, though, I would like to know when it will happen so I can watch for disruptions to accessioning throughput.

The options:

The first two options would involve a ticket for someone in infrastructure. I think whoever picked up the ticket would work out how to implement it, since there are probably other options that use the internal SDR tooling (Rails console/Ruby scripting?).

Option 3 would take longer but would be a test case of how well the system can scale. In that case, I’d ask whoever picks it up to split things into batches (not all 144k at once). I've run batch reindexes and tag exports/imports of 150-200k, but neither of those bulk actions generates accessioning activity.
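
To make the batching idea concrete, here is a rough Ruby sketch of slicing the collection's 144,426 druids into fixed-size groups. The druid list is generated placeholder data and the batch size of 1,000 is an assumption, not a tested recommendation. How each batch would actually be submitted is not shown.

```ruby
# Placeholder identifiers standing in for the collection's 144,426 druids.
druids = Array.new(144_426) { |i| format('druid:placeholder%06d', i) }

# Assumed batch size; tune it to what accessioning can absorb.
batch_size = 1_000

# each_slice yields consecutive groups of at most batch_size items.
batches = druids.each_slice(batch_size).to_a

puts "#{batches.size} batches, last batch has #{batches.last.size} items"
# prints "145 batches, last batch has 426 items"
```

Running the batches one at a time, with a pause or a check on accessioning load between them, would keep any disruption bounded to a single batch.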

My two concerns with the bulk action are:

mjgiarlo commented 2 years ago

@andrewjbtw Thanks for pasting/writing that here! I've assigned this issue to myself and I'll make a proposal this week.

mjgiarlo commented 2 years ago

@arcadiafalcone Are all the erroneous W3CDTF dates under description/event?

mjgiarlo commented 2 years ago

@arcadiafalcone I'm going to try scripting this using the SDR API (pending my question immediately above). I started by testing it against this object: https://argo.stanford.edu/view/druid:bh614kx9420 Can you view the object's cocina and confirm it looks OK?

arcadiafalcone commented 2 years ago

> @arcadiafalcone Are all the erroneous W3CDTF dates under description/event?

Yes.

arcadiafalcone commented 2 years ago

> @arcadiafalcone I'm going to try scripting this using the SDR API (pending my question immediately above). I started by testing it against this object: https://argo.stanford.edu/view/druid:bh614kx9420 Can you view the object's cocina and confirm it looks OK?

Yes, that looks correct.

mjgiarlo commented 2 years ago

@arcadiafalcone OK, awesome. Should I next try the script against a sizeable subset of the data? Say 100 or 1,000 items? If so, do you care how many I remediate or which ones?

arcadiafalcone commented 2 years ago

Sounds good to me - random sample is fine.
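
As a sketch of what drawing such a random sample could look like in Ruby: the druid list below is placeholder data, and the sample size and seed are arbitrary assumptions. Seeding the generator means the same sample can be drawn again later if the results need to be rechecked.

```ruby
# Placeholder identifiers standing in for the collection's druids.
druids = Array.new(144_426) { |i| format('druid:placeholder%06d', i) }

# A fixed seed (arbitrary value) makes the sample reproducible across runs.
rng = Random.new(496)

# Array#sample picks distinct elements, so no druid appears twice.
sample = druids.sample(1_000, random: rng)

puts "Sampled #{sample.size} of #{druids.size} druids"
```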

mjgiarlo commented 2 years ago

@arcadiafalcone OK, I started by doing these 10 items. Do you mind checking a few?

druid:yz456rf4634
druid:fj176dv5658
druid:cg287jy3360
druid:ct387pm1156
druid:gz661hz6448
druid:kg979wm6693
druid:dt501sx6574
druid:dy523yv7505
druid:gj997sv1282
druid:dh306qw5388

arcadiafalcone commented 2 years ago

@mjgiarlo Looks good. Full steam ahead!

mjgiarlo commented 2 years ago

The first 1011 items have been remediated.

mjgiarlo commented 2 years ago

@arcadiafalcone This is now finished (for now). Here are the results of the latest date reports.

ISO8601

Runs clean!

EDTF

One baddie:

item_druid,collection_druid,catkey,invalid_values
druid:ky899rv1161,druid:dg570gb2904,,13499

W3CDTF

Fifteen baddies:

item_druid,collection_druid,catkey,invalid_values
druid:cf339rp7493,druid:mq209xn7521,,1992-00
druid:cv691bb8203,druid:qh156zc5648,,1946-00
druid:fm282tp6604,druid:mq209xn7521,,1995-00
druid:fz422dd2446,druid:qh156zc5648,,1946-00
druid:kz098xs4316,druid:mq209xn7521,,1995-00
druid:mc635sf9927,druid:mq209xn7521,,1972-00
druid:ms336dn2815,druid:mt839rq8746,,1945-11-20TXX:XX+01:00
druid:pk357sn8094,druid:mt839rq8746,,1945-11-20TXX:XX+01:00
druid:rd065zy4995,druid:mt839rq8746,,1945-11-20TXX:XX+01:00
druid:wd140wy2812,druid:ny315jz4678,,2002-00
druid:zh724bv4125,,,11/16/17
druid:qk058jq2233,druid:md919gh6774,,6/1/22
druid:sm043zf7254,druid:md919gh6774,,4/29/22
druid:fb002mq9407,druid:md919gh6774,,6/8/22
druid:jj716hx9049,druid:md919gh6774,,6/8/22
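
To show why values like these are rejected, here is a minimal Ruby sketch of a W3CDTF check based on a strict reading of the W3C "Date and Time Formats" note. It is an illustration only, not the validator Cocina uses, and `w3cdtf?` is a hypothetical name. Under this reading a time zone designator is required whenever a time is present, so a value such as `2013-08-06T09:14:41` would also fail.

```ruby
require 'date'

# Hypothetical check (not Cocina's validator): true when value matches one of
# the W3CDTF granularities and its month, day, and time parts are in range.
def w3cdtf?(value)
  pattern = /\A
    (?<year>\d{4})
    (?:-(?<month>\d{2})
      (?:-(?<day>\d{2})
        (?:T(?<hour>\d{2}):(?<minute>\d{2})
          (?::(?<second>\d{2})(?:\.\d+)?)?
          (?<tzd>Z|[+-]\d{2}:\d{2})
        )?
      )?
    )?
  \z/x
  m = pattern.match(value)
  return false unless m

  month = m[:month]&.to_i
  return false if month && !(1..12).cover?(month)
  return false if m[:day] && !Date.valid_date?(m[:year].to_i, month, m[:day].to_i)
  return false if m[:hour] && !(m[:hour].to_i.between?(0, 23) && m[:minute].to_i.between?(0, 59))
  return false if m[:second] && !m[:second].to_i.between?(0, 59)

  true
end

# Three values from the report above, plus two well-formed ones for contrast.
%w[1992-00 1945-11-20TXX:XX+01:00 11/16/17 1945-11-20 1997-07-16T19:20+01:00].each do |value|
  puts "#{value} => #{w3cdtf?(value)}"
end
```

The three reported values fail for different reasons: `1992-00` has a month of zero, `1945-11-20TXX:XX+01:00` has non-numeric time fields, and `11/16/17` is not in year-first form at all.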

Closing this issue for now. Let me/us know how you'd like to proceed on the sixteen items above, which are currently all that's blocking turning on date validation in Cocina.

Cc: @sul-dlss/infrastructure-team