Closed arcadiafalcone closed 2 years ago
@ndushay @andrewjbtw I moved this to the top of the SDR Imp. Backlog. Analysis/discussion is needed to figure out how to support this. It'd be great to get this moving this week if we can.
Pasting in the general options as I see them from my earlier comments on Slack. I've adapted them to reflect my conclusion: I'd prefer not to be the one who implements this if it goes the Argo route. If it's done via bulk action, though, I'd like to know when it's going to happen so I can watch for disruptions to accessioning throughput.
The options:
The first two options would involve a ticket for someone in infrastructure. Whoever picked up the ticket would work out how to implement it, since there are probably other approaches that use internal SDR tooling (Rails console / Ruby scripting?).
Option 3 would take longer but would be a test case of how well the system can scale. In that case, I'd ask whoever picks it up to split things into batches (not all 144k at once). I've run batch reindexes and tag exports/imports of 150-200k items, but neither of those bulk actions generates accessioning activity.
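The batching idea above can be sketched as follows. This is a minimal illustration, not the actual implementation; `each_slice` is plain Ruby, and the per-batch work (and any pause between batches) would be whatever remediation script or bulk action ends up being used.

```ruby
# Split a large druid list into fixed-size batches so that a bulk
# remediation doesn't flood accessioning with 144k items at once.
# The block receives each batch plus its index; the caller decides
# what to do per batch (remediate, then optionally sleep/monitor).
def in_batches(druids, batch_size: 1000)
  druids.each_slice(batch_size).with_index do |batch, i|
    yield batch, i
    # e.g. sleep between batches to let accessioning queues drain
  end
end
```

Usage would be something like `in_batches(all_druids) { |batch, i| remediate(batch) }`, where `remediate` is a hypothetical placeholder.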
My two concerns with the bulk action are:
@andrewjbtw Thanks for pasting/writing that here! I've assigned this issue to myself and I'll make a proposal this week.
@arcadiafalcone Are all the erroneous W3CDTF dates under description/event?
@arcadiafalcone I'm going to try scripting this using the SDR API (pending my question immediately above). I started by testing it against this object: https://argo.stanford.edu/view/druid:bh614kx9420 Can you view the object's cocina and confirm it looks OK?
> Are all the erroneous W3CDTF dates under description/event?

Yes.

> I'm going to try scripting this using the SDR API (pending my question immediately above). I started by testing it against this object: https://argo.stanford.edu/view/druid:bh614kx9420 Can you view the object's cocina and confirm it looks OK?

Yes, that looks correct.
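Since the fix is to remove the date encoding so the records pass validation (see the issue description below), the core of the remediation script can be sketched against the cocina JSON representation. This is a hedged sketch, assuming the description hash shape (`event` → `date` → `encoding`); the real script presumably fetches and updates objects via the SDR API / dor-services-client rather than operating on bare hashes.

```ruby
require 'json'

# Remove the W3CDTF encoding from every date under description/event
# in a cocina description hash, so the (invalid) date values no longer
# claim to be W3CDTF-encoded. Returns the mutated description.
def strip_w3cdtf_encoding(description)
  (description['event'] || []).each do |event|
    (event['date'] || []).each do |date|
      date.delete('encoding') if date.dig('encoding', 'code') == 'w3cdtf'
    end
  end
  description
end
```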
@arcadiafalcone OK, awesome. Should I next try the script against a sizeable subset of the data? Say 100 or 1,000 items? If so, do you care how many I remediate or which ones?
Sounds good to me - random sample is fine.
@arcadiafalcone OK, I started by doing these 10 items. Do you mind checking a few?
druid:yz456rf4634
druid:fj176dv5658
druid:cg287jy3360
druid:ct387pm1156
druid:gz661hz6448
druid:kg979wm6693
druid:dt501sx6574
druid:dy523yv7505
druid:gj997sv1282
druid:dh306qw5388
@mjgiarlo Looks good. Full steam ahead!
The first 1011 items have been remediated.
@arcadiafalcone This is now finished (for now). Here are the results of the latest date reports.
Runs clean!
One baddie:
```
item_druid,collection_druid,catkey,invalid_values
druid:ky899rv1161,druid:dg570gb2904,,13499
```
Fifteen baddies:
```
item_druid,collection_druid,catkey,invalid_values
druid:cf339rp7493,druid:mq209xn7521,,1992-00
druid:cv691bb8203,druid:qh156zc5648,,1946-00
druid:fm282tp6604,druid:mq209xn7521,,1995-00
druid:fz422dd2446,druid:qh156zc5648,,1946-00
druid:kz098xs4316,druid:mq209xn7521,,1995-00
druid:mc635sf9927,druid:mq209xn7521,,1972-00
druid:ms336dn2815,druid:mt839rq8746,,1945-11-20TXX:XX+01:00
druid:pk357sn8094,druid:mt839rq8746,,1945-11-20TXX:XX+01:00
druid:rd065zy4995,druid:mt839rq8746,,1945-11-20TXX:XX+01:00
druid:wd140wy2812,druid:ny315jz4678,,2002-00
druid:zh724bv4125,,,11/16/17
druid:qk058jq2233,druid:md919gh6774,,6/1/22
druid:sm043zf7254,druid:md919gh6774,,4/29/22
druid:fb002mq9407,druid:md919gh6774,,6/8/22
druid:jj716hx9049,druid:md919gh6774,,6/8/22
```
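The remaining baddies above all fail W3CDTF's date profiles: zero months ("1992-00"), five-digit years ("13499"), placeholder times ("1945-11-20TXX:XX+01:00"), and slash dates ("6/1/22"). A report like this could flag them with a simple checker; the sketch below handles only the date-only W3CDTF profiles (YYYY, YYYY-MM, YYYY-MM-DD), not the full datetime forms, and is an illustration rather than the validation Cocina will actually turn on.

```ruby
# Date-only W3CDTF profiles: YYYY, YYYY-MM, YYYY-MM-DD,
# with months 01-12 and days 01-31. Datetime profiles omitted.
W3CDTF_DATE = /\A\d{4}(-(0[1-9]|1[0-2])(-(0[1-9]|[12]\d|3[01]))?)?\z/

def w3cdtf_date?(value)
  value.match?(W3CDTF_DATE)
end
```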
Closing this issue for now. Let me/us know how you'd like to proceed on the sixteen items above, which are currently all that's blocking turning on date validation in Cocina.
Cc: @sul-dlss/infrastructure-team
In order to turn on date validation, all 144,426 records in the Catalhoyuk Image Collection (cj699jr4513) need to have the date encoding removed so that they pass validation. This requires figuring out how best to do a batch update this large.
Before (from bb002mx0527):
After: