ucldc / rikolti

calisphere harvester 2.0

[bug?] `create_stage_index` reports 12 ARK conflict version errors; when checking Nuxeo, these ARKs are seemingly unique #929

Closed christinklez closed 6 months ago

christinklez commented 6 months ago

Brief Summary

Rikolti is saying there are 12 ARK conflicts, but when I look into the records in Nuxeo, I don't observe duplicate ARKs.

When I searched Nuxeo for each of these ARKs, each one appears to be assigned to only one record.

All ARK conflicts are found within the project folder /S01185: https://nuxeo.cdlib.org/nuxeo/nxdoc/default/a89b0165-74e7-451a-a04f-3f7430ccf792/view_documents

By the way, in a prior harvesting attempt, Rikolti wasn't picking up all the objects due to the deeply nested folder structure. Some changes were made.

PS: This error predates the mega merged index, so the ARK errors are being reported from within collection 26713.

Harvesting Details

Here is the list of ARKs and the corresponding Nuxeo objects:

ark:/81235/d8gt5ff52 https://nuxeo.cdlib.org/nuxeo/nxdoc/default/018c9681-93ba-4083-9e37-ce9d878f20ba/view_documents
ark:/81235/d8b27q040 https://nuxeo.cdlib.org/nuxeo/nxdoc/default/b6192adb-10ad-4b3b-af56-6f604bd9a669/view_documents
ark:/81235/d8qb9vd6x https://nuxeo.cdlib.org/nuxeo/nxdoc/default/627ae451-28cb-4b73-92cb-dddf4fe2f1ba/view_documents
ark:/81235/d8kk94m3n https://nuxeo.cdlib.org/nuxeo/nxdoc/default/d9ca4271-8c37-445f-be86-5e9bbf0d96ea/view_documents
ark:/81235/d8t43jb00 https://nuxeo.cdlib.org/nuxeo/nxdoc/default/0b2800c5-7849-4b24-8b94-034ccc1f2077/view_documents
ark:/81235/d8ft8ds9w https://nuxeo.cdlib.org/nuxeo/nxdoc/default/ee6e4fce-cccb-4a73-8c8e-ccd953773264/view_documents
ark:/81235/d8v11vt8c https://nuxeo.cdlib.org/nuxeo/nxdoc/default/168413e6-47e0-4f05-9d63-fae75fa616ea/view_documents
ark:/81235/d8pc2th7q https://nuxeo.cdlib.org/nuxeo/nxdoc/default/7334a594-28e6-4ee6-898b-579fb13ce5ce/view_documents
ark:/81235/d86970607 https://nuxeo.cdlib.org/nuxeo/nxdoc/default/fbc85600-fabc-4f14-881a-28af0dfed5fb/view_documents
ark:/81235/d8xs5jr2k https://nuxeo.cdlib.org/nuxeo/nxdoc/default/0904382c-bc1b-445f-9053-f251671085f5/view_documents
ark:/81235/d8d50fx2t https://nuxeo.cdlib.org/nuxeo/nxdoc/default/5fe8cab7-384b-4788-acdc-6fba97eb3d07/view_documents
ark:/81235/d82j68c6g https://nuxeo.cdlib.org/nuxeo/nxdoc/default/a48ce5af-1489-4391-b549-6377c7056f5a/view_documents

Here is the error log from `create_stage_index`:

Exception: 12 errors in bulk indexing 12 records: ['[ark:/81235/d8gt5ff52]: version conflict, document already exists (current version [1])', '[ark:/81235/d8b27q040]: version conflict, document already exists (current version [1])', '[ark:/81235/d8qb9vd6x]: version conflict, document already exists (current version [1])', '[ark:/81235/d8kk94m3n]: version conflict, document already exists (current version [1])', '[ark:/81235/d8t43jb00]: version conflict, document already exists (current version [1])', '[ark:/81235/d8ft8ds9w]: version conflict, document already exists (current version [1])', '[ark:/81235/d8v11vt8c]: version conflict, document already exists (current version [1])', '[ark:/81235/d8pc2th7q]: version conflict, document already exists (current version [1])', '[ark:/81235/d86970607]: version conflict, document already exists (current version [1])', '[ark:/81235/d8xs5jr2k]: version conflict, document already exists (current version [1])', '[ark:/81235/d8d50fx2t]: version conflict, document already exists (current version [1])', '[ark:/81235/d82j68c6g]: version conflict, document already exists (current version [1])']
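For anyone who wants to pull the ARKs out of a log message like the one above and look them up in Nuxeo one by one, here is a rough sketch. It is not part of Rikolti; the function names and the regular expression are my own assumptions, based only on how the ARKs appear in this log. It also flags any ARK that shows up more than once in the message.

```python
import re
from collections import Counter

# Assumed ARK shape, based only on the ARKs quoted in this issue:
# "ark:/" + numeric NAAN + "/" + lowercase alphanumeric name.
ARK_PATTERN = re.compile(r"ark:/\d+/[0-9a-z]+")


def extract_arks(message: str) -> list[str]:
    """Return every ARK found in a log message, in order of appearance."""
    return ARK_PATTERN.findall(message)


def repeated_arks(arks: list[str]) -> dict[str, int]:
    """Return only the ARKs that appear more than once, with their counts."""
    counts = Counter(arks)
    return {ark: n for ark, n in counts.items() if n > 1}


# Shortened sample in the same shape as the create_stage_index error.
sample = (
    "Exception: 2 errors in bulk indexing 2 records: "
    "['[ark:/81235/d8gt5ff52]: version conflict, document already exists "
    "(current version [1])', "
    "'[ark:/81235/d8b27q040]: version conflict, document already exists "
    "(current version [1])']"
)

arks = extract_arks(sample)
print(arks)
print(repeated_arks(arks))
```

An empty result from `repeated_arks` only means no ARK is listed twice in the message; it does not by itself explain why the index already held those documents.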

barbarahui commented 6 months ago

@christinklez I realized that the change we're making to not add items if they already exist (https://github.com/ucldc/rikolti/issues/926) affects this issue. These 12 records will likewise not be created because they have already been added to the index at some point.

barbarahui commented 6 months ago

@christinklez Can you try rerunning the `create_stage_index` task for this collection? It should give `WARNING - document already exists; not creating.` messages for those 12 records, but the task should succeed (assuming there aren't any other problems). Then you can look at the collection on stage and see where those 12 records ended up, i.e., whether they are part of the right object.
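To illustrate the behavior described above (warn when a document already exists, fail only on other errors), here is a minimal sketch. It is not the actual Rikolti change from issue 926. It assumes the index is written through the OpenSearch bulk API with `create` actions, and that conflicts come back as items with status 409 and error type `version_conflict_engine_exception`; the function name `split_bulk_errors` and the sample response are hypothetical.

```python
import logging

logger = logging.getLogger("create_stage_index_sketch")


def split_bulk_errors(bulk_response: dict) -> tuple[list[str], list[str]]:
    """Separate 'document already exists' conflicts from all other errors.

    Returns (ids_that_already_exist, other_error_messages).
    """
    already_exists: list[str] = []
    other_errors: list[str] = []
    for item in bulk_response.get("items", []):
        # Each item is keyed by its action name, e.g. {"create": {...}}.
        for result in item.values():
            error = result.get("error")
            if not error:
                continue
            is_conflict = (
                result.get("status") == 409
                and error.get("type") == "version_conflict_engine_exception"
            )
            if is_conflict:
                already_exists.append(result.get("_id"))
            else:
                other_errors.append(f"[{result.get('_id')}]: {error.get('reason')}")
    return already_exists, other_errors


# Hypothetical bulk response with one conflict and one success.
response = {
    "errors": True,
    "items": [
        {"create": {"_id": "ark:/81235/d8gt5ff52", "status": 409, "error": {
            "type": "version_conflict_engine_exception",
            "reason": "version conflict, document already exists",
        }}},
        {"create": {"_id": "ark:/81235/d8b27q040", "status": 201}},
    ],
}

existing, failures = split_bulk_errors(response)
for doc_id in existing:
    logger.warning("document already exists; not creating. (%s)", doc_id)
if failures:
    raise Exception(f"{len(failures)} errors in bulk indexing: {failures}")
```

With handling along these lines, the task would succeed for the 12 conflicting records while still failing on any unrelated indexing error.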

christinklez commented 6 months ago

Yes! It's actually still churning through content_harvesting right now. I'll update you once it makes it to create_stage_index. Thank you!!

christinklez commented 6 months ago

@barbarahui This collection made it all the way through and is now on -stage: https://calisphere-stage.cdlib.org/collections/26713/

Looking at the `create_stage_index` log, it doesn't have any conflicting ID messages anymore: https://7a8067cb-3b99-477e-a883-7e311175a9b4.c3.us-west-2.airflow.amazonaws.com/log?dag_id=harvest_collection&task_id=create_stage_index&execution_date=2024-05-09T21%3A10%3A23%2B00%3A00

I think this is fine, since I didn't find any ARK ID conflicts in the Nuxeo records themselves.