Closed barbarahui closed 6 months ago
Info on the duplicate records:
26976 asset-library/UCM/Merced County Historical Society (200 items) --> has an updated filepath to asset-library/UCM/Merced County Historical Society/Merced Falls
27358 /asset-library/UCM/Merced County Historical Society/Sheet Music Collection (46 items)
These are two distinct collections. These ARK conflicts should be resolved with a reharvest to the new single index.
harvest extra data filepath in Nuxeo to asset-library/UCM/Merced County Historical Society/Merced Falls
2 records with the same ark: ark:/13030/c8445k8n
26867 http://www.luna.blackgold.org/luna/servlet/oai blackgold~1~1
26866 http://www.luna.blackgold.org/luna/servlet/oai blackgold~2~2
28121 https://digital.library.ucla.edu/catalog/oai metadataPrefix=oai_dpla&set=member_of_collection_ids_ssim:q0v2f41z-89112
28125 (13 records) https://digital.library.ucla.edu/catalog/oai metadataPrefix=oai_dpla&set=member_of_collection_ids_ssim:s7nf2000zz-89112
28121 (246 records) https://digital.library.ucla.edu/catalog/oai metadataPrefix=oai_dpla&set=member_of_collection_ids_ssim:q0v2f41z-89112
27768 (5 records) https://digital.library.ucla.edu/catalog/oai metadataPrefix=oai_dpla&set=member_of_collection_ids_ssim:k13t1n-89112
create_stage_index will print a warning. Harvest operators will review the logs to check for warnings, review item counts to check for any item loss, and report any issues to data providers as needed.
These look like records with the same ark:
27926 http://lib-metadata.ucsd.edu/solr/blacklight collections_tesim:bb2022061d
27323 http://lib-metadata.ucsd.edu/solr/blacklight/ collections_tesim:bb8605360n
Note: Example of a record that is part of both sets: https://library.ucsd.edu/dc/object/bb1716281n
26176 http://lib-metadata.ucsd.edu/solr/blacklight/ collections_tesim:bb13322220
26900 http://lib-metadata.ucsd.edu/solr/blacklight/ collections_tesim:bb8228040z
create_stage_index will print a warning. Harvest operators will review the logs to check for warnings, review item counts to check for any item loss, and report any issues to data providers as needed.
@christinklez Can you try re-running the create_stage_index task for collection 28121?
@barbarahui Re-ran only the create_stage_index task, resulting in a red square with this message:
[2024-05-09, 22:02:44 UTC] {{logging_mixin.py:150}} INFO - deleted records with collection_id `28121` from index `rikolti-stg-combined-20240508151215`
[2024-05-09, 22:02:45 UTC] {{logging_mixin.py:150}} INFO - WARNING - document already exists; not creating.
{'create': {'_index': 'rikolti-stg-combined-20240508151215', '_id': 'ark:/21198/zz000100hh', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[ark:/21198/zz000100hh]: version conflict, document already exists (current version [2])', 'index': 'rikolti-stg-combined-20240508151215', 'shard': '0', 'index_uuid': '11KZILPMsrT8eb863fOrk8-Q'}}}
[2024-05-09, 22:02:45 UTC] {{logging_mixin.py:150}} INFO - ('[ark:/21198/zz000100hh]: version conflict, document already exists (current '
'version [2])')
[2024-05-09, 22:02:45 UTC] {{logging_mixin.py:150}} INFO - WARNING - document already exists; not creating.
{'create': {'_index': 'rikolti-stg-combined-20240508151215', '_id': 'ark:/21198/zz000100j1', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[ark:/21198/zz000100j1]: version conflict, document already exists (current version [2])', 'index': 'rikolti-stg-combined-20240508151215', 'shard': '0', 'index_uuid': '11KZILPMsrT8eb863fOrk8-Q'}}}
[2024-05-09, 22:02:45 UTC] {{logging_mixin.py:150}} INFO - ('[ark:/21198/zz000100j1]: version conflict, document already exists (current '
'version [2])')
[2024-05-09, 22:02:45 UTC] {{logging_mixin.py:150}} INFO - WARNING - document already exists; not creating.
{'create': {'_index': 'rikolti-stg-combined-20240508151215', '_id': 'ark:/21198/zz0000ztdp', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[ark:/21198/zz0000ztdp]: version conflict, document already exists (current version [2])', 'index': 'rikolti-stg-combined-20240508151215', 'shard': '0', 'index_uuid': '11KZILPMsrT8eb863fOrk8-Q'}}}
[2024-05-09, 22:02:45 UTC] {{logging_mixin.py:150}} INFO - ('[ark:/21198/zz0000ztdp]: version conflict, document already exists (current '
'version [2])')
[2024-05-09, 22:02:45 UTC] {{logging_mixin.py:150}} INFO - WARNING - document already exists; not creating.
{'create': {'_index': 'rikolti-stg-combined-20240508151215', '_id': 'ark:/21198/zz0000zt6k', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[ark:/21198/zz0000zt6k]: version conflict, document already exists (current version [2])', 'index': 'rikolti-stg-combined-20240508151215', 'shard': '0', 'index_uuid': '11KZILPMsrT8eb863fOrk8-Q'}}}
[2024-05-09, 22:02:45 UTC] {{logging_mixin.py:150}} INFO - ('[ark:/21198/zz0000zt6k]: version conflict, document already exists (current '
'version [2])')
[2024-05-09, 22:02:45 UTC] {{logging_mixin.py:150}} INFO - WARNING - document already exists; not creating.
{'create': {'_index': 'rikolti-stg-combined-20240508151215', '_id': 'ark:/21198/zz0000zt73', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[ark:/21198/zz0000zt73]: version conflict, document already exists (current version [2])', 'index': 'rikolti-stg-combined-20240508151215', 'shard': '0', 'index_uuid': '11KZILPMsrT8eb863fOrk8-Q'}}}
[2024-05-09, 22:02:45 UTC] {{logging_mixin.py:150}} INFO - ('[ark:/21198/zz0000zt73]: version conflict, document already exists (current '
'version [2])')
[2024-05-09, 22:02:45 UTC] {{logging_mixin.py:150}} INFO - WARNING - document already exists; not creating.
{'create': {'_index': 'rikolti-stg-combined-20240508151215', '_id': 'ark:/21198/zz0000ztqb', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[ark:/21198/zz0000ztqb]: version conflict, document already exists (current version [2])', 'index': 'rikolti-stg-combined-20240508151215', 'shard': '0', 'index_uuid': '11KZILPMsrT8eb863fOrk8-Q'}}}
[2024-05-09, 22:02:45 UTC] {{logging_mixin.py:150}} INFO - ('[ark:/21198/zz0000ztqb]: version conflict, document already exists (current '
'version [2])')
[2024-05-09, 22:02:45 UTC] {{logging_mixin.py:150}} INFO - WARNING - document already exists; not creating.
{'create': {'_index': 'rikolti-stg-combined-20240508151215', '_id': 'ark:/21198/zz0000ztrv', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[ark:/21198/zz0000ztrv]: version conflict, document already exists (current version [2])', 'index': 'rikolti-stg-combined-20240508151215', 'shard': '0', 'index_uuid': '11KZILPMsrT8eb863fOrk8-Q'}}}
[2024-05-09, 22:02:45 UTC] {{logging_mixin.py:150}} INFO - ('[ark:/21198/zz0000ztrv]: version conflict, document already exists (current '
'version [2])')
[2024-05-09, 22:02:45 UTC] {{logging_mixin.py:150}} INFO - WARNING - document already exists; not creating.
{'create': {'_index': 'rikolti-stg-combined-20240508151215', '_id': 'ark:/21198/zz0000ztpt', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[ark:/21198/zz0000ztpt]: version conflict, document already exists (current version [2])', 'index': 'rikolti-stg-combined-20240508151215', 'shard': '0', 'index_uuid': '11KZILPMsrT8eb863fOrk8-Q'}}}
[2024-05-09, 22:02:45 UTC] {{logging_mixin.py:150}} INFO - ('[ark:/21198/zz0000ztpt]: version conflict, document already exists (current '
'version [2])')
[2024-05-09, 22:02:45 UTC] {{logging_mixin.py:150}} INFO - WARNING - document already exists; not creating.
{'create': {'_index': 'rikolti-stg-combined-20240508151215', '_id': 'ark:/21198/zz0000zt94', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[ark:/21198/zz0000zt94]: version conflict, document already exists (current version [2])', 'index': 'rikolti-stg-combined-20240508151215', 'shard': '0', 'index_uuid': '11KZILPMsrT8eb863fOrk8-Q'}}}
[2024-05-09, 22:02:45 UTC] {{logging_mixin.py:150}} INFO - ('[ark:/21198/zz0000zt94]: version conflict, document already exists (current '
'version [2])')
[2024-05-09, 22:02:45 UTC] {{logging_mixin.py:150}} INFO - WARNING - document already exists; not creating.
{'create': {'_index': 'rikolti-stg-combined-20240508151215', '_id': 'ark:/21198/zz0000zt8m', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[ark:/21198/zz0000zt8m]: version conflict, document already exists (current version [2])', 'index': 'rikolti-stg-combined-20240508151215', 'shard': '0', 'index_uuid': '11KZILPMsrT8eb863fOrk8-Q'}}}
[2024-05-09, 22:02:45 UTC] {{logging_mixin.py:150}} INFO - ('[ark:/21198/zz0000zt8m]: version conflict, document already exists (current '
'version [2])')
[2024-05-09, 22:02:45 UTC] {{logging_mixin.py:150}} INFO - WARNING - document already exists; not creating.
{'create': {'_index': 'rikolti-stg-combined-20240508151215', '_id': 'ark:/21198/zz0000zv1g', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[ark:/21198/zz0000zv1g]: version conflict, document already exists (current version [2])', 'index': 'rikolti-stg-combined-20240508151215', 'shard': '0', 'index_uuid': '11KZILPMsrT8eb863fOrk8-Q'}}}
[2024-05-09, 22:02:45 UTC] {{logging_mixin.py:150}} INFO - ('[ark:/21198/zz0000zv1g]: version conflict, document already exists (current '
'version [2])')
[2024-05-09, 22:02:45 UTC] {{logging_mixin.py:150}} INFO - WARNING - document already exists; not creating.
{'create': {'_index': 'rikolti-stg-combined-20240508151215', '_id': 'ark:/21198/zz0000ztsc', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[ark:/21198/zz0000ztsc]: version conflict, document already exists (current version [2])', 'index': 'rikolti-stg-combined-20240508151215', 'shard': '0', 'index_uuid': '11KZILPMsrT8eb863fOrk8-Q'}}}
[2024-05-09, 22:02:45 UTC] {{logging_mixin.py:150}} INFO - ('[ark:/21198/zz0000ztsc]: version conflict, document already exists (current '
'version [2])')
[2024-05-09, 22:02:45 UTC] {{logging_mixin.py:150}} INFO - WARNING - document already exists; not creating.
{'create': {'_index': 'rikolti-stg-combined-20240508151215', '_id': 'ark:/21198/zz0000zt52', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[ark:/21198/zz0000zt52]: version conflict, document already exists (current version [2])', 'index': 'rikolti-stg-combined-20240508151215', 'shard': '0', 'index_uuid': '11KZILPMsrT8eb863fOrk8-Q'}}}
[2024-05-09, 22:02:45 UTC] {{logging_mixin.py:150}} INFO - ('[ark:/21198/zz0000zt52]: version conflict, document already exists (current '
'version [2])')
[2024-05-09, 22:02:45 UTC] {{taskinstance.py:1824}} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/decorators/base.py", line 220, in execute
    return_value = super().execute(context)
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/operators/python.py", line 181, in execute
    return_value = self.execute_callable()
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/operators/python.py", line 198, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
  File "/usr/local/airflow/dags/rikolti/dags/shared_tasks/indexing_tasks.py", line 61, in update_stage_index_for_collection_task
    raise e
  File "/usr/local/airflow/dags/rikolti/dags/shared_tasks/indexing_tasks.py", line 58, in update_stage_index_for_collection_task
    update_stage_index_for_collection(collection_id, version_pages)
  File "/usr/local/airflow/dags/rikolti/record_indexer/update_stage_index.py", line 17, in update_stage_index_for_collection
    add_page(version_page, index)
  File "/usr/local/airflow/dags/rikolti/record_indexer/add_page_to_index.py", line 105, in add_page
    bulk_add(records, index)
  File "/usr/local/airflow/dags/rikolti/record_indexer/add_page_to_index.py", line 43, in bulk_add
    raise(
Exception: 0 errors in bulk indexing 25 records: []
[2024-05-09, 22:02:45 UTC] {{taskinstance.py:1345}} INFO - Marking task as FAILED. dag_id=harvest_collection, task_id=create_stage_index, execution_date=20240509T202553, start_date=20240509T220243, end_date=20240509T220245
[2024-05-09, 22:02:45 UTC] {{logging_mixin.py:150}} INFO - Message sent to SNS with Message ID: fe9a5ca7-3486-51a0-9307-02107ec3ec29
[2024-05-09, 22:02:45 UTC] {{standard_task_runner.py:104}} ERROR - Failed to execute job 129040 for task create_stage_index (0 errors in bulk indexing 25 records: []; 187)
[2024-05-09, 22:02:45 UTC] {{local_task_job_runner.py:225}} INFO - Task exited with return code 1
[2024-05-09, 22:02:45 UTC] {{taskinstance.py:2653}} INFO - 0 downstream tasks scheduled from follow-on schedule check
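The failed run above logs only 409 conflicts, yet the task still raises with `0 errors in bulk indexing 25 records: []`. One possible reading is that the raise was triggered by the bulk response's top-level `errors` flag rather than by the filtered list of non-conflict errors. The following is a minimal sketch of the intended behavior, not Rikolti's actual `bulk_add`: it assumes only the standard OpenSearch bulk response shape (an `items` list whose entries carry `_id`, `status`, and an optional `error`), and the helper names `split_bulk_errors` and `check_bulk_response` are hypothetical.

```python
def split_bulk_errors(bulk_response):
    """Separate 409 version conflicts from all other failed bulk items."""
    conflicts, failures = [], []
    for item in bulk_response.get("items", []):
        # Each item is a one-key dict named after the action ("create", "index", ...).
        for result in item.values():
            if "error" not in result:
                continue
            if result.get("status") == 409:
                conflicts.append(result["_id"])
            else:
                failures.append(result)
    return conflicts, failures


def check_bulk_response(bulk_response, record_count):
    """Warn on duplicates; raise only when there are non-conflict failures."""
    conflicts, failures = split_bulk_errors(bulk_response)
    for ark in conflicts:
        print(f"WARNING - document already exists; not creating: {ark}")
    if failures:
        raise RuntimeError(
            f"{len(failures)} errors in bulk indexing "
            f"{record_count} records: {failures}"
        )
    return conflicts
```

With this shape, a page whose only problems are `version_conflict_engine_exception` items produces warnings and completes, which matches the behavior seen in the later successful run.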
@christinklez can you try one more time?
Done! Green squares, looks like it went through!
Here's what the log says, just as an fyi:
[2024-05-09, 22:30:27 UTC] {{logging_mixin.py:150}} INFO - WARNING - document already exists; not creating.
{'create': {'_index': 'rikolti-stg-combined-20240508151215', '_id': 'ark:/21198/zz000100hh', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[ark:/21198/zz000100hh]: version conflict, document already exists (current version [2])', 'index': 'rikolti-stg-combined-20240508151215', 'shard': '0', 'index_uuid': '11KZILPMsrT8eb863fOrk8-Q'}}}
[2024-05-09, 22:30:27 UTC] {{logging_mixin.py:150}} INFO - WARNING - document already exists; not creating.
{'create': {'_index': 'rikolti-stg-combined-20240508151215', '_id': 'ark:/21198/zz000100j1', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[ark:/21198/zz000100j1]: version conflict, document already exists (current version [2])', 'index': 'rikolti-stg-combined-20240508151215', 'shard': '0', 'index_uuid': '11KZILPMsrT8eb863fOrk8-Q'}}}
[2024-05-09, 22:30:27 UTC] {{logging_mixin.py:150}} INFO - WARNING - document already exists; not creating.
{'create': {'_index': 'rikolti-stg-combined-20240508151215', '_id': 'ark:/21198/zz0000ztdp', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[ark:/21198/zz0000ztdp]: version conflict, document already exists (current version [2])', 'index': 'rikolti-stg-combined-20240508151215', 'shard': '0', 'index_uuid': '11KZILPMsrT8eb863fOrk8-Q'}}}
[2024-05-09, 22:30:27 UTC] {{logging_mixin.py:150}} INFO - WARNING - document already exists; not creating.
{'create': {'_index': 'rikolti-stg-combined-20240508151215', '_id': 'ark:/21198/zz0000zt6k', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[ark:/21198/zz0000zt6k]: version conflict, document already exists (current version [2])', 'index': 'rikolti-stg-combined-20240508151215', 'shard': '0', 'index_uuid': '11KZILPMsrT8eb863fOrk8-Q'}}}
[2024-05-09, 22:30:27 UTC] {{logging_mixin.py:150}} INFO - WARNING - document already exists; not creating.
{'create': {'_index': 'rikolti-stg-combined-20240508151215', '_id': 'ark:/21198/zz0000zt73', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[ark:/21198/zz0000zt73]: version conflict, document already exists (current version [2])', 'index': 'rikolti-stg-combined-20240508151215', 'shard': '0', 'index_uuid': '11KZILPMsrT8eb863fOrk8-Q'}}}
[2024-05-09, 22:30:27 UTC] {{logging_mixin.py:150}} INFO - WARNING - document already exists; not creating.
{'create': {'_index': 'rikolti-stg-combined-20240508151215', '_id': 'ark:/21198/zz0000ztqb', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[ark:/21198/zz0000ztqb]: version conflict, document already exists (current version [2])', 'index': 'rikolti-stg-combined-20240508151215', 'shard': '0', 'index_uuid': '11KZILPMsrT8eb863fOrk8-Q'}}}
[2024-05-09, 22:30:27 UTC] {{logging_mixin.py:150}} INFO - WARNING - document already exists; not creating.
{'create': {'_index': 'rikolti-stg-combined-20240508151215', '_id': 'ark:/21198/zz0000ztrv', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[ark:/21198/zz0000ztrv]: version conflict, document already exists (current version [2])', 'index': 'rikolti-stg-combined-20240508151215', 'shard': '0', 'index_uuid': '11KZILPMsrT8eb863fOrk8-Q'}}}
[2024-05-09, 22:30:27 UTC] {{logging_mixin.py:150}} INFO - WARNING - document already exists; not creating.
{'create': {'_index': 'rikolti-stg-combined-20240508151215', '_id': 'ark:/21198/zz0000ztpt', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[ark:/21198/zz0000ztpt]: version conflict, document already exists (current version [2])', 'index': 'rikolti-stg-combined-20240508151215', 'shard': '0', 'index_uuid': '11KZILPMsrT8eb863fOrk8-Q'}}}
[2024-05-09, 22:30:27 UTC] {{logging_mixin.py:150}} INFO - WARNING - document already exists; not creating.
{'create': {'_index': 'rikolti-stg-combined-20240508151215', '_id': 'ark:/21198/zz0000zt94', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[ark:/21198/zz0000zt94]: version conflict, document already exists (current version [2])', 'index': 'rikolti-stg-combined-20240508151215', 'shard': '0', 'index_uuid': '11KZILPMsrT8eb863fOrk8-Q'}}}
[2024-05-09, 22:30:27 UTC] {{logging_mixin.py:150}} INFO - WARNING - document already exists; not creating.
{'create': {'_index': 'rikolti-stg-combined-20240508151215', '_id': 'ark:/21198/zz0000zt8m', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[ark:/21198/zz0000zt8m]: version conflict, document already exists (current version [2])', 'index': 'rikolti-stg-combined-20240508151215', 'shard': '0', 'index_uuid': '11KZILPMsrT8eb863fOrk8-Q'}}}
[2024-05-09, 22:30:27 UTC] {{logging_mixin.py:150}} INFO - WARNING - document already exists; not creating.
{'create': {'_index': 'rikolti-stg-combined-20240508151215', '_id': 'ark:/21198/zz0000zv1g', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[ark:/21198/zz0000zv1g]: version conflict, document already exists (current version [2])', 'index': 'rikolti-stg-combined-20240508151215', 'shard': '0', 'index_uuid': '11KZILPMsrT8eb863fOrk8-Q'}}}
[2024-05-09, 22:30:27 UTC] {{logging_mixin.py:150}} INFO - WARNING - document already exists; not creating.
{'create': {'_index': 'rikolti-stg-combined-20240508151215', '_id': 'ark:/21198/zz0000ztsc', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[ark:/21198/zz0000ztsc]: version conflict, document already exists (current version [2])', 'index': 'rikolti-stg-combined-20240508151215', 'shard': '0', 'index_uuid': '11KZILPMsrT8eb863fOrk8-Q'}}}
[2024-05-09, 22:30:27 UTC] {{logging_mixin.py:150}} INFO - WARNING - document already exists; not creating.
{'create': {'_index': 'rikolti-stg-combined-20240508151215', '_id': 'ark:/21198/zz0000zt52', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[ark:/21198/zz0000zt52]: version conflict, document already exists (current version [2])', 'index': 'rikolti-stg-combined-20240508151215', 'shard': '0', 'index_uuid': '11KZILPMsrT8eb863fOrk8-Q'}}}
[2024-05-09, 22:30:27 UTC] {{logging_mixin.py:150}} INFO - added 25 records to index `rikolti-stg-combined-20240508151215` from page `28121/vernacular_metadata_2024-05-09T20:26:04/mapped_metadata_2024-05-09T20:26:19/with_content_urls_2024-05-09T20:26:36/data/0.jsonl`
[2024-05-09, 22:30:27 UTC] {{logging_mixin.py:150}} INFO - all 25 records had is_shown_at field removed
[2024-05-09, 22:30:27 UTC] {{logging_mixin.py:150}} INFO - all 25 records had is_shown_by field removed
[2024-05-09, 22:30:27 UTC] {{logging_mixin.py:150}} INFO - all 25 records had item_count field removed
[2024-05-09, 22:30:27 UTC] {{logging_mixin.py:150}} INFO - all 25 records had media_source field removed
[2024-05-09, 22:30:27 UTC] {{logging_mixin.py:150}} INFO - all 25 records had thumbnail_source field removed
[2024-05-09, 22:30:27 UTC] {{logging_mixin.py:150}} INFO - added 25 records to index `rikolti-stg-combined-20240508151215` from page `28121/vernacular_metadata_2024-05-09T20:26:04/mapped_metadata_2024-05-09T20:26:19/with_content_urls_2024-05-09T20:26:36/data/1.jsonl`
[2024-05-09, 22:30:27 UTC] {{logging_mixin.py:150}} INFO - all 25 records had is_shown_at field removed
[2024-05-09, 22:30:27 UTC] {{logging_mixin.py:150}} INFO - all 25 records had is_shown_by field removed
[2024-05-09, 22:30:27 UTC] {{logging_mixin.py:150}} INFO - all 25 records had item_count field removed
[2024-05-09, 22:30:27 UTC] {{logging_mixin.py:150}} INFO - all 25 records had media_source field removed
[2024-05-09, 22:30:27 UTC] {{logging_mixin.py:150}} INFO - all 25 records had thumbnail_source field removed
[2024-05-09, 22:30:28 UTC] {{logging_mixin.py:150}} INFO - WARNING - document already exists; not creating.
{'create': {'_index': 'rikolti-stg-combined-20240508151215', '_id': 'ark:/21198/n1b317', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[ark:/21198/n1b317]: version conflict, document already exists (current version [2])', 'index': 'rikolti-stg-combined-20240508151215', 'shard': '0', 'index_uuid': '11KZILPMsrT8eb863fOrk8-Q'}}}
[2024-05-09, 22:30:28 UTC] {{logging_mixin.py:150}} INFO - WARNING - document already exists; not creating.
{'create': {'_index': 'rikolti-stg-combined-20240508151215', '_id': 'ark:/21198/n1fs6z', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[ark:/21198/n1fs6z]: version conflict, document already exists (current version [2])', 'index': 'rikolti-stg-combined-20240508151215', 'shard': '0', 'index_uuid': '11KZILPMsrT8eb863fOrk8-Q'}}}
[2024-05-09, 22:30:28 UTC] {{logging_mixin.py:150}} INFO - WARNING - document already exists; not creating.
{'create': {'_index': 'rikolti-stg-combined-20240508151215', '_id': 'ark:/21198/n16c9s', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[ark:/21198/n16c9s]: version conflict, document already exists (current version [2])', 'index': 'rikolti-stg-combined-20240508151215', 'shard': '0', 'index_uuid': '11KZILPMsrT8eb863fOrk8-Q'}}}
[2024-05-09, 22:30:28 UTC] {{logging_mixin.py:150}} INFO - WARNING - document already exists; not creating.
{'create': {'_index': 'rikolti-stg-combined-20240508151215', '_id': 'ark:/21198/n1xw4k', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[ark:/21198/n1xw4k]: version conflict, document already exists (current version [2])', 'index': 'rikolti-stg-combined-20240508151215', 'shard': '0', 'index_uuid': '11KZILPMsrT8eb863fOrk8-Q'}}}
[2024-05-09, 22:30:28 UTC] {{logging_mixin.py:150}} INFO - WARNING - document already exists; not creating.
{'create': {'_index': 'rikolti-stg-combined-20240508151215', '_id': 'ark:/21198/n12k6g', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[ark:/21198/n12k6g]: version conflict, document already exists (current version [2])', 'index': 'rikolti-stg-combined-20240508151215', 'shard': '0', 'index_uuid': '11KZILPMsrT8eb863fOrk8-Q'}}}
[2024-05-09, 22:30:28 UTC] {{logging_mixin.py:150}} INFO - added 9 records to index `rikolti-stg-combined-20240508151215` from page `28121/vernacular_metadata_2024-05-09T20:26:04/mapped_metadata_2024-05-09T20:26:19/with_content_urls_2024-05-09T20:26:36/data/10.jsonl`
[2024-05-09, 22:30:28 UTC] {{logging_mixin.py:150}} INFO - all 9 records had is_shown_at field removed
[2024-05-09, 22:30:28 UTC] {{logging_mixin.py:150}} INFO - all 9 records had is_shown_by field removed
[2024-05-09, 22:30:28 UTC] {{logging_mixin.py:150}} INFO - all 9 records had item_count field removed
[2024-05-09, 22:30:28 UTC] {{logging_mixin.py:150}} INFO - all 9 records had media_source field removed
[2024-05-09, 22:30:28 UTC] {{logging_mixin.py:150}} INFO - all 9 records had thumbnail_source field removed
[2024-05-09, 22:30:29 UTC] {{logging_mixin.py:150}} INFO - added 25 records to index `rikolti-stg-combined-20240508151215` from page `28121/vernacular_metadata_2024-05-09T20:26:04/mapped_metadata_2024-05-09T20:26:19/with_content_urls_2024-05-09T20:26:36/data/2.jsonl`
[2024-05-09, 22:30:29 UTC] {{logging_mixin.py:150}} INFO - all 25 records had is_shown_at field removed
[2024-05-09, 22:30:29 UTC] {{logging_mixin.py:150}} INFO - all 25 records had is_shown_by field removed
[2024-05-09, 22:30:29 UTC] {{logging_mixin.py:150}} INFO - all 25 records had item_count field removed
[2024-05-09, 22:30:29 UTC] {{logging_mixin.py:150}} INFO - all 25 records had media_source field removed
[2024-05-09, 22:30:29 UTC] {{logging_mixin.py:150}} INFO - all 25 records had thumbnail_source field removed
[2024-05-09, 22:30:29 UTC] {{logging_mixin.py:150}} INFO - added 25 records to index `rikolti-stg-combined-20240508151215` from page `28121/vernacular_metadata_2024-05-09T20:26:04/mapped_metadata_2024-05-09T20:26:19/with_content_urls_2024-05-09T20:26:36/data/3.jsonl`
[2024-05-09, 22:30:29 UTC] {{logging_mixin.py:150}} INFO - all 25 records had is_shown_at field removed
[2024-05-09, 22:30:29 UTC] {{logging_mixin.py:150}} INFO - all 25 records had is_shown_by field removed
[2024-05-09, 22:30:29 UTC] {{logging_mixin.py:150}} INFO - all 25 records had item_count field removed
[2024-05-09, 22:30:29 UTC] {{logging_mixin.py:150}} INFO - all 25 records had media_source field removed
[2024-05-09, 22:30:29 UTC] {{logging_mixin.py:150}} INFO - all 25 records had thumbnail_source field removed
[2024-05-09, 22:30:30 UTC] {{logging_mixin.py:150}} INFO - added 25 records to index `rikolti-stg-combined-20240508151215` from page `28121/vernacular_metadata_2024-05-09T20:26:04/mapped_metadata_2024-05-09T20:26:19/with_content_urls_2024-05-09T20:26:36/data/4.jsonl`
[2024-05-09, 22:30:30 UTC] {{logging_mixin.py:150}} INFO - all 25 records had is_shown_at field removed
[2024-05-09, 22:30:30 UTC] {{logging_mixin.py:150}} INFO - all 25 records had is_shown_by field removed
[2024-05-09, 22:30:30 UTC] {{logging_mixin.py:150}} INFO - all 25 records had item_count field removed
[2024-05-09, 22:30:30 UTC] {{logging_mixin.py:150}} INFO - all 25 records had media_source field removed
[2024-05-09, 22:30:30 UTC] {{logging_mixin.py:150}} INFO - all 25 records had thumbnail_source field removed
[2024-05-09, 22:30:31 UTC] {{logging_mixin.py:150}} INFO - added 25 records to index `rikolti-stg-combined-20240508151215` from page `28121/vernacular_metadata_2024-05-09T20:26:04/mapped_metadata_2024-05-09T20:26:19/with_content_urls_2024-05-09T20:26:36/data/5.jsonl`
[2024-05-09, 22:30:31 UTC] {{logging_mixin.py:150}} INFO - all 25 records had is_shown_at field removed
[2024-05-09, 22:30:31 UTC] {{logging_mixin.py:150}} INFO - all 25 records had is_shown_by field removed
[2024-05-09, 22:30:31 UTC] {{logging_mixin.py:150}} INFO - all 25 records had item_count field removed
[2024-05-09, 22:30:31 UTC] {{logging_mixin.py:150}} INFO - all 25 records had media_source field removed
[2024-05-09, 22:30:31 UTC] {{logging_mixin.py:150}} INFO - all 25 records had thumbnail_source field removed
[2024-05-09, 22:30:31 UTC] {{logging_mixin.py:150}} INFO - added 25 records to index `rikolti-stg-combined-20240508151215` from page `28121/vernacular_metadata_2024-05-09T20:26:04/mapped_metadata_2024-05-09T20:26:19/with_content_urls_2024-05-09T20:26:36/data/6.jsonl`
[2024-05-09, 22:30:31 UTC] {{logging_mixin.py:150}} INFO - all 25 records had is_shown_at field removed
[2024-05-09, 22:30:31 UTC] {{logging_mixin.py:150}} INFO - all 25 records had is_shown_by field removed
[2024-05-09, 22:30:31 UTC] {{logging_mixin.py:150}} INFO - all 25 records had item_count field removed
[2024-05-09, 22:30:31 UTC] {{logging_mixin.py:150}} INFO - all 25 records had media_source field removed
[2024-05-09, 22:30:31 UTC] {{logging_mixin.py:150}} INFO - all 25 records had thumbnail_source field removed
[2024-05-09, 22:30:32 UTC] {{logging_mixin.py:150}} INFO - added 25 records to index `rikolti-stg-combined-20240508151215` from page `28121/vernacular_metadata_2024-05-09T20:26:04/mapped_metadata_2024-05-09T20:26:19/with_content_urls_2024-05-09T20:26:36/data/7.jsonl`
[2024-05-09, 22:30:32 UTC] {{logging_mixin.py:150}} INFO - all 25 records had is_shown_at field removed
[2024-05-09, 22:30:32 UTC] {{logging_mixin.py:150}} INFO - all 25 records had is_shown_by field removed
[2024-05-09, 22:30:32 UTC] {{logging_mixin.py:150}} INFO - all 25 records had item_count field removed
[2024-05-09, 22:30:32 UTC] {{logging_mixin.py:150}} INFO - all 25 records had media_source field removed
[2024-05-09, 22:30:32 UTC] {{logging_mixin.py:150}} INFO - all 25 records had thumbnail_source field removed
[2024-05-09, 22:30:32 UTC] {{logging_mixin.py:150}} INFO - added 25 records to index `rikolti-stg-combined-20240508151215` from page `28121/vernacular_metadata_2024-05-09T20:26:04/mapped_metadata_2024-05-09T20:26:19/with_content_urls_2024-05-09T20:26:36/data/8.jsonl`
[2024-05-09, 22:30:32 UTC] {{logging_mixin.py:150}} INFO - all 25 records had is_shown_at field removed
[2024-05-09, 22:30:32 UTC] {{logging_mixin.py:150}} INFO - all 25 records had is_shown_by field removed
[2024-05-09, 22:30:32 UTC] {{logging_mixin.py:150}} INFO - all 25 records had item_count field removed
[2024-05-09, 22:30:32 UTC] {{logging_mixin.py:150}} INFO - all 25 records had media_source field removed
[2024-05-09, 22:30:32 UTC] {{logging_mixin.py:150}} INFO - all 25 records had thumbnail_source field removed
[2024-05-09, 22:30:33 UTC] {{logging_mixin.py:150}} INFO - added 25 records to index `rikolti-stg-combined-20240508151215` from page `28121/vernacular_metadata_2024-05-09T20:26:04/mapped_metadata_2024-05-09T20:26:19/with_content_urls_2024-05-09T20:26:36/data/9.jsonl`
[2024-05-09, 22:30:33 UTC] {{logging_mixin.py:150}} INFO - all 25 records had is_shown_at field removed
[2024-05-09, 22:30:33 UTC] {{logging_mixin.py:150}} INFO - all 25 records had is_shown_by field removed
[2024-05-09, 22:30:33 UTC] {{logging_mixin.py:150}} INFO - all 25 records had item_count field removed
[2024-05-09, 22:30:33 UTC] {{logging_mixin.py:150}} INFO - all 25 records had media_source field removed
[2024-05-09, 22:30:33 UTC] {{logging_mixin.py:150}} INFO - all 25 records had thumbnail_source field removed
[2024-05-09, 22:30:33 UTC] {{logging_mixin.py:150}} INFO -
Review indexed records at: https://rikolti-data.s3.us-west-2.amazonaws.com/index.html#28121/vernacular_metadata_2024-05-09T20:26:04/mapped_metadata_2024-05-09T20:26:19/with_content_urls_2024-05-09T20:26:36/data/
Or on opensearch at: https://search-rikolti-2-xxbcriyfw5iqysaj7p3fhhscae.us-west-2.es.amazonaws.com/_dashboards/app/dev_tools#/console with query:
{
  "query": {
    "bool": {
      "filter": {
        "terms": {
          "collection_url": [
            28121
          ]
        }
      }
    }
  }
}
[2024-05-09, 22:30:33 UTC] {{logging_mixin.py:150}} INFO - Message sent to SNS with Message ID: e0d91176-e06c-5930-ad14-4fc1b0a5ac3f
[2024-05-09, 22:30:33 UTC] {{python.py:183}} INFO - Done. Returned value was: None
[2024-05-09, 22:30:33 UTC] {{taskinstance.py:1345}} INFO - Marking task as SUCCESS. dag_id=harvest_collection, task_id=create_stage_index, execution_date=20240509T202553, start_date=20240509T223025, end_date=20240509T223033
[2024-05-09, 22:30:33 UTC] {{local_task_job_runner.py:225}} INFO - Task exited with return code 0
[2024-05-09, 22:30:33 UTC] {{taskinstance.py:2653}} INFO - 0 downstream tasks scheduled from follow-on schedule check
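For operators reviewing a log like the one above, the conflicting ARKs can be pulled out of the warning lines with a small script. This is a sketch under the assumption that the warning payload is printed exactly as in this log (a Python dict repr containing `'_id': '...', 'status': 409`); the function name `conflicting_arks` is hypothetical.

```python
import re
from collections import Counter

# Matches the `_id` of any bulk item reported with status 409 in the log text.
CONFLICT_ID = re.compile(r"'_id': '(ark:/[^']+)', 'status': 409")


def conflicting_arks(log_text):
    """Count how often each ARK appears in a 409 conflict warning."""
    return Counter(CONFLICT_ID.findall(log_text))
```

Saving the task log to a file and passing its contents to `conflicting_arks` gives a list of ARKs to report to the data provider.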
While combining the existing individual collection indices into one large stage index, I noticed that 99 records are in 2 collections.
46 duplicates:
  26976 asset-library/UCM/Merced County Historical Society (200 items)
  27358 /asset-library/UCM/Merced County Historical Society/Sheet Music Collection (46 items)
  Collection 27358 is a subset of 26976
1 duplicate:
  26867 http://www.luna.blackgold.org/luna/servlet/oai blackgold~1~1
  26866 http://www.luna.blackgold.org/luna/servlet/oai blackgold~2~2
  2 records with the same ark: ark:/13030/c8445k8n
33 duplicates:
  27926 http://lib-metadata.ucsd.edu/solr/blacklight collections_tesim:bb2022061d
  27323 http://lib-metadata.ucsd.edu/solr/blacklight/ collections_tesim:bb8605360n
  These look like records with the same ark
1 duplicate:
  26176 http://lib-metadata.ucsd.edu/solr/blacklight/ collections_tesim:bb13322220
  26900 http://lib-metadata.ucsd.edu/solr/blacklight/ collections_tesim:bb8228040z
13 duplicates:
  28121
  28125
5 duplicates:
  28121
  27768
We will need to resolve this problem so that calisphere-id is unique, and then reharvest these collections. The new combined index contains only one copy of each duplicate record (whichever copy it encountered second).
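To illustrate the two points above, here is a minimal sketch of how cross-collection duplicates could be detected and why only the second copy survives when a later write replaces an earlier one. It works on plain `(collection_id, record_id)` pairs rather than on the real indices; both function names are hypothetical, and the overwrite model is an assumption based on the description above, not a statement about how the combined index was actually built.

```python
from collections import defaultdict


def find_cross_collection_duplicates(records):
    """Map each record id seen in more than one collection to those collections.

    `records` is an iterable of (collection_id, record_id) pairs in the
    order they were encountered.
    """
    seen = defaultdict(list)
    for collection_id, record_id in records:
        if collection_id not in seen[record_id]:
            seen[record_id].append(collection_id)
    return {rid: cols for rid, cols in seen.items() if len(cols) > 1}


def last_write_wins(records):
    """Model an index keyed on record id where a later copy replaces an earlier one."""
    index = {}
    for collection_id, record_id in records:
        index[record_id] = collection_id
    return index
```

For example, feeding in the shared ARK from collections 26867 and 26866 in that order reports it as a duplicate and leaves it attributed to 26866 only.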