ucldc / rikolti

calisphere harvester 2.0
BSD 3-Clause "New" or "Revised" License

duplicate calisphere-ids - 99 records appear in 2 collections #926

Closed barbarahui closed 6 months ago

barbarahui commented 6 months ago

While combining the existing individual collection indices into one large stage index, I noticed that 99 records are in 2 collections.

46 duplicates
26976 asset-library/UCM/Merced County Historical Society (200 items)
27358 /asset-library/UCM/Merced County Historical Society/Sheet Music Collection (46 items)
Collection 27358 is a subset of 26976

1 duplicate
26867 http://www.luna.blackgold.org/luna/servlet/oai blackgold~1~1
26866 http://www.luna.blackgold.org/luna/servlet/oai blackgold~2~2
2 records with the same ark: ark:/13030/c8445k8n

33 duplicates
27926 http://lib-metadata.ucsd.edu/solr/blacklight collections_tesim:bb2022061d
27323 http://lib-metadata.ucsd.edu/solr/blacklight/ collections_tesim:bb8605360n
These look like records with the same ark

1 duplicate
26176 http://lib-metadata.ucsd.edu/solr/blacklight/ collections_tesim:bb13322220
26900 http://lib-metadata.ucsd.edu/solr/blacklight/ collections_tesim:bb8228040z

13 duplicates 28121 28125

5 duplicates 28121 27768

We will need to resolve this so that each calisphere-id is unique, and then reharvest these collections. The new combined index contains only one copy of each duplicate record (whichever copy it encountered second).
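For reference, a check like the one described above could be sketched as follows. This is a minimal illustration, not the script actually used for this issue; the field names `calisphere-id` and `collection_id` and the sample records are assumptions made for the example.

```python
from collections import defaultdict


def find_cross_collection_duplicates(records):
    """Return {calisphere-id: sorted collection ids} for ids seen in 2+ collections."""
    collections_by_id = defaultdict(set)
    for record in records:
        # Field names here are assumed for illustration only.
        collections_by_id[record["calisphere-id"]].add(record["collection_id"])
    return {
        record_id: sorted(collection_ids)
        for record_id, collection_ids in collections_by_id.items()
        if len(collection_ids) > 1
    }


# Illustrative input: one ark shared by two collections, one unique ark.
sample_records = [
    {"calisphere-id": "ark:/13030/c8445k8n", "collection_id": 26867},
    {"calisphere-id": "ark:/13030/c8445k8n", "collection_id": 26866},
    {"calisphere-id": "ark:/21198/zz000100hh", "collection_id": 28121},
]
duplicates = find_cross_collection_duplicates(sample_records)
for record_id, collection_ids in duplicates.items():
    print(record_id, collection_ids)
```

Grouping by id and collecting the collection ids in a set means a record that appears twice within the same collection is not reported; only ids that span more than one collection are.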

barbarahui commented 6 months ago

Info on the duplicate records:

duplicates-info.json

christinklez commented 6 months ago

Notes on UCM's 46 duplicates

26976 asset-library/UCM/Merced County Historical Society (200 items) --> has an updated filepath to asset-library/UCM/Merced County Historical Society/Merced Falls

27358 /asset-library/UCM/Merced County Historical Society/Sheet Music Collection (46 items)

These are two distinct collections. These ARK conflicts should be resolved with a reharvest to the new single index.

To do:

christinklez commented 6 months ago

Notes on BlackGold's 1 duplicate

2 records with the same ark: ark:/13030/c8445k8n

26867 http://www.luna.blackgold.org/luna/servlet/oai blackgold~1~1

26866 http://www.luna.blackgold.org/luna/servlet/oai blackgold~2~2

To do:

christinklez commented 6 months ago

Notes on UCLA's duplicates

13 duplicates

28121 https://digital.library.ucla.edu/catalog/oai metadataPrefix=oai_dpla&set=member_of_collection_ids_ssim:q0v2f41z-89112

28125 (13 records) https://digital.library.ucla.edu/catalog/oai metadataPrefix=oai_dpla&set=member_of_collection_ids_ssim:s7nf2000zz-89112

5 duplicates

28121 (246 records) https://digital.library.ucla.edu/catalog/oai metadataPrefix=oai_dpla&set=member_of_collection_ids_ssim:q0v2f41z-89112

27768 (5 records) https://digital.library.ucla.edu/catalog/oai metadataPrefix=oai_dpla&set=member_of_collection_ids_ssim:k13t1n-89112

To do:

christinklez commented 6 months ago

Notes on UCSD's duplicates

These look like records with the same ark

33 duplicates

27926 http://lib-metadata.ucsd.edu/solr/blacklight collections_tesim:bb2022061d

27323 http://lib-metadata.ucsd.edu/solr/blacklight/ collections_tesim:bb8605360n

Note: Example of a record that is part of both sets: https://library.ucsd.edu/dc/object/bb1716281n

1 duplicate

26176 http://lib-metadata.ucsd.edu/solr/blacklight/ collections_tesim:bb13322220

26900 http://lib-metadata.ucsd.edu/solr/blacklight/ collections_tesim:bb8228040z

To do:

barbarahui commented 6 months ago

@christinklez Can you try re-running the create_stage_index task for collection 28121?

christinklez commented 6 months ago

@barbarahui Re-ran only the create_stage_index task, resulting in a red square with this message:

https://7a8067cb-3b99-477e-a883-7e311175a9b4.c3.us-west-2.airflow.amazonaws.com/log?dag_id=harvest_collection&task_id=create_stage_index&execution_date=2024-05-09T20%3A25%3A53%2B00%3A00

[2024-05-09, 22:02:44 UTC] {{logging_mixin.py:150}} INFO - deleted records with collection_id `28121` from index `rikolti-stg-combined-20240508151215`
[2024-05-09, 22:02:45 UTC] {{logging_mixin.py:150}} INFO - WARNING - document already exists; not creating.
 {'create': {'_index': 'rikolti-stg-combined-20240508151215', '_id': 'ark:/21198/zz000100hh', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[ark:/21198/zz000100hh]: version conflict, document already exists (current version [2])', 'index': 'rikolti-stg-combined-20240508151215', 'shard': '0', 'index_uuid': '11KZILPMsrT8eb863fOrk8-Q'}}}
[2024-05-09, 22:02:45 UTC] {{logging_mixin.py:150}} INFO - ('[ark:/21198/zz000100hh]: version conflict, document already exists (current '
 'version [2])')
[2024-05-09, 22:02:45 UTC] {{logging_mixin.py:150}} INFO - WARNING - document already exists; not creating.
 {'create': {'_index': 'rikolti-stg-combined-20240508151215', '_id': 'ark:/21198/zz000100j1', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[ark:/21198/zz000100j1]: version conflict, document already exists (current version [2])', 'index': 'rikolti-stg-combined-20240508151215', 'shard': '0', 'index_uuid': '11KZILPMsrT8eb863fOrk8-Q'}}}
[2024-05-09, 22:02:45 UTC] {{logging_mixin.py:150}} INFO - ('[ark:/21198/zz000100j1]: version conflict, document already exists (current '
 'version [2])')
[2024-05-09, 22:02:45 UTC] {{logging_mixin.py:150}} INFO - WARNING - document already exists; not creating.
 {'create': {'_index': 'rikolti-stg-combined-20240508151215', '_id': 'ark:/21198/zz0000ztdp', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[ark:/21198/zz0000ztdp]: version conflict, document already exists (current version [2])', 'index': 'rikolti-stg-combined-20240508151215', 'shard': '0', 'index_uuid': '11KZILPMsrT8eb863fOrk8-Q'}}}
[2024-05-09, 22:02:45 UTC] {{logging_mixin.py:150}} INFO - ('[ark:/21198/zz0000ztdp]: version conflict, document already exists (current '
 'version [2])')
[2024-05-09, 22:02:45 UTC] {{logging_mixin.py:150}} INFO - WARNING - document already exists; not creating.
 {'create': {'_index': 'rikolti-stg-combined-20240508151215', '_id': 'ark:/21198/zz0000zt6k', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[ark:/21198/zz0000zt6k]: version conflict, document already exists (current version [2])', 'index': 'rikolti-stg-combined-20240508151215', 'shard': '0', 'index_uuid': '11KZILPMsrT8eb863fOrk8-Q'}}}
[2024-05-09, 22:02:45 UTC] {{logging_mixin.py:150}} INFO - ('[ark:/21198/zz0000zt6k]: version conflict, document already exists (current '
 'version [2])')
[2024-05-09, 22:02:45 UTC] {{logging_mixin.py:150}} INFO - WARNING - document already exists; not creating.
 {'create': {'_index': 'rikolti-stg-combined-20240508151215', '_id': 'ark:/21198/zz0000zt73', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[ark:/21198/zz0000zt73]: version conflict, document already exists (current version [2])', 'index': 'rikolti-stg-combined-20240508151215', 'shard': '0', 'index_uuid': '11KZILPMsrT8eb863fOrk8-Q'}}}
[2024-05-09, 22:02:45 UTC] {{logging_mixin.py:150}} INFO - ('[ark:/21198/zz0000zt73]: version conflict, document already exists (current '
 'version [2])')
[2024-05-09, 22:02:45 UTC] {{logging_mixin.py:150}} INFO - WARNING - document already exists; not creating.
 {'create': {'_index': 'rikolti-stg-combined-20240508151215', '_id': 'ark:/21198/zz0000ztqb', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[ark:/21198/zz0000ztqb]: version conflict, document already exists (current version [2])', 'index': 'rikolti-stg-combined-20240508151215', 'shard': '0', 'index_uuid': '11KZILPMsrT8eb863fOrk8-Q'}}}
[2024-05-09, 22:02:45 UTC] {{logging_mixin.py:150}} INFO - ('[ark:/21198/zz0000ztqb]: version conflict, document already exists (current '
 'version [2])')
[2024-05-09, 22:02:45 UTC] {{logging_mixin.py:150}} INFO - WARNING - document already exists; not creating.
 {'create': {'_index': 'rikolti-stg-combined-20240508151215', '_id': 'ark:/21198/zz0000ztrv', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[ark:/21198/zz0000ztrv]: version conflict, document already exists (current version [2])', 'index': 'rikolti-stg-combined-20240508151215', 'shard': '0', 'index_uuid': '11KZILPMsrT8eb863fOrk8-Q'}}}
[2024-05-09, 22:02:45 UTC] {{logging_mixin.py:150}} INFO - ('[ark:/21198/zz0000ztrv]: version conflict, document already exists (current '
 'version [2])')
[2024-05-09, 22:02:45 UTC] {{logging_mixin.py:150}} INFO - WARNING - document already exists; not creating.
 {'create': {'_index': 'rikolti-stg-combined-20240508151215', '_id': 'ark:/21198/zz0000ztpt', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[ark:/21198/zz0000ztpt]: version conflict, document already exists (current version [2])', 'index': 'rikolti-stg-combined-20240508151215', 'shard': '0', 'index_uuid': '11KZILPMsrT8eb863fOrk8-Q'}}}
[2024-05-09, 22:02:45 UTC] {{logging_mixin.py:150}} INFO - ('[ark:/21198/zz0000ztpt]: version conflict, document already exists (current '
 'version [2])')
[2024-05-09, 22:02:45 UTC] {{logging_mixin.py:150}} INFO - WARNING - document already exists; not creating.
 {'create': {'_index': 'rikolti-stg-combined-20240508151215', '_id': 'ark:/21198/zz0000zt94', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[ark:/21198/zz0000zt94]: version conflict, document already exists (current version [2])', 'index': 'rikolti-stg-combined-20240508151215', 'shard': '0', 'index_uuid': '11KZILPMsrT8eb863fOrk8-Q'}}}
[2024-05-09, 22:02:45 UTC] {{logging_mixin.py:150}} INFO - ('[ark:/21198/zz0000zt94]: version conflict, document already exists (current '
 'version [2])')
[2024-05-09, 22:02:45 UTC] {{logging_mixin.py:150}} INFO - WARNING - document already exists; not creating.
 {'create': {'_index': 'rikolti-stg-combined-20240508151215', '_id': 'ark:/21198/zz0000zt8m', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[ark:/21198/zz0000zt8m]: version conflict, document already exists (current version [2])', 'index': 'rikolti-stg-combined-20240508151215', 'shard': '0', 'index_uuid': '11KZILPMsrT8eb863fOrk8-Q'}}}
[2024-05-09, 22:02:45 UTC] {{logging_mixin.py:150}} INFO - ('[ark:/21198/zz0000zt8m]: version conflict, document already exists (current '
 'version [2])')
[2024-05-09, 22:02:45 UTC] {{logging_mixin.py:150}} INFO - WARNING - document already exists; not creating.
 {'create': {'_index': 'rikolti-stg-combined-20240508151215', '_id': 'ark:/21198/zz0000zv1g', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[ark:/21198/zz0000zv1g]: version conflict, document already exists (current version [2])', 'index': 'rikolti-stg-combined-20240508151215', 'shard': '0', 'index_uuid': '11KZILPMsrT8eb863fOrk8-Q'}}}
[2024-05-09, 22:02:45 UTC] {{logging_mixin.py:150}} INFO - ('[ark:/21198/zz0000zv1g]: version conflict, document already exists (current '
 'version [2])')
[2024-05-09, 22:02:45 UTC] {{logging_mixin.py:150}} INFO - WARNING - document already exists; not creating.
 {'create': {'_index': 'rikolti-stg-combined-20240508151215', '_id': 'ark:/21198/zz0000ztsc', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[ark:/21198/zz0000ztsc]: version conflict, document already exists (current version [2])', 'index': 'rikolti-stg-combined-20240508151215', 'shard': '0', 'index_uuid': '11KZILPMsrT8eb863fOrk8-Q'}}}
[2024-05-09, 22:02:45 UTC] {{logging_mixin.py:150}} INFO - ('[ark:/21198/zz0000ztsc]: version conflict, document already exists (current '
 'version [2])')
[2024-05-09, 22:02:45 UTC] {{logging_mixin.py:150}} INFO - WARNING - document already exists; not creating.
 {'create': {'_index': 'rikolti-stg-combined-20240508151215', '_id': 'ark:/21198/zz0000zt52', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[ark:/21198/zz0000zt52]: version conflict, document already exists (current version [2])', 'index': 'rikolti-stg-combined-20240508151215', 'shard': '0', 'index_uuid': '11KZILPMsrT8eb863fOrk8-Q'}}}
[2024-05-09, 22:02:45 UTC] {{logging_mixin.py:150}} INFO - ('[ark:/21198/zz0000zt52]: version conflict, document already exists (current '
 'version [2])')
[2024-05-09, 22:02:45 UTC] {{taskinstance.py:1824}} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/decorators/base.py", line 220, in execute
    return_value = super().execute(context)
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/operators/python.py", line 181, in execute
    return_value = self.execute_callable()
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/operators/python.py", line 198, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
  File "/usr/local/airflow/dags/rikolti/dags/shared_tasks/indexing_tasks.py", line 61, in update_stage_index_for_collection_task
    raise e
  File "/usr/local/airflow/dags/rikolti/dags/shared_tasks/indexing_tasks.py", line 58, in update_stage_index_for_collection_task
    update_stage_index_for_collection(collection_id, version_pages)
  File "/usr/local/airflow/dags/rikolti/record_indexer/update_stage_index.py", line 17, in update_stage_index_for_collection
    add_page(version_page, index)
  File "/usr/local/airflow/dags/rikolti/record_indexer/add_page_to_index.py", line 105, in add_page
    bulk_add(records, index)
  File "/usr/local/airflow/dags/rikolti/record_indexer/add_page_to_index.py", line 43, in bulk_add
    raise(
Exception: 0 errors in bulk indexing 25 records: []
[2024-05-09, 22:02:45 UTC] {{taskinstance.py:1345}} INFO - Marking task as FAILED. dag_id=harvest_collection, task_id=create_stage_index, execution_date=20240509T202553, start_date=20240509T220243, end_date=20240509T220245
[2024-05-09, 22:02:45 UTC] {{logging_mixin.py:150}} INFO - Message sent to SNS with Message ID: fe9a5ca7-3486-51a0-9307-02107ec3ec29
[2024-05-09, 22:02:45 UTC] {{standard_task_runner.py:104}} ERROR - Failed to execute job 129040 for task create_stage_index (0 errors in bulk indexing 25 records: []; 187)
[2024-05-09, 22:02:45 UTC] {{local_task_job_runner.py:225}} INFO - Task exited with return code 1
[2024-05-09, 22:02:45 UTC] {{taskinstance.py:2653}} INFO - 0 downstream tasks scheduled from follow-on schedule check

barbarahui commented 6 months ago

@christinklez can you try one more time?

christinklez commented 6 months ago

Done! Green squares, looks like it went through!

Here's what the log says, just as an fyi:

[2024-05-09, 22:30:27 UTC] {{logging_mixin.py:150}} INFO - WARNING - document already exists; not creating.
 {'create': {'_index': 'rikolti-stg-combined-20240508151215', '_id': 'ark:/21198/zz000100hh', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[ark:/21198/zz000100hh]: version conflict, document already exists (current version [2])', 'index': 'rikolti-stg-combined-20240508151215', 'shard': '0', 'index_uuid': '11KZILPMsrT8eb863fOrk8-Q'}}}
[2024-05-09, 22:30:27 UTC] {{logging_mixin.py:150}} INFO - WARNING - document already exists; not creating.
 {'create': {'_index': 'rikolti-stg-combined-20240508151215', '_id': 'ark:/21198/zz000100j1', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[ark:/21198/zz000100j1]: version conflict, document already exists (current version [2])', 'index': 'rikolti-stg-combined-20240508151215', 'shard': '0', 'index_uuid': '11KZILPMsrT8eb863fOrk8-Q'}}}
[2024-05-09, 22:30:27 UTC] {{logging_mixin.py:150}} INFO - WARNING - document already exists; not creating.
 {'create': {'_index': 'rikolti-stg-combined-20240508151215', '_id': 'ark:/21198/zz0000ztdp', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[ark:/21198/zz0000ztdp]: version conflict, document already exists (current version [2])', 'index': 'rikolti-stg-combined-20240508151215', 'shard': '0', 'index_uuid': '11KZILPMsrT8eb863fOrk8-Q'}}}
[2024-05-09, 22:30:27 UTC] {{logging_mixin.py:150}} INFO - WARNING - document already exists; not creating.
 {'create': {'_index': 'rikolti-stg-combined-20240508151215', '_id': 'ark:/21198/zz0000zt6k', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[ark:/21198/zz0000zt6k]: version conflict, document already exists (current version [2])', 'index': 'rikolti-stg-combined-20240508151215', 'shard': '0', 'index_uuid': '11KZILPMsrT8eb863fOrk8-Q'}}}
[2024-05-09, 22:30:27 UTC] {{logging_mixin.py:150}} INFO - WARNING - document already exists; not creating.
 {'create': {'_index': 'rikolti-stg-combined-20240508151215', '_id': 'ark:/21198/zz0000zt73', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[ark:/21198/zz0000zt73]: version conflict, document already exists (current version [2])', 'index': 'rikolti-stg-combined-20240508151215', 'shard': '0', 'index_uuid': '11KZILPMsrT8eb863fOrk8-Q'}}}
[2024-05-09, 22:30:27 UTC] {{logging_mixin.py:150}} INFO - WARNING - document already exists; not creating.
 {'create': {'_index': 'rikolti-stg-combined-20240508151215', '_id': 'ark:/21198/zz0000ztqb', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[ark:/21198/zz0000ztqb]: version conflict, document already exists (current version [2])', 'index': 'rikolti-stg-combined-20240508151215', 'shard': '0', 'index_uuid': '11KZILPMsrT8eb863fOrk8-Q'}}}
[2024-05-09, 22:30:27 UTC] {{logging_mixin.py:150}} INFO - WARNING - document already exists; not creating.
 {'create': {'_index': 'rikolti-stg-combined-20240508151215', '_id': 'ark:/21198/zz0000ztrv', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[ark:/21198/zz0000ztrv]: version conflict, document already exists (current version [2])', 'index': 'rikolti-stg-combined-20240508151215', 'shard': '0', 'index_uuid': '11KZILPMsrT8eb863fOrk8-Q'}}}
[2024-05-09, 22:30:27 UTC] {{logging_mixin.py:150}} INFO - WARNING - document already exists; not creating.
 {'create': {'_index': 'rikolti-stg-combined-20240508151215', '_id': 'ark:/21198/zz0000ztpt', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[ark:/21198/zz0000ztpt]: version conflict, document already exists (current version [2])', 'index': 'rikolti-stg-combined-20240508151215', 'shard': '0', 'index_uuid': '11KZILPMsrT8eb863fOrk8-Q'}}}
[2024-05-09, 22:30:27 UTC] {{logging_mixin.py:150}} INFO - WARNING - document already exists; not creating.
 {'create': {'_index': 'rikolti-stg-combined-20240508151215', '_id': 'ark:/21198/zz0000zt94', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[ark:/21198/zz0000zt94]: version conflict, document already exists (current version [2])', 'index': 'rikolti-stg-combined-20240508151215', 'shard': '0', 'index_uuid': '11KZILPMsrT8eb863fOrk8-Q'}}}
[2024-05-09, 22:30:27 UTC] {{logging_mixin.py:150}} INFO - WARNING - document already exists; not creating.
 {'create': {'_index': 'rikolti-stg-combined-20240508151215', '_id': 'ark:/21198/zz0000zt8m', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[ark:/21198/zz0000zt8m]: version conflict, document already exists (current version [2])', 'index': 'rikolti-stg-combined-20240508151215', 'shard': '0', 'index_uuid': '11KZILPMsrT8eb863fOrk8-Q'}}}
[2024-05-09, 22:30:27 UTC] {{logging_mixin.py:150}} INFO - WARNING - document already exists; not creating.
 {'create': {'_index': 'rikolti-stg-combined-20240508151215', '_id': 'ark:/21198/zz0000zv1g', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[ark:/21198/zz0000zv1g]: version conflict, document already exists (current version [2])', 'index': 'rikolti-stg-combined-20240508151215', 'shard': '0', 'index_uuid': '11KZILPMsrT8eb863fOrk8-Q'}}}
[2024-05-09, 22:30:27 UTC] {{logging_mixin.py:150}} INFO - WARNING - document already exists; not creating.
 {'create': {'_index': 'rikolti-stg-combined-20240508151215', '_id': 'ark:/21198/zz0000ztsc', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[ark:/21198/zz0000ztsc]: version conflict, document already exists (current version [2])', 'index': 'rikolti-stg-combined-20240508151215', 'shard': '0', 'index_uuid': '11KZILPMsrT8eb863fOrk8-Q'}}}
[2024-05-09, 22:30:27 UTC] {{logging_mixin.py:150}} INFO - WARNING - document already exists; not creating.
 {'create': {'_index': 'rikolti-stg-combined-20240508151215', '_id': 'ark:/21198/zz0000zt52', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[ark:/21198/zz0000zt52]: version conflict, document already exists (current version [2])', 'index': 'rikolti-stg-combined-20240508151215', 'shard': '0', 'index_uuid': '11KZILPMsrT8eb863fOrk8-Q'}}}
[2024-05-09, 22:30:27 UTC] {{logging_mixin.py:150}} INFO - added 25 records to index `rikolti-stg-combined-20240508151215` from page `28121/vernacular_metadata_2024-05-09T20:26:04/mapped_metadata_2024-05-09T20:26:19/with_content_urls_2024-05-09T20:26:36/data/0.jsonl`
[2024-05-09, 22:30:27 UTC] {{logging_mixin.py:150}} INFO -     all 25 records had is_shown_at field removed
[2024-05-09, 22:30:27 UTC] {{logging_mixin.py:150}} INFO -     all 25 records had is_shown_by field removed
[2024-05-09, 22:30:27 UTC] {{logging_mixin.py:150}} INFO -     all 25 records had item_count field removed
[2024-05-09, 22:30:27 UTC] {{logging_mixin.py:150}} INFO -     all 25 records had media_source field removed
[2024-05-09, 22:30:27 UTC] {{logging_mixin.py:150}} INFO -     all 25 records had thumbnail_source field removed
[2024-05-09, 22:30:27 UTC] {{logging_mixin.py:150}} INFO - added 25 records to index `rikolti-stg-combined-20240508151215` from page `28121/vernacular_metadata_2024-05-09T20:26:04/mapped_metadata_2024-05-09T20:26:19/with_content_urls_2024-05-09T20:26:36/data/1.jsonl`
[2024-05-09, 22:30:27 UTC] {{logging_mixin.py:150}} INFO -     all 25 records had is_shown_at field removed
[2024-05-09, 22:30:27 UTC] {{logging_mixin.py:150}} INFO -     all 25 records had is_shown_by field removed
[2024-05-09, 22:30:27 UTC] {{logging_mixin.py:150}} INFO -     all 25 records had item_count field removed
[2024-05-09, 22:30:27 UTC] {{logging_mixin.py:150}} INFO -     all 25 records had media_source field removed
[2024-05-09, 22:30:27 UTC] {{logging_mixin.py:150}} INFO -     all 25 records had thumbnail_source field removed
[2024-05-09, 22:30:28 UTC] {{logging_mixin.py:150}} INFO - WARNING - document already exists; not creating.
 {'create': {'_index': 'rikolti-stg-combined-20240508151215', '_id': 'ark:/21198/n1b317', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[ark:/21198/n1b317]: version conflict, document already exists (current version [2])', 'index': 'rikolti-stg-combined-20240508151215', 'shard': '0', 'index_uuid': '11KZILPMsrT8eb863fOrk8-Q'}}}
[2024-05-09, 22:30:28 UTC] {{logging_mixin.py:150}} INFO - WARNING - document already exists; not creating.
 {'create': {'_index': 'rikolti-stg-combined-20240508151215', '_id': 'ark:/21198/n1fs6z', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[ark:/21198/n1fs6z]: version conflict, document already exists (current version [2])', 'index': 'rikolti-stg-combined-20240508151215', 'shard': '0', 'index_uuid': '11KZILPMsrT8eb863fOrk8-Q'}}}
[2024-05-09, 22:30:28 UTC] {{logging_mixin.py:150}} INFO - WARNING - document already exists; not creating.
 {'create': {'_index': 'rikolti-stg-combined-20240508151215', '_id': 'ark:/21198/n16c9s', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[ark:/21198/n16c9s]: version conflict, document already exists (current version [2])', 'index': 'rikolti-stg-combined-20240508151215', 'shard': '0', 'index_uuid': '11KZILPMsrT8eb863fOrk8-Q'}}}
[2024-05-09, 22:30:28 UTC] {{logging_mixin.py:150}} INFO - WARNING - document already exists; not creating.
 {'create': {'_index': 'rikolti-stg-combined-20240508151215', '_id': 'ark:/21198/n1xw4k', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[ark:/21198/n1xw4k]: version conflict, document already exists (current version [2])', 'index': 'rikolti-stg-combined-20240508151215', 'shard': '0', 'index_uuid': '11KZILPMsrT8eb863fOrk8-Q'}}}
[2024-05-09, 22:30:28 UTC] {{logging_mixin.py:150}} INFO - WARNING - document already exists; not creating.
 {'create': {'_index': 'rikolti-stg-combined-20240508151215', '_id': 'ark:/21198/n12k6g', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[ark:/21198/n12k6g]: version conflict, document already exists (current version [2])', 'index': 'rikolti-stg-combined-20240508151215', 'shard': '0', 'index_uuid': '11KZILPMsrT8eb863fOrk8-Q'}}}
[2024-05-09, 22:30:28 UTC] {{logging_mixin.py:150}} INFO - added 9 records to index `rikolti-stg-combined-20240508151215` from page `28121/vernacular_metadata_2024-05-09T20:26:04/mapped_metadata_2024-05-09T20:26:19/with_content_urls_2024-05-09T20:26:36/data/10.jsonl`
[2024-05-09, 22:30:28 UTC] {{logging_mixin.py:150}} INFO -     all 9 records had is_shown_at field removed
[2024-05-09, 22:30:28 UTC] {{logging_mixin.py:150}} INFO -     all 9 records had is_shown_by field removed
[2024-05-09, 22:30:28 UTC] {{logging_mixin.py:150}} INFO -     all 9 records had item_count field removed
[2024-05-09, 22:30:28 UTC] {{logging_mixin.py:150}} INFO -     all 9 records had media_source field removed
[2024-05-09, 22:30:28 UTC] {{logging_mixin.py:150}} INFO -     all 9 records had thumbnail_source field removed
[2024-05-09, 22:30:29 UTC] {{logging_mixin.py:150}} INFO - added 25 records to index `rikolti-stg-combined-20240508151215` from page `28121/vernacular_metadata_2024-05-09T20:26:04/mapped_metadata_2024-05-09T20:26:19/with_content_urls_2024-05-09T20:26:36/data/2.jsonl`
[2024-05-09, 22:30:29 UTC] {{logging_mixin.py:150}} INFO -     all 25 records had is_shown_at field removed
[2024-05-09, 22:30:29 UTC] {{logging_mixin.py:150}} INFO -     all 25 records had is_shown_by field removed
[2024-05-09, 22:30:29 UTC] {{logging_mixin.py:150}} INFO -     all 25 records had item_count field removed
[2024-05-09, 22:30:29 UTC] {{logging_mixin.py:150}} INFO -     all 25 records had media_source field removed
[2024-05-09, 22:30:29 UTC] {{logging_mixin.py:150}} INFO -     all 25 records had thumbnail_source field removed
[2024-05-09, 22:30:29 UTC] {{logging_mixin.py:150}} INFO - added 25 records to index `rikolti-stg-combined-20240508151215` from page `28121/vernacular_metadata_2024-05-09T20:26:04/mapped_metadata_2024-05-09T20:26:19/with_content_urls_2024-05-09T20:26:36/data/3.jsonl`
[2024-05-09, 22:30:29 UTC] {{logging_mixin.py:150}} INFO -     all 25 records had is_shown_at field removed
[2024-05-09, 22:30:29 UTC] {{logging_mixin.py:150}} INFO -     all 25 records had is_shown_by field removed
[2024-05-09, 22:30:29 UTC] {{logging_mixin.py:150}} INFO -     all 25 records had item_count field removed
[2024-05-09, 22:30:29 UTC] {{logging_mixin.py:150}} INFO -     all 25 records had media_source field removed
[2024-05-09, 22:30:29 UTC] {{logging_mixin.py:150}} INFO -     all 25 records had thumbnail_source field removed
[2024-05-09, 22:30:30 UTC] {{logging_mixin.py:150}} INFO - added 25 records to index `rikolti-stg-combined-20240508151215` from page `28121/vernacular_metadata_2024-05-09T20:26:04/mapped_metadata_2024-05-09T20:26:19/with_content_urls_2024-05-09T20:26:36/data/4.jsonl`
[2024-05-09, 22:30:30 UTC] {{logging_mixin.py:150}} INFO -     all 25 records had is_shown_at field removed
[2024-05-09, 22:30:30 UTC] {{logging_mixin.py:150}} INFO -     all 25 records had is_shown_by field removed
[2024-05-09, 22:30:30 UTC] {{logging_mixin.py:150}} INFO -     all 25 records had item_count field removed
[2024-05-09, 22:30:30 UTC] {{logging_mixin.py:150}} INFO -     all 25 records had media_source field removed
[2024-05-09, 22:30:30 UTC] {{logging_mixin.py:150}} INFO -     all 25 records had thumbnail_source field removed
[2024-05-09, 22:30:31 UTC] {{logging_mixin.py:150}} INFO - added 25 records to index `rikolti-stg-combined-20240508151215` from page `28121/vernacular_metadata_2024-05-09T20:26:04/mapped_metadata_2024-05-09T20:26:19/with_content_urls_2024-05-09T20:26:36/data/5.jsonl`
[2024-05-09, 22:30:31 UTC] {{logging_mixin.py:150}} INFO -     all 25 records had is_shown_at field removed
[2024-05-09, 22:30:31 UTC] {{logging_mixin.py:150}} INFO -     all 25 records had is_shown_by field removed
[2024-05-09, 22:30:31 UTC] {{logging_mixin.py:150}} INFO -     all 25 records had item_count field removed
[2024-05-09, 22:30:31 UTC] {{logging_mixin.py:150}} INFO -     all 25 records had media_source field removed
[2024-05-09, 22:30:31 UTC] {{logging_mixin.py:150}} INFO -     all 25 records had thumbnail_source field removed
[2024-05-09, 22:30:31 UTC] {{logging_mixin.py:150}} INFO - added 25 records to index `rikolti-stg-combined-20240508151215` from page `28121/vernacular_metadata_2024-05-09T20:26:04/mapped_metadata_2024-05-09T20:26:19/with_content_urls_2024-05-09T20:26:36/data/6.jsonl`
[2024-05-09, 22:30:31 UTC] {{logging_mixin.py:150}} INFO -     all 25 records had is_shown_at field removed
[2024-05-09, 22:30:31 UTC] {{logging_mixin.py:150}} INFO -     all 25 records had is_shown_by field removed
[2024-05-09, 22:30:31 UTC] {{logging_mixin.py:150}} INFO -     all 25 records had item_count field removed
[2024-05-09, 22:30:31 UTC] {{logging_mixin.py:150}} INFO -     all 25 records had media_source field removed
[2024-05-09, 22:30:31 UTC] {{logging_mixin.py:150}} INFO -     all 25 records had thumbnail_source field removed
[2024-05-09, 22:30:32 UTC] {{logging_mixin.py:150}} INFO - added 25 records to index `rikolti-stg-combined-20240508151215` from page `28121/vernacular_metadata_2024-05-09T20:26:04/mapped_metadata_2024-05-09T20:26:19/with_content_urls_2024-05-09T20:26:36/data/7.jsonl`
[2024-05-09, 22:30:32 UTC] {{logging_mixin.py:150}} INFO -     all 25 records had is_shown_at field removed
[2024-05-09, 22:30:32 UTC] {{logging_mixin.py:150}} INFO -     all 25 records had is_shown_by field removed
[2024-05-09, 22:30:32 UTC] {{logging_mixin.py:150}} INFO -     all 25 records had item_count field removed
[2024-05-09, 22:30:32 UTC] {{logging_mixin.py:150}} INFO -     all 25 records had media_source field removed
[2024-05-09, 22:30:32 UTC] {{logging_mixin.py:150}} INFO -     all 25 records had thumbnail_source field removed
[2024-05-09, 22:30:32 UTC] {{logging_mixin.py:150}} INFO - added 25 records to index `rikolti-stg-combined-20240508151215` from page `28121/vernacular_metadata_2024-05-09T20:26:04/mapped_metadata_2024-05-09T20:26:19/with_content_urls_2024-05-09T20:26:36/data/8.jsonl`
[2024-05-09, 22:30:32 UTC] {{logging_mixin.py:150}} INFO -     all 25 records had is_shown_at field removed
[2024-05-09, 22:30:32 UTC] {{logging_mixin.py:150}} INFO -     all 25 records had is_shown_by field removed
[2024-05-09, 22:30:32 UTC] {{logging_mixin.py:150}} INFO -     all 25 records had item_count field removed
[2024-05-09, 22:30:32 UTC] {{logging_mixin.py:150}} INFO -     all 25 records had media_source field removed
[2024-05-09, 22:30:32 UTC] {{logging_mixin.py:150}} INFO -     all 25 records had thumbnail_source field removed
[2024-05-09, 22:30:33 UTC] {{logging_mixin.py:150}} INFO - added 25 records to index `rikolti-stg-combined-20240508151215` from page `28121/vernacular_metadata_2024-05-09T20:26:04/mapped_metadata_2024-05-09T20:26:19/with_content_urls_2024-05-09T20:26:36/data/9.jsonl`
[2024-05-09, 22:30:33 UTC] {{logging_mixin.py:150}} INFO -     all 25 records had is_shown_at field removed
[2024-05-09, 22:30:33 UTC] {{logging_mixin.py:150}} INFO -     all 25 records had is_shown_by field removed
[2024-05-09, 22:30:33 UTC] {{logging_mixin.py:150}} INFO -     all 25 records had item_count field removed
[2024-05-09, 22:30:33 UTC] {{logging_mixin.py:150}} INFO -     all 25 records had media_source field removed
[2024-05-09, 22:30:33 UTC] {{logging_mixin.py:150}} INFO -     all 25 records had thumbnail_source field removed
[2024-05-09, 22:30:33 UTC] {{logging_mixin.py:150}} INFO - 

Review indexed records at: https://rikolti-data.s3.us-west-2.amazonaws.com/index.html#28121/vernacular_metadata_2024-05-09T20:26:04/mapped_metadata_2024-05-09T20:26:19/with_content_urls_2024-05-09T20:26:36/data/ 

Or on opensearch at: https://search-rikolti-2-xxbcriyfw5iqysaj7p3fhhscae.us-west-2.es.amazonaws.com/_dashboards/app/dev_tools#/console with query:
{
  "query": {
    "bool": {
      "filter": {
        "terms": {
          "collection_url": [
            28121
          ]
        }
      }
    }
  }
}
[2024-05-09, 22:30:33 UTC] {{logging_mixin.py:150}} INFO - Message sent to SNS with Message ID: e0d91176-e06c-5930-ad14-4fc1b0a5ac3f
[2024-05-09, 22:30:33 UTC] {{python.py:183}} INFO - Done. Returned value was: None
[2024-05-09, 22:30:33 UTC] {{taskinstance.py:1345}} INFO - Marking task as SUCCESS. dag_id=harvest_collection, task_id=create_stage_index, execution_date=20240509T202553, start_date=20240509T223025, end_date=20240509T223033
[2024-05-09, 22:30:33 UTC] {{local_task_job_runner.py:225}} INFO - Task exited with return code 0
[2024-05-09, 22:30:33 UTC] {{taskinstance.py:2653}} INFO - 0 downstream tasks scheduled from follow-on schedule check
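
Both runs above log the same 409 `version_conflict_engine_exception` entries for `create` actions; the first run then raised an exception reporting 0 errors, while the second finished successfully. As a rough sketch of one way to treat 409 conflicts as warnings and fail only on other errors, the bulk response items could be split as below. This is not the rikolti `bulk_add` implementation: the function names and the `example-new-record` id are hypothetical, and the item shape is assumed from the log entries above.

```python
def split_bulk_items(items):
    """Split bulk response items into (created ids, conflict ids, failure results)."""
    created, conflicts, failures = [], [], []
    for item in items:
        # Each item is keyed by its action name, e.g. {'create': {...}}.
        result = next(iter(item.values()))
        status = result.get("status")
        if status == 409:
            # Document with this _id already exists; a create action will not overwrite it.
            conflicts.append(result["_id"])
        elif "error" in result or (status is not None and status >= 400):
            failures.append(result)
        else:
            created.append(result["_id"])
    return created, conflicts, failures


def check_bulk_items(items):
    """Warn on 409 conflicts; raise only when other errors are present."""
    created, conflicts, failures = split_bulk_items(items)
    for doc_id in conflicts:
        print(f"WARNING - document already exists; not creating: {doc_id}")
    if failures:
        raise RuntimeError(
            f"{len(failures)} errors in bulk indexing {len(items)} records: {failures}"
        )
    return created, conflicts


# Illustrative items: one conflict shaped like the log entries, one hypothetical success.
sample_items = [
    {"create": {"_index": "rikolti-stg-combined-20240508151215",
                "_id": "ark:/21198/zz000100hh", "status": 409,
                "error": {"type": "version_conflict_engine_exception"}}},
    {"create": {"_index": "rikolti-stg-combined-20240508151215",
                "_id": "example-new-record", "status": 201}},
]
created_ids, conflict_ids = check_bulk_items(sample_items)
```

With this split, an exception is raised only when the failure list is non-empty, so a message reporting 0 errors cannot occur.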