ucldc / rikolti

calisphere harvester 2.0
BSD 3-Clause "New" or "Revised" License

"content_harvest" fails for entire collection `26203`; 1 item (?) lacks an image file #933

Closed aturner closed 6 months ago

aturner commented 6 months ago

Issue: CSU Sacramento's collection 26203 (CONTENTdm) fails at the "content_harvest" step. From the logs, it appears to be tripping on 1 item (?), an image object; the error is similar to https://github.com/ucldc/rikolti/issues/932. The legacy harvester somehow took the object in, and we're currently displaying a missing thumbnail for it. But the object in CONTENTdm does appear to have an image file.

For this case, should we <gulp!> "edit-and-forget" it by stashing an image file in S3 to use as the thumbnail? Or should we skip over/exclude the item?
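For discussion, here is a rough sketch of what the "skip over/exclude" option could look like at the page level. It is not the actual rikolti code: `harvest_page_tolerant` and its `harvest_record` parameter are hypothetical names, and the only thing taken from the logs below is that one record raising inside the per-record harvest currently takes down the whole page task. The idea is to catch the failure per record, note the `calisphere-id`, and keep going.

```python
import logging
from typing import Any, Callable, Dict, List, Tuple

logger = logging.getLogger(__name__)

Record = Dict[str, Any]


def harvest_page_tolerant(
    records: List[Record],
    harvest_record: Callable[[Record], Record],
) -> Tuple[List[Record], List[Tuple[Any, str]]]:
    """Harvest each record, skipping (and reporting) the ones that fail.

    `harvest_record` is a stand-in for whatever does the per-record work;
    it is an assumption for this sketch, not a real rikolti function.
    """
    harvested: List[Record] = []
    skipped: List[Tuple[Any, str]] = []
    for record in records:
        try:
            harvested.append(harvest_record(record))
        except Exception as exc:  # broad on purpose: one bad item should not fail the page
            calisphere_id = record.get("calisphere-id")
            logger.warning("Skipping %s: %s", calisphere_id, exc)
            skipped.append((calisphere_id, str(exc)))
    return harvested, skipped
```

The `skipped` list would give us something to report back at the end of the run, so excluded items stay visible instead of silently disappearing.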

Collection ID: 26203

Rikolti mapper type: ETL

Airflow Run ID: manual__2024-05-09T22:08:22+00:00

Airflow log: https://7a8067cb-3b99-477e-a883-7e311175a9b4.c3.us-west-2.airflow.amazonaws.com/dags/harvest_collection/grid?base_date=manual__2024-05-09T22%3A08%3A22%2B00%3A00&num_runs=365&dag_run_id=manual__2024-05-09T22%3A08%3A22%2B00%3A00

Item:

==

ip-10-192-21-8.us-west-2.compute.internal
*** Reading remote log from Cloudwatch log_group: airflow-pad-airflow-mwaa-Task log_stream: dag_id=harvest_collection/run_id=manual__2024-05-09T22_08_22+00_00/task_id=content_harvesting.content_harvest/map_index=16/attempt=1.log.
[2024-05-09, 15:14:52 PDT] {{taskinstance.py:1103}} INFO - Dependencies all met for dep_context=non-requeueable deps ti=<TaskInstance: harvest_collection.content_harvesting.content_harvest manual__2024-05-09T22:08:22+00:00 map_index=16 [queued]>
[2024-05-09, 15:14:52 PDT] {{taskinstance.py:1103}} INFO - Dependencies all met for dep_context=requeueable deps ti=<TaskInstance: harvest_collection.content_harvesting.content_harvest manual__2024-05-09T22:08:22+00:00 map_index=16 [queued]>
[2024-05-09, 15:14:52 PDT] {{taskinstance.py:1308}} INFO - Starting attempt 1 of 1
[2024-05-09, 15:14:52 PDT] {{taskinstance.py:1327}} INFO - Executing <Mapped(ContentHarvestEcsOperator): content_harvesting.content_harvest> on 2024-05-09 22:08:22+00:00
[2024-05-09, 15:14:52 PDT] {{standard_task_runner.py:57}} INFO - Started process 307 to run task
[2024-05-09, 15:14:52 PDT] {{standard_task_runner.py:84}} INFO - Running: ['airflow', 'tasks', 'run', 'harvest_collection', 'content_harvesting.content_harvest', 'manual__2024-05-09T22:08:22+00:00', '--job-id', '129236', '--raw', '--subdir', 'DAGS_FOLDER/rikolti/dags/harvest_dag.py', '--cfg-path', '/tmp/tmpuht5ouyz', '--map-index', '16']
[2024-05-09, 15:14:52 PDT] {{standard_task_runner.py:85}} INFO - Job 129236: Subtask content_harvesting.content_harvest
[2024-05-09, 15:14:53 PDT] {{task_command.py:410}} INFO - Running <TaskInstance: harvest_collection.content_harvesting.content_harvest manual__2024-05-09T22:08:22+00:00 map_index=16 [running]> on host ip-10-192-21-8.us-west-2.compute.internal
[2024-05-09, 15:14:54 PDT] {{taskinstance.py:1545}} INFO - Exporting env vars: AIRFLOW_CTX_DAG_OWNER='airflow' AIRFLOW_CTX_DAG_ID='harvest_collection' AIRFLOW_CTX_TASK_ID='content_harvesting.content_harvest' AIRFLOW_CTX_EXECUTION_DATE='2024-05-09T22:08:22+00:00' AIRFLOW_CTX_TRY_NUMBER='1' AIRFLOW_CTX_DAG_RUN_ID='manual__2024-05-09T22:08:22+00:00'
[2024-05-09, 15:14:54 PDT] {{ecs.py:468}} INFO - Running ECS Task - Task definition: rikolti-content_harvester-task-definition - on cluster rikolti-ecs-cluster
[2024-05-09, 15:14:54 PDT] {{ecs.py:471}} INFO - EcsOperator overrides: {'containerOverrides': [{'name': 'rikolti-content_harvester', 'command': ['26203', '["26203/vernacular_metadata_2024-05-09T22:09:11/mapped_metadata_2024-05-09T22:13:16/data/128.jsonl", "26203/vernacular_metadata_2024-05-09T22:09:11/mapped_metadata_2024-05-09T22:13:16/data/129.jsonl", "26203/vernacular_metadata_2024-05-09T22:09:11/mapped_metadata_2024-05-09T22:13:16/data/130.jsonl", "26203/vernacular_metadata_2024-05-09T22:09:11/mapped_metadata_2024-05-09T22:13:16/data/131.jsonl", "26203/vernacular_metadata_2024-05-09T22:09:11/mapped_metadata_2024-05-09T22:13:16/data/132.jsonl", "26203/vernacular_metadata_2024-05-09T22:09:11/mapped_metadata_2024-05-09T22:13:16/data/133.jsonl", "26203/vernacular_metadata_2024-05-09T22:09:11/mapped_metadata_2024-05-09T22:13:16/data/134.jsonl", "26203/vernacular_metadata_2024-05-09T22:09:11/mapped_metadata_2024-05-09T22:13:16/data/135.jsonl"]', '26203/vernacular_metadata_2024-05-09T22:09:11/mapped_metadata_2024-05-09T22:13:16/with_content_urls_2024-05-09T22:14:15/', 'calisphere_solr.calisphere_solr'], 'environment': [{'name': 'MAPPED_DATA', 'value': 's3://rikolti-data'}, {'name': 'WITH_CONTENT_URL_DATA', 'value': 's3://rikolti-data'}, {'name': 'CONTENT_ROOT', 'value': 's3://rikolti-content'}, {'name': 'NUXEO_USER', 'value': 'Administrator'}, {'name': 'NUXEO_PASS', 'value': 'cable8:ringmasters'}, {'name': 'AWS_RETRY_MODE', 'value': 'standard'}, {'name': 'AWS_MAX_ATTEMPTS', 'value': '10'}]}]}
[2024-05-09, 15:14:54 PDT] {{base.py:73}} INFO - Using connection ID 'aws_default' for task execution.
[2024-05-09, 15:14:54 PDT] {{ecs.py:576}} INFO - No active previously launched task found to reattach
[2024-05-09, 15:14:55 PDT] {{ecs.py:548}} INFO - ECS Task started: {'tasks': [{'attachments': [{'id': '39576f8e-ce84-4684-b789-09f5d7cd68e0', 'type': 'ElasticNetworkInterface', 'status': 'PRECREATED', 'details': [{'name': 'subnetId', 'value': 'subnet-09e65806b80ebad6b'}]}], 'attributes': [{'name': 'ecs.cpu-architecture', 'value': 'x86_64'}], 'availabilityZone': 'us-west-2b', 'clusterArn': 'arn:aws:ecs:us-west-2:777968769372:cluster/rikolti-ecs-cluster', 'containers': [{'containerArn': 'arn:aws:ecs:us-west-2:777968769372:container/rikolti-ecs-cluster/89364d0c40304e87888bcbda46d3dd04/ba35a920-7540-46af-ba4d-40a4fb8edf68', 'taskArn': 'arn:aws:ecs:us-west-2:777968769372:task/rikolti-ecs-cluster/89364d0c40304e87888bcbda46d3dd04', 'name': 'rikolti-content_harvester', 'image': 'public.ecr.aws/b6c7x7s4/rikolti/content_harvester:latest', 'lastStatus': 'PENDING', 'networkInterfaces': [], 'cpu': '0'}], 'cpu': '1024', 'createdAt': datetime.datetime(2024, 5, 9, 22, 14, 55, 216000, tzinfo=tzlocal()), 'desiredStatus': 'RUNNING', 'enableExecuteCommand': False, 'group': 'family:rikolti-content_harvester-task-definition', 'lastStatus': 'PROVISIONING', 'launchType': 'FARGATE', 'memory': '3072', 'overrides': {'containerOverrides': [{'name': 'rikolti-content_harvester', 'command': ['26203', '["26203/vernacular_metadata_2024-05-09T22:09:11/mapped_metadata_2024-05-09T22:13:16/data/128.jsonl", "26203/vernacular_metadata_2024-05-09T22:09:11/mapped_metadata_2024-05-09T22:13:16/data/129.jsonl", "26203/vernacular_metadata_2024-05-09T22:09:11/mapped_metadata_2024-05-09T22:13:16/data/130.jsonl", "26203/vernacular_metadata_2024-05-09T22:09:11/mapped_metadata_2024-05-09T22:13:16/data/131.jsonl", "26203/vernacular_metadata_2024-05-09T22:09:11/mapped_metadata_2024-05-09T22:13:16/data/132.jsonl", "26203/vernacular_metadata_2024-05-09T22:09:11/mapped_metadata_2024-05-09T22:13:16/data/133.jsonl", "26203/vernacular_metadata_2024-05-09T22:09:11/mapped_metadata_2024-05-09T22:13:16/data/134.jsonl", 
"26203/vernacular_metadata_2024-05-09T22:09:11/mapped_metadata_2024-05-09T22:13:16/data/135.jsonl"]', '26203/vernacular_metadata_2024-05-09T22:09:11/mapped_metadata_2024-05-09T22:13:16/with_content_urls_2024-05-09T22:14:15/', 'calisphere_solr.calisphere_solr'], 'environment': [{'name': 'MAPPED_DATA', 'value': 's3://rikolti-data'}, {'name': 'WITH_CONTENT_URL_DATA', 'value': 's3://rikolti-data'}, {'name': 'CONTENT_ROOT', 'value': 's3://rikolti-content'}, {'name': 'NUXEO_USER', 'value': 'Administrator'}, {'name': 'NUXEO_PASS', 'value': 'cable8:ringmasters'}, {'name': 'AWS_RETRY_MODE', 'value': 'standard'}, {'name': 'AWS_MAX_ATTEMPTS', 'value': '10'}]}], 'inferenceAcceleratorOverrides': []}, 'platformVersion': '1.4.0', 'platformFamily': 'Linux', 'startedBy': 'airflow', 'tags': [], 'taskArn': 'arn:aws:ecs:us-west-2:777968769372:task/rikolti-ecs-cluster/89364d0c40304e87888bcbda46d3dd04', 'taskDefinitionArn': 'arn:aws:ecs:us-west-2:777968769372:task-definition/rikolti-content_harvester-task-definition:6', 'version': 1, 'ephemeralStorage': {'sizeInGiB': 20}}], 'failures': [], 'ResponseMetadata': {'RequestId': '70d083a5-4b25-470a-87f9-4128427c89a8', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '70d083a5-4b25-470a-87f9-4128427c89a8', 'content-type': 'application/x-amz-json-1.1', 'content-length': '2907', 'date': 'Thu, 09 May 2024 22:14:54 GMT'}, 'RetryAttempts': 0}}
[2024-05-09, 15:14:55 PDT] {{ecs.py:551}} INFO - ECS task ID is: 89364d0c40304e87888bcbda46d3dd04
[2024-05-09, 15:14:55 PDT] {{ecs.py:499}} INFO - Starting ECS Task Log Fetcher
[2024-05-09, 15:15:25 PDT] {{base.py:73}} INFO - Using connection ID 'aws_default' for task execution.
[2024-05-09, 15:15:55 PDT] {{task_log_fetcher.py:65}} INFO - [2024-05-09 22:15:34,581] Traceback (most recent call last):
[2024-05-09, 15:15:55 PDT] {{task_log_fetcher.py:65}} INFO - [2024-05-09 22:15:34,581]   File "/content_harvester/by_record.py", line 153, in get_dimensions
[2024-05-09, 15:15:55 PDT] {{task_log_fetcher.py:65}} INFO - [2024-05-09 22:15:34,581]     return Image.open(filepath).size
[2024-05-09, 15:15:55 PDT] {{task_log_fetcher.py:65}} INFO - [2024-05-09 22:15:34,581]            ^^^^^^^^^^^^^^^^^^^^
[2024-05-09, 15:15:55 PDT] {{task_log_fetcher.py:65}} INFO - [2024-05-09 22:15:34,581]   File "/usr/local/lib/python3.12/site-packages/PIL/Image.py", line 3339, in open
[2024-05-09, 15:15:55 PDT] {{task_log_fetcher.py:65}} INFO - [2024-05-09 22:15:34,582]     raise UnidentifiedImageError(msg)
[2024-05-09, 15:15:55 PDT] {{task_log_fetcher.py:65}} INFO - [2024-05-09 22:15:34,582] PIL.UnidentifiedImageError: cannot identify image file '/tmp/2e182ecbb93001f342368206b0a8a63d'
[2024-05-09, 15:15:55 PDT] {{task_log_fetcher.py:65}} INFO - [2024-05-09 22:15:34,582] During handling of the above exception, another exception occurred:
[2024-05-09, 15:15:55 PDT] {{task_log_fetcher.py:65}} INFO - [2024-05-09 22:15:34,582] Traceback (most recent call last):
[2024-05-09, 15:15:55 PDT] {{task_log_fetcher.py:65}} INFO - [2024-05-09 22:15:34,582]   File "<frozen runpy>", line 198, in _run_module_as_main
[2024-05-09, 15:15:55 PDT] {{task_log_fetcher.py:65}} INFO - [2024-05-09 22:15:34,582]   File "<frozen runpy>", line 88, in _run_code
[2024-05-09, 15:15:55 PDT] {{task_log_fetcher.py:65}} INFO - [2024-05-09 22:15:34,582]   File "/content_harvester/by_page.py", line 98, in <module>
[2024-05-09, 15:15:55 PDT] {{task_log_fetcher.py:65}} INFO - [2024-05-09 22:15:34,582]     print_value.append(harvest_page_content(
[2024-05-09, 15:15:55 PDT] {{task_log_fetcher.py:65}} INFO - [2024-05-09 22:15:34,582]                        ^^^^^^^^^^^^^^^^^^^^^
[2024-05-09, 15:15:55 PDT] {{task_log_fetcher.py:65}} INFO - [2024-05-09 22:15:34,582]   File "/content_harvester/by_page.py", line 31, in harvest_page_content
[2024-05-09, 15:15:55 PDT] {{task_log_fetcher.py:65}} INFO - [2024-05-09 22:15:34,582]     record_with_content = harvest_record_content(
[2024-05-09, 15:15:55 PDT] {{task_log_fetcher.py:65}} INFO - [2024-05-09 22:15:34,582]                           ^^^^^^^^^^^^^^^^^^^^^^^
[2024-05-09, 15:15:55 PDT] {{task_log_fetcher.py:65}} INFO - [2024-05-09 22:15:34,583]   File "/content_harvester/by_record.py", line 113, in harvest_record_content
[2024-05-09, 15:15:55 PDT] {{task_log_fetcher.py:65}} INFO - [2024-05-09 22:15:34,583]     dimensions = get_dimensions(thumbnail.tmp_filepath, record['calisphere-id'])
[2024-05-09, 15:15:55 PDT] {{task_log_fetcher.py:65}} INFO - [2024-05-09 22:15:34,583]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2024-05-09, 15:15:55 PDT] {{task_log_fetcher.py:65}} INFO - [2024-05-09 22:15:34,583]   File "/content_harvester/by_record.py", line 155, in get_dimensions
[2024-05-09, 15:15:55 PDT] {{task_log_fetcher.py:65}} INFO - [2024-05-09 22:15:34,583]     raise Exception(
[2024-05-09, 15:15:55 PDT] {{task_log_fetcher.py:65}} INFO - [2024-05-09 22:15:34,583] Exception: PIL.UnidentifiedImageError for calisphere-id oai:cdm16855.contentdm.oclc.org:p16855coll4/55370: cannot identify image file '/tmp/2e182ecbb93001f342368206b0a8a63d'
[2024-05-09, 15:15:55 PDT] {{task_log_fetcher.py:65}} INFO - [2024-05-09 22:15:34,584] Harvesting content for 100 records at 26203/vernacular_metadata_2024-05-09T22:09:11/mapped_metadata_2024-05-09T22:13:16/data/128.jsonl
[2024-05-09, 15:16:26 PDT] {{ecs.py:619}} INFO - ECS Task stopped, check status: {'tasks': [{'attachments': [{'id': '39576f8e-ce84-4684-b789-09f5d7cd68e0', 'type': 'ElasticNetworkInterface', 'status': 'DELETED', 'details': [{'name': 'subnetId', 'value': 'subnet-09e65806b80ebad6b'}, {'name': 'networkInterfaceId', 'value': 'eni-015a6b0d0fa5610da'}, {'name': 'macAddress', 'value': '02:49:30:2c:da:31'}, {'name': 'privateDnsName', 'value': 'ip-10-192-21-125.us-west-2.compute.internal'}, {'name': 'privateIPv4Address', 'value': '10.192.21.125'}]}], 'attributes': [{'name': 'ecs.cpu-architecture', 'value': 'x86_64'}], 'availabilityZone': 'us-west-2b', 'clusterArn': 'arn:aws:ecs:us-west-2:777968769372:cluster/rikolti-ecs-cluster', 'connectivity': 'CONNECTED', 'connectivityAt': datetime.datetime(2024, 5, 9, 22, 15, 0, 300000, tzinfo=tzlocal()), 'containers': [{'containerArn': 'arn:aws:ecs:us-west-2:777968769372:container/rikolti-ecs-cluster/89364d0c40304e87888bcbda46d3dd04/ba35a920-7540-46af-ba4d-40a4fb8edf68', 'taskArn': 'arn:aws:ecs:us-west-2:777968769372:task/rikolti-ecs-cluster/89364d0c40304e87888bcbda46d3dd04', 'name': 'rikolti-content_harvester', 'image': 'public.ecr.aws/b6c7x7s4/rikolti/content_harvester:latest', 'imageDigest': 'sha256:88220ec5e1e0c5d26b9c818d837ff6bf7f3307efefcdf864dce0a4fad14dd6f4', 'runtimeId': '89364d0c40304e87888bcbda46d3dd04-3530892436', 'lastStatus': 'STOPPED', 'exitCode': 1, 'networkBindings': [], 'networkInterfaces': [{'attachmentId': '39576f8e-ce84-4684-b789-09f5d7cd68e0', 'privateIpv4Address': '10.192.21.125'}], 'healthStatus': 'UNKNOWN', 'cpu': '0'}], 'cpu': '1024', 'createdAt': datetime.datetime(2024, 5, 9, 22, 14, 55, 216000, tzinfo=tzlocal()), 'desiredStatus': 'STOPPED', 'enableExecuteCommand': False, 'executionStoppedAt': datetime.datetime(2024, 5, 9, 22, 15, 34, 647000, tzinfo=tzlocal()), 'group': 'family:rikolti-content_harvester-task-definition', 'healthStatus': 'UNKNOWN', 'lastStatus': 'STOPPED', 'launchType': 'FARGATE', 'memory': 
'3072', 'overrides': {'containerOverrides': [{'name': 'rikolti-content_harvester', 'command': ['26203', '["26203/vernacular_metadata_2024-05-09T22:09:11/mapped_metadata_2024-05-09T22:13:16/data/128.jsonl", "26203/vernacular_metadata_2024-05-09T22:09:11/mapped_metadata_2024-05-09T22:13:16/data/129.jsonl", "26203/vernacular_metadata_2024-05-09T22:09:11/mapped_metadata_2024-05-09T22:13:16/data/130.jsonl", "26203/vernacular_metadata_2024-05-09T22:09:11/mapped_metadata_2024-05-09T22:13:16/data/131.jsonl", "26203/vernacular_metadata_2024-05-09T22:09:11/mapped_metadata_2024-05-09T22:13:16/data/132.jsonl", "26203/vernacular_metadata_2024-05-09T22:09:11/mapped_metadata_2024-05-09T22:13:16/data/133.jsonl", "26203/vernacular_metadata_2024-05-09T22:09:11/mapped_metadata_2024-05-09T22:13:16/data/134.jsonl", "26203/vernacular_metadata_2024-05-09T22:09:11/mapped_metadata_2024-05-09T22:13:16/data/135.jsonl"]', '26203/vernacular_metadata_2024-05-09T22:09:11/mapped_metadata_2024-05-09T22:13:16/with_content_urls_2024-05-09T22:14:15/', 'calisphere_solr.calisphere_solr'], 'environment': [{'name': 'MAPPED_DATA', 'value': 's3://rikolti-data'}, {'name': 'WITH_CONTENT_URL_DATA', 'value': 's3://rikolti-data'}, {'name': 'CONTENT_ROOT', 'value': 's3://rikolti-content'}, {'name': 'NUXEO_USER', 'value': 'Administrator'}, {'name': 'NUXEO_PASS', 'value': 'cable8:ringmasters'}, {'name': 'AWS_RETRY_MODE', 'value': 'standard'}, {'name': 'AWS_MAX_ATTEMPTS', 'value': '10'}]}], 'inferenceAcceleratorOverrides': []}, 'platformVersion': '1.4.0', 'platformFamily': 'Linux', 'pullStartedAt': datetime.datetime(2024, 5, 9, 22, 15, 8, 836000, tzinfo=tzlocal()), 'pullStoppedAt': datetime.datetime(2024, 5, 9, 22, 15, 27, 938000, tzinfo=tzlocal()), 'startedAt': datetime.datetime(2024, 5, 9, 22, 15, 31, 977000, tzinfo=tzlocal()), 'startedBy': 'airflow', 'stopCode': 'EssentialContainerExited', 'stoppedAt': datetime.datetime(2024, 5, 9, 22, 15, 57, 548000, tzinfo=tzlocal()), 'stoppedReason': 'Essential container in 
task exited', 'stoppingAt': datetime.datetime(2024, 5, 9, 22, 15, 44, 674000, tzinfo=tzlocal()), 'tags': [], 'taskArn': 'arn:aws:ecs:us-west-2:777968769372:task/rikolti-ecs-cluster/89364d0c40304e87888bcbda46d3dd04', 'taskDefinitionArn': 'arn:aws:ecs:us-west-2:777968769372:task-definition/rikolti-content_harvester-task-definition:6', 'version': 5, 'ephemeralStorage': {'sizeInGiB': 20}}], 'failures': [], 'ResponseMetadata': {'RequestId': 'e2ea2f6e-adf9-4598-85ec-07127d28f5dd', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': 'e2ea2f6e-adf9-4598-85ec-07127d28f5dd', 'content-type': 'application/x-amz-json-1.1', 'content-length': '3809', 'date': 'Thu, 09 May 2024 22:16:25 GMT'}, 'RetryAttempts': 0}}
[2024-05-09, 15:16:26 PDT] {{taskinstance.py:1824}} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/usr/local/airflow/dags/rikolti/dags/shared_tasks/content_harvest_operators.py", line 116, in execute
    return super().execute(context)
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/utils/session.py", line 76, in wrapper
    return func(*args, session=session, **kwargs)
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/providers/amazon/aws/operators/ecs.py", line 476, in execute
    self._start_wait_check_task(context)
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/providers/amazon/aws/hooks/base_aws.py", line 743, in decorator_f
    return fun(self, *args, **kwargs)
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/providers/amazon/aws/operators/ecs.py", line 511, in _start_wait_check_task
    self._check_success_task()
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/providers/amazon/aws/operators/ecs.py", line 650, in _check_success_task
    raise AirflowException(
airflow.exceptions.AirflowException: This task is not in success state - last 100 logs from Cloudwatch:
Traceback (most recent call last):
  File "/content_harvester/by_record.py", line 153, in get_dimensions
    return Image.open(filepath).size
           ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/PIL/Image.py", line 3339, in open
    raise UnidentifiedImageError(msg)
PIL.UnidentifiedImageError: cannot identify image file '/tmp/2e182ecbb93001f342368206b0a8a63d'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/content_harvester/by_page.py", line 98, in <module>
    print_value.append(harvest_page_content(
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/content_harvester/by_page.py", line 31, in harvest_page_content
    record_with_content = harvest_record_content(
                          ^^^^^^^^^^^^^^^^^^^^^^^
  File "/content_harvester/by_record.py", line 113, in harvest_record_content
    dimensions = get_dimensions(thumbnail.tmp_filepath, record['calisphere-id'])
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/content_harvester/by_record.py", line 155, in get_dimensions
    raise Exception(
Exception: PIL.UnidentifiedImageError for calisphere-id oai:cdm16855.contentdm.oclc.org:p16855coll4/55370: cannot identify image file '/tmp/2e182ecbb93001f342368206b0a8a63d'
Harvesting content for 100 records at 26203/vernacular_metadata_2024-05-09T22:09:11/mapped_metadata_2024-05-09T22:13:16/data/128.jsonl
[2024-05-09, 15:16:26 PDT] {{taskinstance.py:1345}} INFO - Marking task as FAILED. dag_id=harvest_collection, task_id=content_harvesting.content_harvest, map_index=16, execution_date=20240509T220822, start_date=20240509T221452, end_date=20240509T221626
[2024-05-09, 15:16:26 PDT] {{logging_mixin.py:150}} INFO - Message sent to SNS with Message ID: fe55873e-5fcd-5f9b-9f04-d98912b7e452
[2024-05-09, 15:16:26 PDT] {{standard_task_runner.py:104}} ERROR - Failed to execute job 129236 for task content_harvesting.content_harvest (This task is not in success state - last 100 logs from Cloudwatch:
Traceback (most recent call last):
  File "/content_harvester/by_record.py", line 153, in get_dimensions
    return Image.open(filepath).size
           ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/PIL/Image.py", line 3339, in open
    raise UnidentifiedImageError(msg)
PIL.UnidentifiedImageError: cannot identify image file '/tmp/2e182ecbb93001f342368206b0a8a63d'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/content_harvester/by_page.py", line 98, in <module>
    print_value.append(harvest_page_content(
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/content_harvester/by_page.py", line 31, in harvest_page_content
    record_with_content = harvest_record_content(
                          ^^^^^^^^^^^^^^^^^^^^^^^
  File "/content_harvester/by_record.py", line 113, in harvest_record_content
    dimensions = get_dimensions(thumbnail.tmp_filepath, record['calisphere-id'])
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/content_harvester/by_record.py", line 155, in get_dimensions
    raise Exception(
Exception: PIL.UnidentifiedImageError for calisphere-id oai:cdm16855.contentdm.oclc.org:p16855coll4/55370: cannot identify image file '/tmp/2e182ecbb93001f342368206b0a8a63d'
Harvesting content for 100 records at 26203/vernacular_metadata_2024-05-09T22:09:11/mapped_metadata_2024-05-09T22:13:16/data/128.jsonl; 307)
[2024-05-09, 15:16:26 PDT] {{local_task_job_runner.py:225}} INFO - Task exited with return code 1
[2024-05-09, 15:16:26 PDT] {{taskinstance.py:2653}} INFO - 0 downstream tasks scheduled from follow-on schedule check

aturner commented 6 months ago

Resolved by the PR for issue 899.