ucldc / rikolti

calisphere harvester 2.0
BSD 3-Clause "New" or "Revised" License

[data provider issue] `content harvest` error for a Nuxeo collection: `UnsupportedMimetype: Mime-type 'application/pdf' was pre-checked and recognized as something we don't want to convert.` - contacted UCB ESL #886

Closed christinklez closed 3 months ago

christinklez commented 4 months ago

Mapper: Nuxeo
Collection ID: 28042

Run ID: manual2024-04-23T23:37:41+00:00
Permalink to the log: https://7a8067cb-3b99-477e-a883-7e311175a9b4.c3.us-west-2.airflow.amazonaws.com/log?dag_id=harvest_collection&task_id=content_harvesting.content_harvest&execution_date=2024-04-23T23%3A37%3A41%2B00%3A00&map_index=0
Link to the gridview: https://7a8067cb-3b99-477e-a883-7e311175a9b4.c3.us-west-2.airflow.amazonaws.com/dags/harvest_collection/grid?dag_run_id=manual2024-04-23T23%3A37%3A41%2B00%3A00&task_id=content_harvesting.content_harvest&tab=mapped_tasks&num_runs=365&map_index=0

[2024-04-23, 23:39:31 UTC] {{task_log_fetcher.py:65}} INFO - [2024-04-23, 23:39:15 UTC] Traceback (most recent call last):
[2024-04-23, 23:39:31 UTC] {{task_log_fetcher.py:65}} INFO - [2024-04-23, 23:39:15 UTC]   File "<frozen runpy>", line 198, in _run_module_as_main
[2024-04-23, 23:39:31 UTC] {{task_log_fetcher.py:65}} INFO - [2024-04-23, 23:39:15 UTC]   File "<frozen runpy>", line 88, in _run_code
[2024-04-23, 23:39:31 UTC] {{task_log_fetcher.py:65}} INFO - [2024-04-23, 23:39:15 UTC]   File "/content_harvester/by_page.py", line 98, in <module>
[2024-04-23, 23:39:31 UTC] {{task_log_fetcher.py:65}} INFO - [2024-04-23, 23:39:15 UTC]     print_value.append(harvest_page_content(
[2024-04-23, 23:39:31 UTC] {{task_log_fetcher.py:65}} INFO - [2024-04-23, 23:39:15 UTC]                        ^^^^^^^^^^^^^^^^^^^^^
[2024-04-23, 23:39:31 UTC] {{task_log_fetcher.py:65}} INFO - [2024-04-23, 23:39:15 UTC]   File "/content_harvester/by_page.py", line 31, in harvest_page_content
[2024-04-23, 23:39:31 UTC] {{task_log_fetcher.py:65}} INFO - [2024-04-23, 23:39:15 UTC]     record_with_content = harvest_record_content(
[2024-04-23, 23:39:31 UTC] {{task_log_fetcher.py:65}} INFO - [2024-04-23, 23:39:15 UTC]                           ^^^^^^^^^^^^^^^^^^^^^^^
[2024-04-23, 23:39:31 UTC] {{task_log_fetcher.py:65}} INFO - [2024-04-23, 23:39:15 UTC]   File "/content_harvester/by_record.py", line 71, in harvest_record_content
[2024-04-23, 23:39:31 UTC] {{task_log_fetcher.py:65}} INFO - [2024-04-23, 23:39:15 UTC]     check_media_mimetype(media.src_mime_type)
[2024-04-23, 23:39:31 UTC] {{task_log_fetcher.py:65}} INFO - [2024-04-23, 23:39:15 UTC]   File "/content_harvester/content_types.py", line 54, in check_media_mimetype
[2024-04-23, 23:39:31 UTC] {{task_log_fetcher.py:65}} INFO - [2024-04-23, 23:39:15 UTC]     raise UnsupportedMimetype(
[2024-04-23, 23:39:31 UTC] {{task_log_fetcher.py:65}} INFO - [2024-04-23, 23:39:15 UTC] content_harvester.content_types.UnsupportedMimetype: Mime-type 'application/pdf' was pre-checked and recognized as something we don't want to convert.
[2024-04-23, 23:39:31 UTC] {{task_log_fetcher.py:65}} INFO - [2024-04-23, 23:39:15 UTC] Harvesting content for 43 records at 28042/vernacular_metadata_2024-04-23T23:37:51/mapped_metadata_2024-04-23T23:38:07/data/r-p0.jsonl
[2024-04-23, 23:40:01 UTC] {{ecs.py:619}} INFO - ECS Task stopped, check status: {'tasks': [{'attachments': [{'id': 'ce7774a0-d133-4424-8312-3e046400b2de', 'type': 'ElasticNetworkInterface', 'status': 'DELETED', 'details': [{'name': 'subnetId', 'value': 'subnet-09e65806b80ebad6b'}, {'name': 'networkInterfaceId', 'value': 'eni-03c5eebb8790f24bb'}, {'name': 'macAddress', 'value': '02:7d:e4:78:84:eb'}, {'name': 'privateDnsName', 'value': 'ip-10-192-21-87.us-west-2.compute.internal'}, {'name': 'privateIPv4Address', 'value': '10.192.21.87'}]}], 'attributes': [{'name': 'ecs.cpu-architecture', 'value': 'x86_64'}], 'availabilityZone': 'us-west-2b', 'clusterArn': 'arn:aws:ecs:us-west-2:777968769372:cluster/rikolti-ecs-cluster', 'connectivity': 'CONNECTED', 'connectivityAt': datetime.datetime(2024, 4, 23, 23, 38, 34, 321000, tzinfo=tzlocal()), 'containers': [{'containerArn': 'arn:aws:ecs:us-west-2:777968769372:container/rikolti-ecs-cluster/508576ecbb22411d96a2c2ec36d579a8/0d74e61b-b4a5-4b61-9823-4ef5ec1816bf', 'taskArn': 'arn:aws:ecs:us-west-2:777968769372:task/rikolti-ecs-cluster/508576ecbb22411d96a2c2ec36d579a8', 'name': 'rikolti-content_harvester', 'image': 'public.ecr.aws/b6c7x7s4/rikolti/content_harvester:latest', 'imageDigest': 'sha256:22357206871a0636ee73331963282e84f527c43aaaf74e267378c68e2cdc85bf', 'runtimeId': '508576ecbb22411d96a2c2ec36d579a8-3530892436', 'lastStatus': 'STOPPED', 'exitCode': 1, 'networkBindings': [], 'networkInterfaces': [{'attachmentId': 'ce7774a0-d133-4424-8312-3e046400b2de', 'privateIpv4Address': '10.192.21.87'}], 'healthStatus': 'UNKNOWN', 'cpu': '0'}], 'cpu': '1024', 'createdAt': datetime.datetime(2024, 4, 23, 23, 38, 30, 911000, tzinfo=tzlocal()), 'desiredStatus': 'STOPPED', 'enableExecuteCommand': False, 'executionStoppedAt': datetime.datetime(2024, 4, 23, 23, 39, 15, 76000, tzinfo=tzlocal()), 'group': 'family:rikolti-content_harvester-task-definition', 'healthStatus': 'UNKNOWN', 'lastStatus': 'STOPPED', 'launchType': 'FARGATE', 'memory': 
'3072', 'overrides': {'containerOverrides': [{'name': 'rikolti-content_harvester', 'command': ['28042', '["28042/vernacular_metadata_2024-04-23T23:37:51/mapped_metadata_2024-04-23T23:38:07/data/r-p0.jsonl"]', '28042/vernacular_metadata_2024-04-23T23:37:51/mapped_metadata_2024-04-23T23:38:07/with_content_urls_2024-04-23T23:38:22/', 'nuxeo.nuxeo'], 'environment': [{'name': 'MAPPED_DATA', 'value': 's3://rikolti-data'}, {'name': 'WITH_CONTENT_URL_DATA', 'value': 's3://rikolti-data'}, {'name': 'CONTENT_ROOT', 'value': 's3://rikolti-content'}, {'name': 'NUXEO_USER', 'value': 'Administrator'}, {'name': 'NUXEO_PASS', 'value': 'cable8:ringmasters'}, {'name': 'AWS_RETRY_MODE', 'value': 'standard'}, {'name': 'AWS_MAX_ATTEMPTS', 'value': '10'}]}], 'inferenceAcceleratorOverrides': []}, 'platformVersion': '1.4.0', 'platformFamily': 'Linux', 'pullStartedAt': datetime.datetime(2024, 4, 23, 23, 38, 41, 822000, tzinfo=tzlocal()), 'pullStoppedAt': datetime.datetime(2024, 4, 23, 23, 39, 7, 948000, tzinfo=tzlocal()), 'startedAt': datetime.datetime(2024, 4, 23, 23, 39, 13, 120000, tzinfo=tzlocal()), 'startedBy': 'airflow', 'stopCode': 'EssentialContainerExited', 'stoppedAt': datetime.datetime(2024, 4, 23, 23, 39, 39, 916000, tzinfo=tzlocal()), 'stoppedReason': 'Essential container in task exited', 'stoppingAt': datetime.datetime(2024, 4, 23, 23, 39, 25, 102000, tzinfo=tzlocal()), 'tags': [], 'taskArn': 'arn:aws:ecs:us-west-2:777968769372:task/rikolti-ecs-cluster/508576ecbb22411d96a2c2ec36d579a8', 'taskDefinitionArn': 'arn:aws:ecs:us-west-2:777968769372:task-definition/rikolti-content_harvester-task-definition:6', 'version': 5, 'ephemeralStorage': {'sizeInGiB': 20}}], 'failures': [], 'ResponseMetadata': {'RequestId': '24ff9cdb-d3aa-4498-bf81-1513bcbaba8e', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '24ff9cdb-d3aa-4498-bf81-1513bcbaba8e', 'content-type': 'application/x-amz-json-1.1', 'content-length': '3074', 'date': 'Tue, 23 Apr 2024 23:40:00 GMT'}, 'RetryAttempts': 0}}
[2024-04-23, 23:40:01 UTC] {{taskinstance.py:1824}} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/usr/local/airflow/dags/rikolti/dags/shared_tasks/content_harvest_operators.py", line 116, in execute
    return super().execute(context)
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/utils/session.py", line 76, in wrapper
    return func(*args, session=session, **kwargs)
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/providers/amazon/aws/operators/ecs.py", line 476, in execute
    self._start_wait_check_task(context)
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/providers/amazon/aws/hooks/base_aws.py", line 743, in decorator_f
    return fun(self, *args, **kwargs)
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/providers/amazon/aws/operators/ecs.py", line 511, in _start_wait_check_task
    self._check_success_task()
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/providers/amazon/aws/operators/ecs.py", line 650, in _check_success_task
    raise AirflowException(
airflow.exceptions.AirflowException: This task is not in success state - last 100 logs from Cloudwatch:
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/content_harvester/by_page.py", line 98, in <module>
    print_value.append(harvest_page_content(
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/content_harvester/by_page.py", line 31, in harvest_page_content
    record_with_content = harvest_record_content(
                          ^^^^^^^^^^^^^^^^^^^^^^^
  File "/content_harvester/by_record.py", line 71, in harvest_record_content
    check_media_mimetype(media.src_mime_type)
  File "/content_harvester/content_types.py", line 54, in check_media_mimetype
    raise UnsupportedMimetype(
content_harvester.content_types.UnsupportedMimetype: Mime-type 'application/pdf' was pre-checked and recognized as something we don't want to convert.
Harvesting content for 43 records at 28042/vernacular_metadata_2024-04-23T23:37:51/mapped_metadata_2024-04-23T23:38:07/data/r-p0.jsonl
[2024-04-23, 23:40:01 UTC] {{taskinstance.py:1345}} INFO - Marking task as FAILED. dag_id=harvest_collection, task_id=content_harvesting.content_harvest, map_index=0, execution_date=20240423T233741, start_date=20240423T233829, end_date=20240423T234001
[2024-04-23, 23:40:01 UTC] {{logging_mixin.py:150}} INFO - Message sent to SNS with Message ID: 556a7bd0-a299-5469-b946-2572eb88d084
[2024-04-23, 23:40:01 UTC] {{standard_task_runner.py:104}} ERROR - Failed to execute job 77347 for task content_harvesting.content_harvest (This task is not in success state - last 100 logs from Cloudwatch:
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/content_harvester/by_page.py", line 98, in <module>
    print_value.append(harvest_page_content(
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/content_harvester/by_page.py", line 31, in harvest_page_content
    record_with_content = harvest_record_content(
                          ^^^^^^^^^^^^^^^^^^^^^^^
  File "/content_harvester/by_record.py", line 71, in harvest_record_content
    check_media_mimetype(media.src_mime_type)
  File "/content_harvester/content_types.py", line 54, in check_media_mimetype
    raise UnsupportedMimetype(
content_harvester.content_types.UnsupportedMimetype: Mime-type 'application/pdf' was pre-checked and recognized as something we don't want to convert.
Harvesting content for 43 records at 28042/vernacular_metadata_2024-04-23T23:37:51/mapped_metadata_2024-04-23T23:38:07/data/r-p0.jsonl; 251)

barbarahui commented 4 months ago

It looks like these objects have a Nuxeo type of SampleCustomPicture, but the content file is a PDF. Based on the Nuxeo type, the content harvester expects an image file to convert to JP2; when it encounters the PDF instead, it throws an error. I think we've had this issue with Nuxeo objects before -- do you remember how we resolved it? We could have the contributor re-create these objects as images.
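The mismatch described above can be sketched roughly as follows. This is an illustration only, not the actual `content_types.py` code: the `EXPECTED_MIME_PREFIXES` mapping and the `check_mimetype_for_doc_type` function are hypothetical, and `UnsupportedMimetype` is redefined locally as a stand-in for the exception named in the traceback.

```python
class UnsupportedMimetype(Exception):
    """Local stand-in for the exception named in the traceback."""


# Hypothetical mapping: Nuxeo doc type -> mime-type prefixes it should carry.
EXPECTED_MIME_PREFIXES = {
    "SampleCustomPicture": ("image/",),
    "File": ("application/pdf",),
}


def check_mimetype_for_doc_type(doc_type: str, mime_type: str) -> None:
    """Raise UnsupportedMimetype if mime_type does not fit doc_type."""
    prefixes = EXPECTED_MIME_PREFIXES.get(doc_type, ())
    # str.startswith accepts a tuple; an empty tuple never matches,
    # so unknown doc types are rejected as well.
    if not mime_type.startswith(prefixes):
        raise UnsupportedMimetype(
            f"Doc type {doc_type!r} does not expect mime type {mime_type!r}"
        )
```

Under these assumptions, a `SampleCustomPicture` record whose main file is `application/pdf` raises, while the same record with an `image/tiff` file passes.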

christinklez commented 4 months ago

Thanks @barbarahui!

Nuxeo project folder: https://nuxeo.cdlib.org/nuxeo/nxdoc/default/a2dcac48-b8fb-453a-9c21-88603a24da7f/view_documents

CSphere only has the 28 records published on production. All 43 Nuxeo records were created/modified ~April/June 2023. It looks like CSphere may have last harvested ~Nov 2023.

I suspect the legacy harvester skipped these "image" records that have PDFs as their main file?

christinklez commented 4 months ago

Removing the bug label from this issue. This is a Nuxeo data entry/creation issue.

@aturner let's discuss how we should approach this!

aturner commented 4 months ago

@christinklez @barbarahui -- ah, the Nuxeo object doc type wasn't correctly set at the time the PDFs were imported (it should be the "File" doc type). We've run into this issue before, and my understanding is that there's no way to retroactively change the Nuxeo doc type -- the objects need to be rebuilt.

Christine, I can relay the info to Sine at UCB Ethnic Studies Library and request that the objects be rebuilt. The PDF objects can be picked out from the results view (for rebuilding): https://nuxeo.cdlib.org/nuxeo/nxpath/default/asset-library/UCB/UCB%20Ethnic%20Studies/CES/TWLF%2050th%20Anniversary%20Digital%20Scans@view_documents?tabIds=%3A&conversationId=0NXMAIN6
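If the record metadata were available as a list of dictionaries, the objects needing a rebuild (image doc type, PDF main file) could be picked out with a filter like the one below. This is a minimal sketch: the field names `uid`, `doc_type`, and `mime_type` and the function name are assumptions for illustration, not the Nuxeo API or the Rikolti record schema.

```python
from typing import Iterable


def find_pdf_in_image_records(records: Iterable[dict]) -> list[str]:
    """Return ids of records typed as pictures whose main file is a PDF.

    The keys "uid", "doc_type", and "mime_type" are hypothetical.
    """
    return [
        record["uid"]
        for record in records
        if record.get("doc_type") == "SampleCustomPicture"
        and record.get("mime_type") == "application/pdf"
    ]
```

Records with a missing `doc_type` or `mime_type` are simply not reported, since `dict.get` returns `None` for absent keys.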

aturner commented 3 months ago

FD ticket -- message to Sine at UCB Ethnic Studies Library: https://help.oac.cdlib.org/a/tickets/137436

christinklez commented 3 months ago

UCB has moved these objects into a "do not publish" folder. The collection now harvests all the way through!

Closing this issue as resolved. When UCB updates these objects, they will send over a harvesting request / update.