ucldc / rikolti

calisphere harvester 2.0
BSD 3-Clause "New" or "Revised" License
7 stars 3 forks source link

[low priority bug] `content harvest` error for a Nuxeo collection: `outofmemoryerror: container killed due to memory usage` -- supersized Nuxeo images #885

Open christinklez opened 5 months ago

christinklez commented 5 months ago

Mapper: Nuxeo Collection ID: 26973

Run ID: manual2024-04-24T00:05:08+00:00 Permalink to the log: https://7a8067cb-3b99-477e-a883-7e311175a9b4.c3.us-west-2.airflow.amazonaws.com/log?dag_id=harvest_collection&task_id=content_harvesting.content_harvest&execution_date=2024-04-24T00%3A05%3A08%2B00%3A00&map_index=0 Link to the gridview: https://7a8067cb-3b99-477e-a883-7e311175a9b4.c3.us-west-2.airflow.amazonaws.com/dags/harvest_collection/grid?dag_run_id=manual2024-04-24T00%3A05%3A08%2B00%3A00&task_id=content_harvesting.content_harvest&tab=mapped_tasks&map_index=0&num_runs=365

[2024-04-24, 00:06:24 UTC] {{base.py:73}} INFO - Using connection ID 'aws_default' for task execution.
[2024-04-24, 00:07:54 UTC] {{task_log_fetcher.py:65}} INFO - [2024-04-24, 00:07:49 UTC] Harvesting content for 1 records at 26973/vernacular_metadata_2024-04-24T00:05:18/mapped_metadata_2024-04-24T00:05:30/data/r-p0.jsonl
[2024-04-24, 00:07:54 UTC] {{task_log_fetcher.py:65}} INFO - [2024-04-24, 00:07:49 UTC] Mime-type 'image/tiff' was pre-checked and recognized as something we can try to convert.
[2024-04-24, 00:07:54 UTC] {{task_log_fetcher.py:65}} INFO - [2024-04-24, 00:07:49 UTC] --------------------------------
[2024-04-24, 00:07:54 UTC] {{task_log_fetcher.py:65}} INFO - [2024-04-24, 00:07:49 UTC] ERROR: make_jp2 command failed: ['/usr/bin/convert', '-quiet', '-format', '-jp2', '-define', 'jp2:rate=10', '/tmp/CiudaddeMexico.tif[0]', '/tmp/CiudaddeMexico.jp2']
[2024-04-24, 00:07:54 UTC] {{task_log_fetcher.py:65}} INFO - [2024-04-24, 00:07:49 UTC] returncode was: -9
[2024-04-24, 00:07:54 UTC] {{task_log_fetcher.py:65}} INFO - [2024-04-24, 00:07:49 UTC] output was: 
[2024-04-24, 00:07:54 UTC] {{task_log_fetcher.py:65}} INFO - [2024-04-24, 00:07:49 UTC] --------------------------------
[2024-04-24, 00:07:54 UTC] {{task_log_fetcher.py:65}} INFO - [2024-04-24, 00:07:49 UTC] 26973/vernacular_metadata_2024-04-24T00:05:18/mapped_metadata_2024-04-24T00:05:30/data/r-p0.jsonl
[2024-04-24, 00:07:54 UTC] {{task_log_fetcher.py:65}} INFO - [2024-04-24, 00:07:49 UTC] Harvested 0 media records
[2024-04-24, 00:07:54 UTC] {{task_log_fetcher.py:65}} INFO - [2024-04-24, 00:07:49 UTC] 1/1 records described a media source
[2024-04-24, 00:07:54 UTC] {{task_log_fetcher.py:65}} INFO - [2024-04-24, 00:07:49 UTC] Source Media Mimetypes: Counter({'image/tiff': 1})
[2024-04-24, 00:07:54 UTC] {{task_log_fetcher.py:65}} INFO - [2024-04-24, 00:07:49 UTC] Destination Media Mimetypes: Counter({None: 1})
[2024-04-24, 00:07:54 UTC] {{task_log_fetcher.py:65}} INFO - [2024-04-24, 00:07:49 UTC] Harvested 1 thumbnail records
[2024-04-24, 00:07:54 UTC] {{task_log_fetcher.py:65}} INFO - [2024-04-24, 00:07:49 UTC] 1/1 described a thumbnail source
[2024-04-24, 00:07:54 UTC] {{task_log_fetcher.py:65}} INFO - [2024-04-24, 00:07:49 UTC] Source Thumbnail Mimetypes: Counter({'image/jpeg': 1})
[2024-04-24, 00:07:54 UTC] {{task_log_fetcher.py:65}} INFO - [2024-04-24, 00:07:49 UTC] Destination Thumbnail Mimetypes: Counter({'image/jpeg': 1})
[2024-04-24, 00:07:54 UTC] {{task_log_fetcher.py:65}} INFO - [2024-04-24, 00:07:49 UTC] [{'thumb_source_mimetypes': Counter({'image/jpeg': 1}), 'thumb_mimetypes': Counter({'image/jpeg': 1}), 'media_source_mimetypes': Counter({'image/tiff': 1}), 'media_mimetypes': Counter({None: 1}), 'children': 0, 'records': 1, 'with_content_urls_filepath': '26973/vernacular_metadata_2024-04-24T00:05:18/mapped_metadata_2024-04-24T00:05:30/with_content_urls_2024-04-24T00:05:47/data/r-p0.jsonl'}]
[2024-04-24, 00:08:25 UTC] {{ecs.py:619}} INFO - ECS Task stopped, check status: {'tasks': [{'attachments': [{'id': 'df00839f-4c84-4485-bb8e-0fbb21127f9b', 'type': 'ElasticNetworkInterface', 'status': 'DELETED', 'details': [{'name': 'subnetId', 'value': 'subnet-09e65806b80ebad6b'}, {'name': 'networkInterfaceId', 'value': 'eni-0bbc62c5ab40c2f9f'}, {'name': 'macAddress', 'value': '02:f7:14:30:a9:3d'}, {'name': 'privateDnsName', 'value': 'ip-10-192-21-186.us-west-2.compute.internal'}, {'name': 'privateIPv4Address', 'value': '10.192.21.186'}]}], 'attributes': [{'name': 'ecs.cpu-architecture', 'value': 'x86_64'}], 'availabilityZone': 'us-west-2b', 'clusterArn': 'arn:aws:ecs:us-west-2:777968769372:cluster/rikolti-ecs-cluster', 'connectivity': 'CONNECTED', 'connectivityAt': datetime.datetime(2024, 4, 24, 0, 5, 57, 673000, tzinfo=tzlocal()), 'containers': [{'containerArn': 'arn:aws:ecs:us-west-2:777968769372:container/rikolti-ecs-cluster/9a86a3a2b74f4b859c29b7983f047218/047d644c-0ccf-4a7c-9ae2-09fd1eada1f0', 'taskArn': 'arn:aws:ecs:us-west-2:777968769372:task/rikolti-ecs-cluster/9a86a3a2b74f4b859c29b7983f047218', 'name': 'rikolti-content_harvester', 'image': 'public.ecr.aws/b6c7x7s4/rikolti/content_harvester:latest', 'imageDigest': 'sha256:22357206871a0636ee73331963282e84f527c43aaaf74e267378c68e2cdc85bf', 'runtimeId': '9a86a3a2b74f4b859c29b7983f047218-3530892436', 'lastStatus': 'STOPPED', 'exitCode': 0, 'reason': 'OutOfMemoryError: Container killed due to memory usage', 'networkBindings': [], 'networkInterfaces': [{'attachmentId': 'df00839f-4c84-4485-bb8e-0fbb21127f9b', 'privateIpv4Address': '10.192.21.186'}], 'healthStatus': 'UNKNOWN', 'cpu': '0'}], 'cpu': '1024', 'createdAt': datetime.datetime(2024, 4, 24, 0, 5, 54, 324000, tzinfo=tzlocal()), 'desiredStatus': 'STOPPED', 'enableExecuteCommand': False, 'executionStoppedAt': datetime.datetime(2024, 4, 24, 0, 7, 49, 420000, tzinfo=tzlocal()), 'group': 'family:rikolti-content_harvester-task-definition', 'healthStatus': 'UNKNOWN', 'lastStatus': 'STOPPED', 'launchType': 'FARGATE', 'memory': '3072', 'overrides': {'containerOverrides': [{'name': 'rikolti-content_harvester', 'command': ['26973', '["26973/vernacular_metadata_2024-04-24T00:05:18/mapped_metadata_2024-04-24T00:05:30/data/r-p0.jsonl"]', '26973/vernacular_metadata_2024-04-24T00:05:18/mapped_metadata_2024-04-24T00:05:30/with_content_urls_2024-04-24T00:05:47/', 'nuxeo.nuxeo'], 'environment': [{'name': 'MAPPED_DATA', 'value': 's3://rikolti-data'}, {'name': 'WITH_CONTENT_URL_DATA', 'value': 's3://rikolti-data'}, {'name': 'CONTENT_ROOT', 'value': 's3://rikolti-content'}, {'name': 'NUXEO_USER', 'value': 'Administrator'}, {'name': 'NUXEO_PASS', 'value': 'cable8:ringmasters'}, {'name': 'AWS_RETRY_MODE', 'value': 'standard'}, {'name': 'AWS_MAX_ATTEMPTS', 'value': '10'}]}], 'inferenceAcceleratorOverrides': []}, 'platformVersion': '1.4.0', 'platformFamily': 'Linux', 'pullStartedAt': datetime.datetime(2024, 4, 24, 0, 6, 5, 878000, tzinfo=tzlocal()), 'pullStoppedAt': datetime.datetime(2024, 4, 24, 0, 6, 26, 829000, tzinfo=tzlocal()), 'startedAt': datetime.datetime(2024, 4, 24, 0, 6, 31, 17000, tzinfo=tzlocal()), 'startedBy': 'airflow', 'stopCode': 'EssentialContainerExited', 'stoppedAt': datetime.datetime(2024, 4, 24, 0, 8, 14, 468000, tzinfo=tzlocal()), 'stoppedReason': 'Essential container in task exited', 'stoppingAt': datetime.datetime(2024, 4, 24, 0, 7, 59, 446000, tzinfo=tzlocal()), 'tags': [], 'taskArn': 'arn:aws:ecs:us-west-2:777968769372:task/rikolti-ecs-cluster/9a86a3a2b74f4b859c29b7983f047218', 'taskDefinitionArn': 'arn:aws:ecs:us-west-2:777968769372:task-definition/rikolti-content_harvester-task-definition:6', 'version': 6, 'ephemeralStorage': {'sizeInGiB': 20}}], 'failures': [], 'ResponseMetadata': {'RequestId': 'ffbbbb1c-0e48-4ff4-bd57-e4db88b6ee56', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': 'ffbbbb1c-0e48-4ff4-bd57-e4db88b6ee56', 'content-type': 'application/x-amz-json-1.1', 'content-length': '3143', 'date': 'Wed, 24 Apr 2024 00:08:24 GMT'}, 'RetryAttempts': 0}}
[2024-04-24, 00:08:25 UTC] {{taskinstance.py:1824}} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/usr/local/airflow/dags/rikolti/dags/shared_tasks/content_harvest_operators.py", line 116, in execute
    return super().execute(context)
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/utils/session.py", line 76, in wrapper
    return func(*args, session=session, **kwargs)
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/providers/amazon/aws/operators/ecs.py", line 476, in execute
    self._start_wait_check_task(context)
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/providers/amazon/aws/hooks/base_aws.py", line 743, in decorator_f
    return fun(self, *args, **kwargs)
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/providers/amazon/aws/operators/ecs.py", line 511, in _start_wait_check_task
    self._check_success_task()
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/providers/amazon/aws/operators/ecs.py", line 659, in _check_success_task
    raise AirflowException(
airflow.exceptions.AirflowException: This containers encounter an error during launching: outofmemoryerror: container killed due to memory usage
[2024-04-24, 00:08:25 UTC] {{taskinstance.py:1345}} INFO - Marking task as FAILED. dag_id=harvest_collection, task_id=content_harvesting.content_harvest, map_index=0, execution_date=20240424T000508, start_date=20240424T000552, end_date=20240424T000825
[2024-04-24, 00:08:25 UTC] {{logging_mixin.py:150}} INFO - Message sent to SNS with Message ID: a230118f-e3d1-5cd5-9756-4f2dac685fbe
[2024-04-24, 00:08:25 UTC] {{standard_task_runner.py:104}} ERROR - Failed to execute job 77525 for task content_harvesting.content_harvest (This containers encounter an error during launching: outofmemoryerror: container killed due to memory usage; 846)
[2024-04-24, 00:08:25 UTC] {{local_task_job_runner.py:225}} INFO - Task exited with return code 1
barbarahui commented 5 months ago

This object has a 1GB tiff file: https://nuxeo.cdlib.org/nuxeo/nxdoc/default/83cbedca-e4cd-45ec-b2d1-490d6b9fd951/view_documents?conversationId=0NXMAIN

Do we want to support such enormous image files?

aturner commented 5 months ago

@barbarahui Nuxeo should support importing of large TIFFs as the main content file -- but we've previously leaned on Nuxeo to generate a JPEG derivative to fetch. We somehow got a JPEG derivative to display in CSphere prod for this object (simple image Nuxeo object, from UCB Bancroft), last time we harvested; https://calisphere.org/item/83cbedca-e4cd-45ec-b2d1-490d6b9fd951/ . Did the derivative-generation process somehow fail?

That said -- @christinklez is this one we could just ETL for the time being?

christinklez commented 5 months ago

@aturner we can't ETL this because it is coming from Nuxeo.

barbarahui commented 5 months ago

Hmm yeah, I think the legacy deep harvester grabbed the "medium" sized image from Nuxeo rather than the full-sized image. I'll take a look at the rikolti content harvester code to see what's going on.

It's also possible that the legacy deep harvesting infrastructure supported processing larger files. There's also the option to bump up the memory on the ECS workers.

barbarahui commented 5 months ago

@aturner @christinklez Actually the legacy deep harvester did just grab the main content file. I'm guessing we had to use XL workers for processing this collection/object.

We haven't set up the rikolti harvester to be able to spin up mega-workers just for a specific collection. I guess we could temporarily reconfigure it for just this collection...

barbarahui commented 5 months ago

@aturner @christinklez It just occurred to me that you should be able to ETL this collection since there's only 1 simple object. It's the complex objects that the ETL mapper doesn't support.

christinklez commented 5 months ago

@barbarahui -- Exciting turn of events! I'll test this out now!!!!

christinklez commented 5 months ago

@barbarahui @aturner -- ETL works for Nuxeo simple objects: https://calisphere-stage.cdlib.org/item/83cbedca-e4cd-45ec-b2d1-490d6b9fd951/

However, it's not supported by a viewer, and includes a circular link back to the page we're on.

Given that this is a UCB Bancroft collection, and that UCB is working on moving these collections off Nuxeo to replace them with a TIND reharvest, I don't know if it's worth prioritizing this issue at this moment. (Unless we come across other collections with supersized images in Nuxeo.)

barbarahui commented 5 months ago

Ohh, it's because the frontend determines whether or not the item is hosted based on mapper type. I can update the mapper type in OpenSearch to be nuxeo.nuxeo for this one record. It's a bit hacky, and if you re-ETL this record it will break the viewer again but I think it's an acceptable compromise for now?

christinklez commented 5 months ago

@barbarahui I think that sounds acceptable for this record--I'll record this in our notes! @aturner, what do you think??

aturner commented 5 months ago

This sounds good to me -- as a back-pocket strategy for ETL'ing simple Nuxeo objects into the new index, for the time being!

barbarahui commented 5 months ago

Sadly this doesn't work. I forgot that since we ETLed the collection, the content harvester doesn't create the jp2 and so there isn't anything to display in the image viewer 👎

I vote for leaving this as is for now, if that's acceptable? As you say above, it's probably not worth doing the work to accommodate this object unless we run into a lot more supersized content files?

christinklez commented 5 months ago

I agree that we can move forward with this particular record as is. It's a quirky frontend experience, but it's not completely broken.

Let's consider this a low priority issue for now, and revisit if it happens to come up for other Nuxeo collections. Thank you for looking into this, Barbara!!