ucldc / rikolti

calisphere harvester 2.0

[data provider issue] `oai.quartex` content harvesting `urllib3.exceptions.ResponseError: too many 500 error responses` (3 records from 2 collections) #924

Open christinklez opened 6 months ago

christinklez commented 6 months ago

26213: Two records; no images display for either of them in the source platform.

26500: A PDF is offered for download, but nothing displays in the viewer. I downloaded the file, and it appears to be damaged.

Registry ID: 26213

identifier=8e888f39-c075-4c33-94c0-b32e52a46dfb

identifier=4138dbac-862f-4f01-b09c-37bf6210a819

mapped index 0:

[2024-05-04, 00:41:23 UTC] {{standard_task_runner.py:104}} ERROR - Failed to execute job 105425 for task content_harvesting.content_harvest (This task is not in success state - last 100 logs from Cloudwatch:
urllib3.exceptions.ResponseError: too many 500 error responses
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
           ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/urllib3/connectionpool.py", line 948, in urlopen
    return self.urlopen(
           ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/urllib3/connectionpool.py", line 948, in urlopen
    return self.urlopen(
           ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/urllib3/connectionpool.py", line 948, in urlopen
    return self.urlopen(
           ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/urllib3/connectionpool.py", line 938, in urlopen
    retries = retries.increment(method, url, response=response, _pool=self)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/urllib3/util/retry.py", line 515, in increment
    raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='images.quartexcollections.com', port=443): Max retries exceeded with url: /lmudigitalcollections/thumbnails/preview/8e888f39-c075-4c33-94c0-b32e52a46dfb (Caused by ResponseError('too many 500 error responses'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/content_harvester/by_page.py", line 98, in <module>
    print_value.append(harvest_page_content(
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/content_harvester/by_page.py", line 31, in harvest_page_content
    record_with_content = harvest_record_content(
                          ^^^^^^^^^^^^^^^^^^^^^^^
  File "/content_harvester/by_record.py", line 105, in harvest_record_content
    downloaded_md5 = download_content(request, http, thumbnail.tmp_filepath)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/content_harvester/by_record.py", line 191, in download_content
    response = http.get(**request)
               ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/requests/sessions.py", line 602, in get
    return self.request("GET", url, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/requests/adapters.py", line 510, in send
    raise RetryError(e, request=request)
requests.exceptions.RetryError: HTTPSConnectionPool(host='images.quartexcollections.com', port=443): Max retries exceeded with url: /lmudigitalcollections/thumbnails/preview/8e888f39-c075-4c33-94c0-b32e52a46dfb (Caused by ResponseError('too many 500 error responses'))
Harvesting content for 100 records at 26213/vernacular_metadata_2024-05-04T00:32:46/mapped_metadata_2024-05-04T00:33:53/data/0.jsonl; 4160)
[2024-05-04, 00:41:23 UTC] {{local_task_job_runner.py:225}} INFO - Task exited with return code 1
[2024-05-04, 00:41:23 UTC] {{taskinstance.py:2653}} INFO - 0 downstream tasks scheduled from follow-on schedule check
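
For readers tracing the exception chain above (`ResponseError` -> `MaxRetryError` -> `RetryError`): this is the pattern a `requests` session produces when a urllib3 `Retry` policy lists 500 in `status_forcelist` and every attempt fails. The sketch below is an assumption about how such a session might be configured, not rikolti's actual setup; `build_session`, `fetch_or_skip`, the retry count, the backoff, and the timeout are all hypothetical.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_session(total_retries=3):
    """Return a session that retries 5xx responses before giving up.

    Hypothetical configuration: the values here are illustrative only.
    """
    retry = Retry(
        total=total_retries,
        backoff_factor=1,
        status_forcelist=[500, 502, 503, 504],
    )
    session = requests.Session()
    session.mount("https://", HTTPAdapter(max_retries=retry))
    return session


def fetch_or_skip(http, url, failures):
    """Return the response body, or None if retries were exhausted.

    Only RetryError is handled; other errors (for example a 404 raised
    by raise_for_status) still propagate to the caller.
    """
    try:
        response = http.get(url, timeout=30)
        response.raise_for_status()
        return response.content
    except requests.exceptions.RetryError as exc:
        # Record the failing URL instead of aborting the whole batch.
        failures.append((url, str(exc)))
        return None
```

Handled this way, a thumbnail URL that keeps returning 500 would be recorded in `failures` rather than ending the task for every record on the page. Whether skipping is the right behavior for the harvester is a separate decision.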

mapped index 1:

[2024-05-04, 00:40:51 UTC] {{standard_task_runner.py:104}} ERROR - Failed to execute job 105422 for task content_harvesting.content_harvest (This task is not in success state - last 100 logs from Cloudwatch:
urllib3.exceptions.ResponseError: too many 500 error responses
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
           ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/urllib3/connectionpool.py", line 948, in urlopen
    return self.urlopen(
           ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/urllib3/connectionpool.py", line 948, in urlopen
    return self.urlopen(
           ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/urllib3/connectionpool.py", line 948, in urlopen
    return self.urlopen(
           ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/urllib3/connectionpool.py", line 938, in urlopen
    retries = retries.increment(method, url, response=response, _pool=self)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/urllib3/util/retry.py", line 515, in increment
    raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='images.quartexcollections.com', port=443): Max retries exceeded with url: /lmudigitalcollections/thumbnails/preview/4138dbac-862f-4f01-b09c-37bf6210a819 (Caused by ResponseError('too many 500 error responses'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/content_harvester/by_page.py", line 98, in <module>
    print_value.append(harvest_page_content(
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/content_harvester/by_page.py", line 31, in harvest_page_content
    record_with_content = harvest_record_content(
                          ^^^^^^^^^^^^^^^^^^^^^^^
  File "/content_harvester/by_record.py", line 105, in harvest_record_content
    downloaded_md5 = download_content(request, http, thumbnail.tmp_filepath)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/content_harvester/by_record.py", line 191, in download_content
    response = http.get(**request)
               ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/requests/sessions.py", line 602, in get
    return self.request("GET", url, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/requests/adapters.py", line 510, in send
    raise RetryError(e, request=request)
requests.exceptions.RetryError: HTTPSConnectionPool(host='images.quartexcollections.com', port=443): Max retries exceeded with url: /lmudigitalcollections/thumbnails/preview/4138dbac-862f-4f01-b09c-37bf6210a819 (Caused by ResponseError('too many 500 error responses'))
Harvesting content for 100 records at 26213/vernacular_metadata_2024-05-04T00:32:46/mapped_metadata_2024-05-04T00:33:53/data/1.jsonl; 4169)
[2024-05-04, 00:40:51 UTC] {{local_task_job_runner.py:225}} INFO - Task exited with return code 1

Registry ID: 26500

identifier=19b9c85b-bca8-4970-bf2f-e0bda9562aed

[2024-05-04, 00:47:08 UTC] {{standard_task_runner.py:104}} ERROR - Failed to execute job 105634 for task content_harvesting.content_harvest (This task is not in success state - last 100 logs from Cloudwatch:
urllib3.exceptions.ResponseError: too many 500 error responses
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
           ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/urllib3/connectionpool.py", line 948, in urlopen
    return self.urlopen(
           ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/urllib3/connectionpool.py", line 948, in urlopen
    return self.urlopen(
           ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/urllib3/connectionpool.py", line 948, in urlopen
    return self.urlopen(
           ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/urllib3/connectionpool.py", line 938, in urlopen
    retries = retries.increment(method, url, response=response, _pool=self)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/urllib3/util/retry.py", line 515, in increment
    raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='images.quartexcollections.com', port=443): Max retries exceeded with url: /sonomalibrary/thumbnails/preview/19b9c85b-bca8-4970-bf2f-e0bda9562aed (Caused by ResponseError('too many 500 error responses'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/content_harvester/by_page.py", line 98, in <module>
    print_value.append(harvest_page_content(
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/content_harvester/by_page.py", line 31, in harvest_page_content
    record_with_content = harvest_record_content(
                          ^^^^^^^^^^^^^^^^^^^^^^^
  File "/content_harvester/by_record.py", line 105, in harvest_record_content
    downloaded_md5 = download_content(request, http, thumbnail.tmp_filepath)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/content_harvester/by_record.py", line 191, in download_content
    response = http.get(**request)
               ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/requests/sessions.py", line 602, in get
    return self.request("GET", url, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/requests/adapters.py", line 510, in send
    raise RetryError(e, request=request)
requests.exceptions.RetryError: HTTPSConnectionPool(host='images.quartexcollections.com', port=443): Max retries exceeded with url: /sonomalibrary/thumbnails/preview/19b9c85b-bca8-4970-bf2f-e0bda9562aed (Caused by ResponseError('too many 500 error responses'))
Harvesting content for 76 records at 26500/vernacular_metadata_2024-05-04T00:43:50/mapped_metadata_2024-05-04T00:44:37/data/0.jsonl; 2997)
[2024-05-04, 00:47:08 UTC] {{local_task_job_runner.py:225}} INFO - Task exited with return code 1
[2024-05-04, 00:47:08 UTC] {{taskinstance.py:2653}} INFO - 0 downstream tasks scheduled from follow-on schedule check
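
The "3 records from 2 collections" count in the title can be read off the final `RetryError` line of each traceback. A minimal sketch of that triage step, assuming the log text is available as a string; `failed_thumbnails` and the grouping by first URL path segment are hypothetical helpers, not part of rikolti.

```python
import re
from collections import defaultdict

# Matches the requests-level RetryError line that ends each traceback.
RETRY_ERROR = re.compile(
    r"requests\.exceptions\.RetryError: "
    r"HTTPSConnectionPool\(host='(?P<host>[^']+)', port=\d+\): "
    r"Max retries exceeded with url: (?P<path>\S+)"
)


def failed_thumbnails(log_text):
    """Group failing record identifiers by the first URL path segment."""
    by_site = defaultdict(set)
    for match in RETRY_ERROR.finditer(log_text):
        # Paths in these logs look like /<site>/thumbnails/preview/<identifier>
        parts = match.group("path").strip("/").split("/")
        by_site[parts[0]].add(parts[-1])
    return {site: sorted(ids) for site, ids in by_site.items()}
```

Run against the three log excerpts above, this would group two identifiers under `lmudigitalcollections` (Registry ID 26213) and one under `sonomalibrary` (Registry ID 26500).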

christinklez commented 5 months ago

Both collections have been ETL'd. After the ETL, each collection's Registry record was set back to OAI-PMH.

To do: We still need to contact the contributors to unblock harvesting directly from their platforms.