ucldc / rikolti

calisphere harvester 2.0
BSD 3-Clause "New" or "Revised" License

[data provider issue] `oai.samvera` content harvesting `urllib3.exceptions.ResponseError: too many 500 error responses` - obsolete(?) image server (several records from 1 collection) #923

Open christinklez opened 7 months ago

christinklez commented 7 months ago

UCLA collection 28156 has several records whose images are served by an obsolete(?) IIIF image server. (Some images are served by UCLA's current IIIF server, but others still point to Cantaloupe.)

This is a harvest-stopping error.

To do:

Registry ID: 28156

identifier: 6620651_1938-02-15_0210

The failing URL, taken from the error log: host `p-u-cantaloupe01.library.ucla.edu` and path `/cantaloupe/iiif/2/Masters%2Fdlmasters%2Fnachrichtenburo%2Fimage%2FFeb_1938%2F6620651_1938-02-15_0210.tif/full/!200,200/0/default.jpg`

Here's the mapped metadata:

{"alternative_title": ["Nachmittags- und Abend-Ausgabe", "Deutsches Nachrichtenbüro"], "calisphere-id": "oai:library.ucla.edu:ark:/21198/zz002br3c7", "campus_data": ["10::UCLA"], "campus_name": ["UCLA"], "campus_url": ["10"], "collection_data": ["28156::Deutsches Nachrichtenbüro"], "collection_name": ["Deutsches Nachrichtenbüro"], "collection_url": ["28156"], "date": ["February 15, 1938", "1938-02-15"], "fetcher_type": ["oai"], "id": "ark:/21198/zz002br3c7", "identifier": ["6620651_1938-02-15_0210", "ark:/21198/zz002br3c7"], "is_shown_at": "https://digital.library.ucla.edu/catalog/ark:/21198/zz002br3c7", "is_shown_by": "https://p-u-cantaloupe01.library.ucla.edu/cantaloupe/iiif/2/Masters%2Fdlmasters%2Fnachrichtenburo%2Fimage%2FFeb_1938%2F6620651_1938-02-15_0210.tif/full/!200,200/0/default.jpg", "item_count": 0, "language": ["German"], "mapper_type": ["oai.samvera"], "media_source": {}, "repository_data": ["34::Library Special Collections, Charles E. Young Research Library::UCLA"], "repository_name": ["Library Special Collections, Charles E. Young Research Library"], "repository_url": ["34"], "rights": ["Please contact the contributing institution for more information regarding the copyright status of this object."], "sort_collection_data": ["deutsches nachrichtenbro::Deutsches Nachrichtenbüro::28156"], "sort_title": "deutsches nachrichtenbro 5 jahrg nr 210 1938 february 15 nachmittags und abendausgabe", "source": ["Deutsches Nachrichtenbüro"], "thumbnail_source": "https://p-u-cantaloupe01.library.ucla.edu/cantaloupe/iiif/2/Masters%2Fdlmasters%2Fnachrichtenburo%2Fimage%2FFeb_1938%2F6620651_1938-02-15_0210.tif/full/!200,200/0/default.jpg", "title": ["Deutsches Nachrichtenbüro. 5 Jahrg., Nr. 210, 1938 February 15, Nachmittags- und Abend-Ausgabe"], "type": ["text"], "url_item": "https://digital.library.ucla.edu/catalog/ark:/21198/zz002br3c7"}, 

Here's the OAI: https://digital.library.ucla.edu/catalog/oai?verb=GetRecord&metadataPrefix=oai_dpla&identifier=oai:library.ucla.edu:ark:/21198/zz002br3c7

Which offers this representative image URL: https://p-u-cantaloupe01.library.ucla.edu/cantaloupe/iiif/2/Masters%2Fdlmasters%2Fnachrichtenburo%2Fimage%2FFeb_1938%2F6620651_1938-02-15_0210.tif/full/!200,200/0/default.jpg
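To get a sense of how many records still point at the Cantaloupe host, something like the following could tally thumbnail hosts across a page of mapped metadata. This is only a sketch: it assumes each page is JSON-lines (one record per line) and that `thumbnail_source` is a plain URL string, as in the record above. The helper names `thumbnail_hosts` and `records_on_host` are hypothetical and not part of rikolti.

```python
import json
from collections import Counter
from urllib.parse import urlsplit


def _thumbnail_host(record):
    """Return the hostname of a record's thumbnail_source, or None."""
    source = record.get("thumbnail_source")
    # Assumption: thumbnail_source is a URL string, as in the sample record.
    if not isinstance(source, str) or not source:
        return None
    return urlsplit(source).hostname


def thumbnail_hosts(lines):
    """Count thumbnail hostnames across an iterable of JSON-lines strings."""
    hosts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        hosts[_thumbnail_host(json.loads(line))] += 1
    return hosts


def records_on_host(lines, host):
    """List the calisphere-id of every record whose thumbnail is on `host`."""
    ids = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if _thumbnail_host(record) == host:
            ids.append(record.get("calisphere-id"))
    return ids


if __name__ == "__main__":
    # Tiny in-memory page; real pages would be read from the mapped_metadata files.
    page = [
        '{"calisphere-id": "rec-1", "thumbnail_source": '
        '"https://p-u-cantaloupe01.library.ucla.edu/cantaloupe/iiif/2/a.tif/full/!200,200/0/default.jpg"}',
        '{"calisphere-id": "rec-2", "thumbnail_source": '
        '"https://iiif.library.ucla.edu/iiif/2/b/full/!200,200/0/default.jpg"}',
    ]
    print(thumbnail_hosts(page))
    print(records_on_host(page, "p-u-cantaloupe01.library.ucla.edu"))
```

The list of IDs returned by `records_on_host` might be a useful attachment when reporting the stale URLs to the data provider.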

From mapped task 1:

[2024-05-04, 00:24:11 UTC] {{standard_task_runner.py:104}} ERROR - Failed to execute job 105196 for task content_harvesting.content_harvest (This task is not in success state - last 100 logs from Cloudwatch:
urllib3.exceptions.ResponseError: too many 500 error responses
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
           ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/urllib3/connectionpool.py", line 948, in urlopen
    return self.urlopen(
           ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/urllib3/connectionpool.py", line 948, in urlopen
    return self.urlopen(
           ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/urllib3/connectionpool.py", line 948, in urlopen
    return self.urlopen(
           ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/urllib3/connectionpool.py", line 938, in urlopen
    retries = retries.increment(method, url, response=response, _pool=self)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/urllib3/util/retry.py", line 515, in increment
    raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='p-u-cantaloupe01.library.ucla.edu', port=443): Max retries exceeded with url: /cantaloupe/iiif/2/Masters%2Fdlmasters%2Fnachrichtenburo%2Fimage%2FFeb_1938%2F6620651_1938-02-15_0210.tif/full/!200,200/0/default.jpg (Caused by ResponseError('too many 500 error responses'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/content_harvester/by_page.py", line 98, in <module>
    print_value.append(harvest_page_content(
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/content_harvester/by_page.py", line 31, in harvest_page_content
    record_with_content = harvest_record_content(
                          ^^^^^^^^^^^^^^^^^^^^^^^
  File "/content_harvester/by_record.py", line 105, in harvest_record_content
    downloaded_md5 = download_content(request, http, thumbnail.tmp_filepath)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/content_harvester/by_record.py", line 191, in download_content
    response = http.get(**request)
               ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/requests/sessions.py", line 602, in get
    return self.request("GET", url, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/requests/adapters.py", line 510, in send
    raise RetryError(e, request=request)
requests.exceptions.RetryError: HTTPSConnectionPool(host='p-u-cantaloupe01.library.ucla.edu', port=443): Max retries exceeded with url: /cantaloupe/iiif/2/Masters%2Fdlmasters%2Fnachrichtenburo%2Fimage%2FFeb_1938%2F6620651_1938-02-15_0210.tif/full/!200,200/0/default.jpg (Caused by ResponseError('too many 500 error responses'))
Harvesting content for 25 records at 28156/vernacular_metadata_2024-05-04T00:07:26/mapped_metadata_2024-05-04T00:09:11/data/6.jsonl
Harvested 25 thumbnail records
25/25 described a thumbnail source
Source Thumbnail Mimetypes: Counter({None: 25})
Destination Thumbnail Mimetypes: Counter({'image/jpeg': 25})
Harvesting content for 25 records at 28156/vernacular_metadata_2024-05-04T00:07:26/mapped_metadata_2024-05-04T00:09:11/data/7.jsonl
Error downloading https://iiif.library.ucla.edu/iiif/2/ark%3A%2F21198%2Fzz002bt53z/full/!200,200/0/default.jpg: 401 Client Error: Unauthorized for url: https://iiif.library.ucla.edu/iiif/2/ark%3A%2F21198%2Fzz002bt53z/full/!200,200/0/default.jpg
ERROR: no thumbnail found for ['text']record oai:library.ucla.edu:ark:/21198/zz002bt53z in page 28156/vernacular_metadata_2024-05-04T00:07:26/mapped_metadata_2024-05-04T00:09:11/data/7.jsonl
Harvested 24 thumbnail records
25/25 described a thumbnail source
Source Thumbnail Mimetypes: Counter({None: 25})
Destination Thumbnail Mimetypes: Counter({'image/jpeg': 24, None: 1})
Harvesting content for 25 records at 28156/vernacular_metadata_2024-05-04T00:07:26/mapped_metadata_2024-05-04T00:09:11/data/8.jsonl
Harvested 25 thumbnail records
25/25 described a thumbnail source
Source Thumbnail Mimetypes: Counter({None: 25})
Destination Thumbnail Mimetypes: Counter({'image/jpeg': 25})
Harvesting content for 25 records at 28156/vernacular_metadata_2024-05-04T00:07:26/mapped_metadata_2024-05-04T00:09:11/data/9.jsonl
Harvested 25 thumbnail records
25/25 described a thumbnail source
Source Thumbnail Mimetypes: Counter({None: 25})
Destination Thumbnail Mimetypes: Counter({'image/jpeg': 25})
Harvesting content for 25 records at 28156/vernacular_metadata_2024-05-04T00:07:26/mapped_metadata_2024-05-04T00:09:11/data/10.jsonl; 3826)
[2024-05-04, 00:24:12 UTC] {{local_task_job_runner.py:225}} INFO - Task exited with return code 1
[2024-05-04, 00:24:12 UTC] {{taskinstance.py:2653}} INFO - 0 downstream tasks scheduled from follow-on schedule check
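In the log above, the 401 from `iiif.library.ucla.edu` appears to be logged and skipped (the page still reports 24 harvested thumbnails), whereas the `requests.exceptions.RetryError` from the Cantaloupe host propagates and ends the task. One possible way to treat both the same is sketched below. It is an illustration under assumptions, not rikolti's actual code: `make_session` and `download_or_skip` are hypothetical names, and the retry settings are guesses chosen only to reproduce the "too many 500 error responses" behavior.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_session(total_retries=3):
    """Build a session that retries on 5xx responses (assumed settings)."""
    retry = Retry(
        total=total_retries,
        status_forcelist=[500, 502, 503, 504],
        backoff_factor=0.5,
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session


def download_or_skip(http, url, timeout=30):
    """Return the response body as bytes, or None if the download fails.

    RetryError (retries exhausted on 5xx) and HTTPError (e.g. 401) both
    subclass RequestException, so a single except clause covers them.
    """
    try:
        response = http.get(url, timeout=timeout)
        response.raise_for_status()
    except requests.exceptions.RequestException as err:
        print(f"Error downloading {url}: {err}")
        return None
    return response.content
```

A caller would then check for `None` and record the item as having no thumbnail instead of failing the whole page. Whether a persistent 500 should be skipped silently or should still fail the harvest is a policy decision this sketch does not settle.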

christinklez commented 6 months ago

This collection has been ETL'd. After the ETL, the Registry record was switched back to OAI-PMH.

To do: We still need to contact UCLA to unblock harvesting directly from their platform.