[data provider issue] `oai.samvera` content harvesting `urllib3.exceptions.ResponseError: too many 500 error responses` - obsolete(?) image server (several records from 1 collection) #923
UCLA collection 28156 has a number of images served by what appears to be an obsolete IIIF image server. Some images are served by UCLA's current IIIF server, but others still point to Cantaloupe (`p-u-cantaloupe01.library.ucla.edu`).
This is a harvest-stopping error.
To do:
[ ] Contact UCLA about this error
[ ] Reharvest 28156
Registry ID: 28156
8 mapped tasks resulting in content harvesting errors
This URL comes from the error log: host `p-u-cantaloupe01.library.ucla.edu`, path `/cantaloupe/iiif/2/Masters%2Fdlmasters%2Fnachrichtenburo%2Fimage%2FFeb_1938%2F6620651_1938-02-15_0210.tif/full/!200,200/0/default.jpg`
Here's the mapped metadata:
{"alternative_title": ["Nachmittags- und Abend-Ausgabe", "Deutsches Nachrichtenbüro"], "calisphere-id": "oai:library.ucla.edu:ark:/21198/zz002br3c7", "campus_data": ["10::UCLA"], "campus_name": ["UCLA"], "campus_url": ["10"], "collection_data": ["28156::Deutsches Nachrichtenbüro"], "collection_name": ["Deutsches Nachrichtenbüro"], "collection_url": ["28156"], "date": ["February 15, 1938", "1938-02-15"], "fetcher_type": ["oai"], "id": "ark:/21198/zz002br3c7", "identifier": ["6620651_1938-02-15_0210", "ark:/21198/zz002br3c7"], "is_shown_at": "https://digital.library.ucla.edu/catalog/ark:/21198/zz002br3c7", "is_shown_by": "https://p-u-cantaloupe01.library.ucla.edu/cantaloupe/iiif/2/Masters%2Fdlmasters%2Fnachrichtenburo%2Fimage%2FFeb_1938%2F6620651_1938-02-15_0210.tif/full/!200,200/0/default.jpg", "item_count": 0, "language": ["German"], "mapper_type": ["oai.samvera"], "media_source": {}, "repository_data": ["34::Library Special Collections, Charles E. Young Research Library::UCLA"], "repository_name": ["Library Special Collections, Charles E. Young Research Library"], "repository_url": ["34"], "rights": ["Please contact the contributing institution for more information regarding the copyright status of this object."], "sort_collection_data": ["deutsches nachrichtenbro::Deutsches Nachrichtenbüro::28156"], "sort_title": "deutsches nachrichtenbro 5 jahrg nr 210 1938 february 15 nachmittags und abendausgabe", "source": ["Deutsches Nachrichtenbüro"], "thumbnail_source": "https://p-u-cantaloupe01.library.ucla.edu/cantaloupe/iiif/2/Masters%2Fdlmasters%2Fnachrichtenburo%2Fimage%2FFeb_1938%2F6620651_1938-02-15_0210.tif/full/!200,200/0/default.jpg", "title": ["Deutsches Nachrichtenbüro. 5 Jahrg., Nr. 210, 1938 February 15, Nachmittags- und Abend-Ausgabe"], "type": ["text"], "url_item": "https://digital.library.ucla.edu/catalog/ark:/21198/zz002br3c7"},
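To get a rough count of how many records in the collection still point at the Cantaloupe host, the mapped metadata pages could be scanned for `thumbnail_source` values on that host. This is only a sketch: `records_on_host` and `OBSOLETE_HOST` are hypothetical names, not part of the harvester, and it assumes each page is JSON Lines with one record per line (as the `data/6.jsonl` paths in the log suggest).

```python
import json
from urllib.parse import urlsplit

# Assumption: the host named in the error log for this issue.
OBSOLETE_HOST = "p-u-cantaloupe01.library.ucla.edu"


def records_on_host(jsonl_lines, host=OBSOLETE_HOST):
    """Yield (calisphere-id, thumbnail_source) for records whose thumbnail is on `host`."""
    for line in jsonl_lines:
        # Tolerate blank lines and a trailing comma, as in the record pasted above.
        line = line.strip().rstrip(",")
        if not line:
            continue
        record = json.loads(line)
        url = record.get("thumbnail_source") or ""
        if urlsplit(url).hostname == host:
            yield record.get("calisphere-id"), url
```

Passing an open page file (or any iterable of lines) to `records_on_host` lists the affected records, which may be useful to include when contacting UCLA.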
[2024-05-04, 00:24:11 UTC] {{standard_task_runner.py:104}} ERROR - Failed to execute job 105196 for task content_harvesting.content_harvest (This task is not in success state - last 100 logs from Cloudwatch:
urllib3.exceptions.ResponseError: too many 500 error responses
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.12/site-packages/requests/adapters.py", line 486, in send
resp = conn.urlopen(
^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/urllib3/connectionpool.py", line 948, in urlopen
return self.urlopen(
^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/urllib3/connectionpool.py", line 948, in urlopen
return self.urlopen(
^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/urllib3/connectionpool.py", line 948, in urlopen
return self.urlopen(
^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/urllib3/connectionpool.py", line 938, in urlopen
retries = retries.increment(method, url, response=response, _pool=self)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/urllib3/util/retry.py", line 515, in increment
raise MaxRetryError(_pool, url, reason) from reason # type: ignore[arg-type]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='p-u-cantaloupe01.library.ucla.edu', port=443): Max retries exceeded with url: /cantaloupe/iiif/2/Masters%2Fdlmasters%2Fnachrichtenburo%2Fimage%2FFeb_1938%2F6620651_1938-02-15_0210.tif/full/!200,200/0/default.jpg (Caused by ResponseError('too many 500 error responses'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/content_harvester/by_page.py", line 98, in <module>
print_value.append(harvest_page_content(
^^^^^^^^^^^^^^^^^^^^^
File "/content_harvester/by_page.py", line 31, in harvest_page_content
record_with_content = harvest_record_content(
^^^^^^^^^^^^^^^^^^^^^^^
File "/content_harvester/by_record.py", line 105, in harvest_record_content
downloaded_md5 = download_content(request, http, thumbnail.tmp_filepath)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/content_harvester/by_record.py", line 191, in download_content
response = http.get(**request)
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/requests/sessions.py", line 602, in get
return self.request("GET", url, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/requests/sessions.py", line 589, in request
resp = self.send(prep, **send_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/requests/sessions.py", line 703, in send
r = adapter.send(request, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/requests/adapters.py", line 510, in send
raise RetryError(e, request=request)
requests.exceptions.RetryError: HTTPSConnectionPool(host='p-u-cantaloupe01.library.ucla.edu', port=443): Max retries exceeded with url: /cantaloupe/iiif/2/Masters%2Fdlmasters%2Fnachrichtenburo%2Fimage%2FFeb_1938%2F6620651_1938-02-15_0210.tif/full/!200,200/0/default.jpg (Caused by ResponseError('too many 500 error responses'))
Harvesting content for 25 records at 28156/vernacular_metadata_2024-05-04T00:07:26/mapped_metadata_2024-05-04T00:09:11/data/6.jsonl
Harvested 25 thumbnail records
25/25 described a thumbnail source
Source Thumbnail Mimetypes: Counter({None: 25})
Destination Thumbnail Mimetypes: Counter({'image/jpeg': 25})
Harvesting content for 25 records at 28156/vernacular_metadata_2024-05-04T00:07:26/mapped_metadata_2024-05-04T00:09:11/data/7.jsonl
Error downloading https://iiif.library.ucla.edu/iiif/2/ark%3A%2F21198%2Fzz002bt53z/full/!200,200/0/default.jpg: 401 Client Error: Unauthorized for url: https://iiif.library.ucla.edu/iiif/2/ark%3A%2F21198%2Fzz002bt53z/full/!200,200/0/default.jpg
ERROR: no thumbnail found for ['text']record oai:library.ucla.edu:ark:/21198/zz002bt53z in page 28156/vernacular_metadata_2024-05-04T00:07:26/mapped_metadata_2024-05-04T00:09:11/data/7.jsonl
Harvested 24 thumbnail records
25/25 described a thumbnail source
Source Thumbnail Mimetypes: Counter({None: 25})
Destination Thumbnail Mimetypes: Counter({'image/jpeg': 24, None: 1})
Harvesting content for 25 records at 28156/vernacular_metadata_2024-05-04T00:07:26/mapped_metadata_2024-05-04T00:09:11/data/8.jsonl
Harvested 25 thumbnail records
25/25 described a thumbnail source
Source Thumbnail Mimetypes: Counter({None: 25})
Destination Thumbnail Mimetypes: Counter({'image/jpeg': 25})
Harvesting content for 25 records at 28156/vernacular_metadata_2024-05-04T00:07:26/mapped_metadata_2024-05-04T00:09:11/data/9.jsonl
Harvested 25 thumbnail records
25/25 described a thumbnail source
Source Thumbnail Mimetypes: Counter({None: 25})
Destination Thumbnail Mimetypes: Counter({'image/jpeg': 25})
Harvesting content for 25 records at 28156/vernacular_metadata_2024-05-04T00:07:26/mapped_metadata_2024-05-04T00:09:11/data/10.jsonl; 3826)
[2024-05-04, 00:24:12 UTC] {{local_task_job_runner.py:225}} INFO - Task exited with return code 1
[2024-05-04, 00:24:12 UTC] {{taskinstance.py:2653}} INFO - 0 downstream tasks scheduled from follow-on schedule check
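Since 8 mapped tasks failed, it may help to pull the failing URL out of each task's log rather than reading the traceback by hand. The sketch below rebuilds the URL from the `HTTPSConnectionPool(host=..., port=...): Max retries exceeded with url: ...` message format seen in the log above. `failing_urls` is a hypothetical helper, and the regular expression is only assumed to match lines shaped like those in this log.

```python
import re

# Matches messages like:
# HTTPSConnectionPool(host='example.org', port=443): Max retries exceeded with url: /path
_POOL_ERROR = re.compile(
    r"(?P<scheme>HTTPS?)ConnectionPool\(host='(?P<host>[^']+)', port=(?P<port>\d+)\): "
    r"Max retries exceeded with url: (?P<path>\S+)"
)
_DEFAULT_PORTS = {"https": "443", "http": "80"}


def failing_urls(log_lines):
    """Return the distinct URLs named in 'Max retries exceeded' log lines, in order seen."""
    seen = []
    for line in log_lines:
        match = _POOL_ERROR.search(line)
        if not match:
            continue
        scheme = match["scheme"].lower()
        host = match["host"]
        # Only spell out the port when it is not the default for the scheme.
        if match["port"] != _DEFAULT_PORTS[scheme]:
            host = f"{host}:{match['port']}"
        url = f"{scheme}://{host}{match['path']}"
        if url not in seen:
            seen.append(url)
    return seen
```

Both the `MaxRetryError` and `RetryError` lines in a traceback name the same URL, so the helper reports it once per log.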
identifier: 6620651_1938-02-15_0210
Here's the OAI: https://digital.library.ucla.edu/catalog/oai?verb=GetRecord&metadataPrefix=oai_dpla&identifier=oai:library.ucla.edu:ark:/21198/zz002br3c7
Which offers this representative image URL: https://p-u-cantaloupe01.library.ucla.edu/cantaloupe/iiif/2/Masters%2Fdlmasters%2Fnachrichtenburo%2Fimage%2FFeb_1938%2F6620651_1938-02-15_0210.tif/full/!200,200/0/default.jpg
From mapped task 1: