ucldc / rikolti

calisphere harvester 2.0
BSD 3-Clause "New" or "Revised" License

[data provider issue] `oai.samvera` records with blank thumbnails -- `isShownBy` URL formatting (9 records from 5 collections) #921

Open christinklez opened 4 months ago

christinklez commented 4 months ago

Mapper: oai.samvera

Problem: The OAI records for certain items do not include the `/full/!200,200/0/default.jpg` tail in the `isShownBy` URL. Adding the tail manually yields a URL that points to a working image file.

Observation: This seems to occur for select records whose ARKs contain the character sequence `png` or `jpg`.

Proposed next step: We need to ask UCLA to fix these URLs.

Note: This is not a harvest-stopping error, and these collections are currently on -stage! The records in question show a grey tile in place of a thumbnail.

To Do:

Registry ID: 153 - errors in 4 mapped index

oai:library.ucla.edu:ark:/21198/zz002dcpng

oai:library.ucla.edu:ark:/21198/zz0002pngj

oai:library.ucla.edu:ark:/21198/zz0002nzwn

oai:library.ucla.edu:ark:/21198/zz002ctjpg

oai:library.ucla.edu:ark:/21198/zz002cpng1

Registry ID: 154 - errors in 2 mapped index

oai:library.ucla.edu:ark:/21198/zz00280jpg

oai:library.ucla.edu:ark:/21198/zz00288png

Registry ID: 28108 - errors in 1 mapped index

oai:library.ucla.edu:ark:/21198/zz002hpngp

Registry ID: 28111 - errors in 1 mapped index

Run ID: manual__2024-05-04T00:00:02+00:00

oai:library.ucla.edu:ark:/21198/zz0025pngj

Registry ID: 28230 - error in 1 mapped index

oai:library.ucla.edu:ark:/21198/zz0002fpng
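The observation about `png` / `jpg` in the ARK can be checked quickly against the identifiers listed above. This is only a sketch: the helper name `ark_contains_image_text` is made up for illustration and is not part of rikolti.

```python
# Record identifiers copied from the list above.
record_ids = [
    "oai:library.ucla.edu:ark:/21198/zz002dcpng",
    "oai:library.ucla.edu:ark:/21198/zz0002pngj",
    "oai:library.ucla.edu:ark:/21198/zz0002nzwn",
    "oai:library.ucla.edu:ark:/21198/zz002ctjpg",
    "oai:library.ucla.edu:ark:/21198/zz002cpng1",
    "oai:library.ucla.edu:ark:/21198/zz00280jpg",
    "oai:library.ucla.edu:ark:/21198/zz00288png",
    "oai:library.ucla.edu:ark:/21198/zz002hpngp",
    "oai:library.ucla.edu:ark:/21198/zz0025pngj",
    "oai:library.ucla.edu:ark:/21198/zz0002fpng",
]


def ark_contains_image_text(record_id, needles=("png", "jpg")):
    """Return True if the last ARK segment contains any of the needles.

    Hypothetical helper, for illustration only.
    """
    ark_name = record_id.rsplit("/", 1)[-1]
    return any(needle in ark_name for needle in needles)


matching = [r for r in record_ids if ark_contains_image_text(r)]
not_matching = [r for r in record_ids if not ark_contains_image_text(r)]

print(f"{len(matching)} of {len(record_ids)} identifiers contain png or jpg")
print("exceptions:", not_matching)
```

Any identifier reported under `exceptions` does not fit the pattern and may be worth checking separately.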

barbarahui commented 4 months ago

@christinklez You can investigate these by downloading the mapped metadata, searching for the item, and checking the is_shown_by. For example, for the first error, I can see in the log that the mapped metadata file is 154/vernacular_metadata_2024-05-03T23:59:02/mapped_metadata_2024-05-04T00:01:07/data/80.jsonl.

I download this file from S3 and search for oai:library.ucla.edu:ark:/21198/zz00280jpg. The is_shown_by for this record is https://iiif.library.ucla.edu/iiif/2/ark%3A%2F21198%2Fzz00280jpg. When I go to this URL in a browser, I get an info.json page. This is not actually an image, which is what causes the content harvester to throw an error.
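For anyone repeating this lookup, here is a minimal sketch of searching a downloaded mapped metadata file for one record. It assumes the file holds one JSON object per line and that the identifier lives in a field named `id`; both are assumptions, so adjust them to match the actual file. Only `is_shown_by` is taken from this thread.

```python
import json
from typing import Iterable, Optional


def find_record(lines: Iterable[str], record_id: str,
                id_field: str = "id") -> Optional[dict]:
    """Return the first record whose id_field equals record_id, else None.

    Assumes one JSON object per line; `id_field` is a hypothetical name.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if record.get(id_field) == record_id:
            return record
    return None


# In-memory stand-in for the downloaded file (illustrative data only).
sample_lines = [
    json.dumps({
        "id": "oai:library.ucla.edu:ark:/21198/zz00280jpg",
        "is_shown_by": "https://iiif.library.ucla.edu/iiif/2/"
                       "ark%3A%2F21198%2Fzz00280jpg",
    }),
]

found = find_record(sample_lines, "oai:library.ucla.edu:ark:/21198/zz00280jpg")
if found is not None:
    print(found.get("is_shown_by"))

# With a real file, pass the open file handle instead of sample_lines:
# with open("80.jsonl", encoding="utf-8") as fh:
#     found = find_record(fh, "oai:library.ucla.edu:ark:/21198/zz00280jpg")
```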

I'm not sure if this is a provider issue or a rikolti mapper error? I'm guessing it's a provider issue since we were able to get thumbnails for most of the items in this collection.

barbarahui commented 4 months ago

For collection 154, most of the objects have is_shown_by URLs with this kind of format:

https://iiif.library.ucla.edu/iiif/2/ark%3A%2F21198%2Fzz00288pmz/full/!200,200/0/default.jpg

The 2 objects that are failing have these is_shown_by URLs:

oai:library.ucla.edu:ark:/21198/zz00280jpg - https://iiif.library.ucla.edu/iiif/2/ark%3A%2F21198%2Fzz00280jpg

oai:library.ucla.edu:ark:/21198/zz00288png - https://iiif.library.ucla.edu/iiif/2/ark%3A%2F21198%2Fzz00288png

If I add /full/!200,200/0/default.jpg to the end, then I get a thumbnail image:

oai:library.ucla.edu:ark:/21198/zz00280jpg - https://iiif.library.ucla.edu/iiif/2/ark%3A%2F21198%2Fzz00280jpg/full/!200,200/0/default.jpg

oai:library.ucla.edu:ark:/21198/zz00288png - https://iiif.library.ucla.edu/iiif/2/ark%3A%2F21198%2Fzz00288png/full/!200,200/0/default.jpg
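The manual fix above could be expressed as a small check that appends the tail only when it is missing. This is a sketch of a possible workaround, not something rikolti does today; `ensure_thumbnail_tail` and the regular expression are illustrative, and the pattern assumes the usual IIIF Image API request shape of `/{region}/{size}/{rotation}/{quality}.{format}`.

```python
import re

# Tail used by the working URLs in this collection.
IIIF_TAIL = "/full/!200,200/0/default.jpg"

# Matches a URL that already ends in an IIIF image request:
# /{region}/{size}/{rotation}/{quality}.{format}
_IMAGE_REQUEST = re.compile(
    r"/[^/]+/[^/]+/[^/]+/(?:default|color|gray|bitonal)\.[A-Za-z0-9]+$"
)


def ensure_thumbnail_tail(url: str) -> str:
    """Append the thumbnail tail if the URL is a bare IIIF image base URL.

    Hypothetical helper, for illustration only.
    """
    url = url.rstrip("/")
    if _IMAGE_REQUEST.search(url):
        return url  # already a full image request
    return url + IIIF_TAIL


bare = "https://iiif.library.ucla.edu/iiif/2/ark%3A%2F21198%2Fzz00280jpg"
full = ("https://iiif.library.ucla.edu/iiif/2/"
        "ark%3A%2F21198%2Fzz00288pmz/full/!200,200/0/default.jpg")

print(ensure_thumbnail_tail(bare))
print(ensure_thumbnail_tail(full))
```

Note that the check looks at the path structure rather than at whether the URL ends in `jpg` or `png`, since the failing identifiers themselves end in those characters.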

christinklez commented 4 months ago

Look at the vernacular metadata to review the isShownBy thumbnail URLs.

christinklez commented 4 months ago

I looked at the OAI for each of these records and can confirm that they do not have /full/!200,200/0/default.jpg at the tail of the IIIF URLs. We can plan to send a heads-up to UCLA about these records.

In the meantime, we can consider ETL'ing these collections instead.

christinklez commented 4 months ago

Thanks to #899, we're able to get these collections through to -stage! But these records (as expected) have broken thumbnails.

153

(map index 4, 5, 14, 35)

https://calisphere-stage.cdlib.org/item/ark:/21198/zz002dcpng

https://calisphere-stage.cdlib.org/item/ark:/21198/zz0002pngj/

https://calisphere-stage.cdlib.org/item/ark:/21198/zz002cpng1/

154

(map index 16, 37)

https://calisphere-stage.cdlib.org/item/ark:/21198/zz00280jpg/

https://calisphere-stage.cdlib.org/item/ark:/21198/zz00288png/

28108

(map index 5)

https://calisphere-stage.cdlib.org/item/ark:/21198/zz002hpngp/

28111

(map index 23)

https://calisphere-stage.cdlib.org/item/ark:/21198/zz0025pngj/

28230

(map index 29)

https://calisphere-stage.cdlib.org/item/ark:/21198/zz0002fpng/