miller-center / cpc-issues

Connecting Presidential Collections

Adding thumbnail images to item-level records #10

Open sblackford opened 10 years ago

sblackford commented 10 years ago

We would like to add thumbnail images to the item-level records, but we are not sure of the best way to do that. Does Blacklight have any functionality for this?

waldoj commented 10 years ago

For example, consider this record, to select one wholly at random. That's from The Sixth Floor Museum at Dealey Plaza, and can be seen on their site. The image HTML looks like this:

<img height="300" src="/internal/media/dispatcher/17585/resize:format=preview;jsessionid=DB94F3A1C54513B7512C630F55CC29FD" width="300"/>

This is quite easy to extract, because it's the only image on the page with a URL that includes /internal/media/. So we might write a scraper for The Sixth Floor Museum that grabs the HTML of the item URL and extracts the image URL. That ends the bespoke functionality for The Sixth Floor Museum. From there, the URL can be passed off to a common function that trims the whitespace off the image (in our instant case, there's quite a bit), resizes it to our prescribed limits (e.g., no more than 300 pixels in either dimension, 72 dpi, JPEG at 40% quality), and saves it to our repository.
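As a rough illustration of the site-specific step, here is a minimal sketch using only Python's standard library. The class and function names, and the example.org page URL in the usage note, are made up for illustration. The only thing taken from the page itself is the /internal/media/ marker.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class MediaImageFinder(HTMLParser):
    """Collect the src of every <img> whose URL contains a marker string."""

    def __init__(self, marker):
        super().__init__()
        self.marker = marker
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src and self.marker in src:
            self.matches.append(src)


def extract_sixth_floor_image(html, page_url):
    """Return the absolute image URL, or None if no <img> matches.

    Hypothetical helper: assumes the item image is the only one whose
    src contains /internal/media/, as observed on the sample record.
    """
    finder = MediaImageFinder("/internal/media/")
    finder.feed(html)
    if not finder.matches:
        return None
    # The src is site-relative, so resolve it against the item page URL.
    return urljoin(page_url, finder.matches[0])
```

Called as extract_sixth_floor_image(page_html, "https://example.org/objects/17585"), this would hand back one absolute URL for the common processing function, or None if the page layout has changed.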

This whole process is quite trivial.
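The shared trim-and-resize step is mostly geometry. The sketch below shows that geometry in pure Python on a grid of grayscale values. It is an assumption-laden illustration, not a finished implementation: a real version would presumably use an image library to decode, crop, resize, and write the JPEG, and the function names, the white background value, and the tolerance are all invented here.

```python
def content_bbox(pixels, background=255, tolerance=8):
    """Bounding box (left, top, right, bottom) of non-background pixels.

    `pixels` is a list of rows of grayscale values (0-255). The right
    and bottom edges are exclusive. Returns None if every pixel is
    within `tolerance` of `background` (an assumed near-white margin).
    """
    rows = [y for y, row in enumerate(pixels)
            if any(abs(v - background) > tolerance for v in row)]
    if not rows:
        return None
    cols = [x for x in range(len(pixels[0]))
            if any(abs(row[x] - background) > tolerance for row in pixels)]
    return (cols[0], rows[0], cols[-1] + 1, rows[-1] + 1)


def fit_within(width, height, limit=300):
    """Scale a size down so neither side exceeds `limit`.

    Keeps the aspect ratio and never upscales a smaller image.
    """
    scale = min(1.0, limit / max(width, height))
    return (max(1, round(width * scale)), max(1, round(height * scale)))
```

The crop box from content_bbox would be applied first, and the cropped size then passed to fit_within to get the final thumbnail dimensions. The dpi and JPEG quality settings are left out because they belong to whatever library does the encoding.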

A TR Center record is just as easy. The image HTML is <img src="http://trcimages.dataformat.com/images/LC/1600/029/0200/0259.jpg" width="344" alt="" />, and while there's nothing about the image URL that's unique, it's the only image within div.recordInfo, so it's simple to extract.
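Scoping the search to div.recordInfo could be sketched like this with the standard library parser. The names are hypothetical, and the sketch assumes the page's div tags are properly nested and closed.

```python
from html.parser import HTMLParser


class RecordInfoImageFinder(HTMLParser):
    """Collect <img> src values that appear inside div.recordInfo."""

    def __init__(self):
        super().__init__()
        self.depth = 0  # div nesting depth inside div.recordInfo; 0 = outside
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            classes = (attrs.get("class") or "").split()
            # Start counting at the recordInfo div, and keep counting
            # nested divs so we know when we have left it again.
            if self.depth or "recordInfo" in classes:
                self.depth += 1
        elif tag == "img" and self.depth and attrs.get("src"):
            self.images.append(attrs["src"])

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1


def extract_tr_center_image(html):
    """Return the first image URL inside div.recordInfo, or None."""
    finder = RecordInfoImageFinder()
    finder.feed(html)
    return finder.images[0] if finder.images else None
```

Images in the page header or footer are ignored because they sit outside the tracked div, which is what makes the otherwise unremarkable image URL safe to pick.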

Extracting data from the Massachusetts Historical Society's Adams collection is a bit trickier, because it's not very well structured (e.g.). It's solvable in the same ways, but it'll require a bit more research. Worst case, we need to store a list of all of their template images (the organization's logo, etc.), grab the HTML of the record, extract all of the image references, eliminate anything that's found in the template, and assume that the largest remaining image is our thumbnail. For poorly structured sites, that's not bad. To be clear, I don't think that would be required for MHS, but it would work.
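The worst-case fallback described above might look roughly like the following. Everything named here is hypothetical: template_images is the stored set of known template image URLs, and get_size stands in for whatever lookup returns an image's (width, height), which in practice would mean fetching the image or its headers.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Collect the src of every <img> on the page, in document order."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def guess_thumbnail(html, page_url, template_images, get_size):
    """Pick the largest image on the page that is not a template image.

    `template_images` is a set of absolute URLs to ignore; `get_size`
    is a caller-supplied function mapping a URL to (width, height).
    Returns None if only template images are present.
    """
    collector = ImageCollector()
    collector.feed(html)
    candidates = []
    for src in collector.sources:
        url = urljoin(page_url, src)  # normalize relative references
        if url not in template_images:
            candidates.append(url)
    if not candidates:
        return None

    def area(url):
        width, height = get_size(url)
        return width * height

    return max(candidates, key=area)
```

Passing the size lookup in as a function keeps the selection logic testable without network access, and lets the same routine serve any other poorly structured site.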