Open multichill opened 7 years ago
Or at least a way to cross reference product id with an image id. Eg: item id 802 > DP104414(.jpg)
Agreed—this would make it easier to build cool apps on top of this file!
I'd like to reiterate the need to make the images more accessible and harvestable. I can't even screen-scrape the associated thumbnail and high-res image files out of the HTML page, given the object URI. Could you at least provide documentation on how to use your internal additionalImages service?
I've done some work with this dataset that may be of use. It includes a simple scraper that will grab the full sized image links as well as collection details. I've also converted the CSV file into a simple MySQL table and imported the data into it. The links and collection details are available for items classified as paintings but you can easily modify it for your needs. I've put it up at https://github.com/avitalp/metmuseum-oa-explore
For those in the JS ecosystem: I've created a module that will take an Object ID (fourth column in the CSV) and give you the image URL, if available:
I've created a small service that will give you the URL for an Object ID:
You can use it like:
to be redirected to the image link or
https://met.juanbox.co/api/437853
to get JSON data instead.
More info in the repo.
Nice @sotojuan, but you don't seem to handle pages like http://www.metmuseum.org/art/collection/search/459119 ; see http://met.juanbox.co/api/459119 . I would expect one default URL and a list of all the URLs.
Ah I probably messed something up. Thanks—I needed more user testing :-) I'll see what I can do after work.
@multichill I actually just had my regex wrong—that link should be working now. Thanks :-)
@sotojuan Just clicked around a bit and noticed http://www.metmuseum.org/api/Collection/additionalImages?crdId=437853 and http://www.metmuseum.org/api/Collection/additionalImages?crdId=459119 . That might be even easier!
CSV is fun, but an api that returns JSON is more fun. See http://www.metmuseum.org/api/collection/collectionlisting?artist=&department=&era=&geolocation=&material=Paintings&offset=0&pageSize=0&perPage=20&showOnly=&sortBy=Relevance&sortOrder=asc
Bit of reverse engineering, but quite easy to figure out
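For anyone who wants to script against those two endpoints, here is a rough sketch of building the URLs in Python. The endpoint paths and parameter names are copied from the links above. The function names are made up for illustration, and the assumption that `material`, `offset` and `perPage` are the parameters worth varying is mine. These are undocumented internal endpoints, so they may change or be blocked.

```python
from urllib.parse import urlencode

# Endpoints as quoted in the comments above (undocumented, may change).
LISTING_ENDPOINT = "http://www.metmuseum.org/api/collection/collectionlisting"
ADDITIONAL_IMAGES_ENDPOINT = "http://www.metmuseum.org/api/Collection/additionalImages"


def build_listing_url(material="Paintings", offset=0, per_page=20):
    """Rebuild the listing URL with the same parameters, in the same order, as the example link."""
    params = [
        ("artist", ""),
        ("department", ""),
        ("era", ""),
        ("geolocation", ""),
        ("material", material),
        ("offset", offset),
        ("pageSize", 0),
        ("perPage", per_page),
        ("showOnly", ""),
        ("sortBy", "Relevance"),
        ("sortOrder", "asc"),
    ]
    return LISTING_ENDPOINT + "?" + urlencode(params)


def build_additional_images_url(object_id):
    """Build the additionalImages URL; crdId takes the Object ID, as in the links above."""
    return ADDITIONAL_IMAGES_ENDPOINT + "?" + urlencode({"crdId": object_id})


if __name__ == "__main__":
    print(build_listing_url())
    print(build_additional_images_url(437853))
```

With the defaults, `build_listing_url()` reproduces the example link exactly. What `offset` and `pageSize` actually do on the server side is not something I have verified.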
Cool, and that's also faster—I'll update both the site and the npm module to use this. Thanks again.
I just made available a CSV file which links Met Object IDs to Open Access image URLs. It's available here:
Looks like scraping the page is not an option: https://www.incapsula.com/website-security/access-control.html
@hamidzr I'm not sure that I follow. Have you seen anti-scraping measures when downloading CC0 images from the Met?
@gregsadetsky If I fetch it manually, no, but if I have a program that tries to automate the process, it gets blocked after getting one or two images. I'm developing an "art service" for a nonprofit educational platform and was hoping to integrate the open-access data from metmuseum, but the metadata without a way of getting image URLs would not be as interesting to students for art projects. Here is part of what I was working on: https://github.com/NetsBlox/NetsBlox/tree/1451-metmuseum/src/server/rpc/procedures/met-museum
For example, if I try the collection listing API mentioned here through curl or Postman, I get:
<html>
<head>
<META NAME="robots" CONTENT="noindex,nofollow">
<script src="/_Incapsula_Resource?SWJIYLWA=5074a744e2e3d891814e9a2dace20bd4,719d34d31c8e3a6e6fffd425f7e032f3">
</script>
<body>
</body>
</html>
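A client can at least tell this block page apart from a real JSON response before trying to parse it. Below is a minimal sketch. The marker string `_Incapsula_Resource` is taken from the response pasted above. `BlockedResponse` and `parse_listing_response` are hypothetical names, and checking for that one marker is a heuristic, not a documented contract.

```python
import json

# Marker seen in the script tag of the block page pasted above.
INCAPSULA_MARKER = "_Incapsula_Resource"


class BlockedResponse(Exception):
    """Raised when the body looks like the Incapsula challenge page instead of JSON."""


def parse_listing_response(body):
    """Return the parsed JSON body, or raise BlockedResponse for the challenge page."""
    if INCAPSULA_MARKER in body:
        raise BlockedResponse("got an Incapsula challenge page instead of JSON")
    # Any other non-JSON body raises json.JSONDecodeError as usual.
    return json.loads(body)


if __name__ == "__main__":
    sample = '<html><script src="/_Incapsula_Resource?SWJIYLWA=..."></script></html>'
    try:
        parse_listing_response(sample)
    except BlockedResponse as exc:
        print("blocked:", exc)
```

This only detects the block; it does nothing to get around it.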
Have you seen anti-scraping measures when downloading CC0 images from the Met?
So you could say I haven't seen it when downloading the image, because I can't programmatically get to the image URL, at least not without jumping through hoops. Since the museum promotes open access, I was hoping they would provide a clear-cut, transparent way of getting those images programmatically.
@hamidzr I agree that it's strange that the Met provides this open CSV with no references to the image URLs..!
You could use this repo, which I created about a year ago and which links the Met object IDs with image URLs. I haven't updated it since, so there are probably new object IDs in the latest CSV for which the repo doesn't provide image URLs (it should still be useful, as it contains ~380k URLs).
Thanks, that will certainly cover a large portion of the objects. I assume you didn't manually pull all this information. Given the current situation, do you think you could rerun your solution to update your dataset?
Just opened an issue on the repo to track this. I'll try to find the time! Cheers
FYI the met has an API now, see https://metmuseum.github.io/ . So for example https://collectionapi.metmuseum.org/public/collection/v1/objects/436535 includes links to images.
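Here is a small sketch of pulling the image links out of that object endpoint, returning one default URL plus the list of all URLs as asked for earlier in this thread. The URL pattern comes from the comment above. The field names `primaryImage` and `additionalImages` are how I understand the object response from the API docs, so treat them as an assumption and check against https://metmuseum.github.io/ . The function names are made up.

```python
import json
from urllib.request import urlopen

# Base URL taken from the example object link above.
API_BASE = "https://collectionapi.metmuseum.org/public/collection/v1/objects"


def object_url(object_id):
    """Build the object endpoint URL for a Met Object ID."""
    return f"{API_BASE}/{object_id}"


def image_urls(record):
    """Return (default_url, all_urls) from a decoded object record.

    Assumes the record uses 'primaryImage' (string, possibly empty) and
    'additionalImages' (list of strings); missing or empty values are skipped.
    """
    primary = record.get("primaryImage") or None
    extras = [u for u in (record.get("additionalImages") or []) if u]
    all_urls = ([primary] if primary else []) + extras
    return primary, all_urls


def fetch_object(object_id, timeout=10):
    """Fetch and decode one object record (performs a network request)."""
    with urlopen(object_url(object_id), timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    default, everything = image_urls(fetch_object(436535))
    print(default)
    print(len(everything), "image URL(s) in total")
```

Objects without open-access images should simply come back with no default URL and an empty list.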
@multichill Thanks! I had seen their search API, but that is restricted from bot access, so I didn't look further. This one looks open.
Hey all, I've gone through the most recent Met CSV file and extracted all image URLs. I've also created a new repository for this, as I did something similar for the Art Institute of Chicago, which also publishes data through Open Access, but similarly does not make it easy to find the image urls.
You can find all of this here -- https://github.com/gregsadetsky/open-access-is-great-but-where-are-the-images
+1 Although @gregsadetsky's solution works, I can't believe we need to scrape 500,000 pages to get image URLs.
Awesome that you released the dataset. Had to use scrapi in the past and this is much better!
Taking http://metmuseum.org/art/collection/search/435809 as an example, this is the line:
19.164,True,True,435809,European Paintings,Painting,The Harvesters,,,,,,Artist,,Pieter Bruegel the Elder,"Netherlandish, Breda (?) ca. 1525–1569 Brussels",,"Bruegel, Pieter, the Elder",Netherlandish,"1525 ","1569 ",1565,1565,1565,Oil on wood,"Overall, including added strips at top, bottom, and right, 46 7/8 x 63 3/4 in. (119 x 162 cm); original painted surface 45 7/8 x 62 7/8 in. (116.5 x 159.5 cm)","Rogers Fund, 1919",,,,,,,,,,,,Paintings,,http://www.metmuseum.org/art/collection/search/435809,2/6/2017 8:00:16 AM,"Metropolitan Museum of Art, New York, NY"
This doesn't include a link to the high resolution cc0 image, in this case http://images.metmuseum.org/CRDImages/ep/original/DP119115.jpg . Would it be possible to include this url as an additional field? Probably best to only do it for records where "Is Public Domain" is set to True.
This makes it easier to share the images on Wikimedia Commons and illustrate things like https://www.wikidata.org/wiki/Wikidata:WikiProject_sum_of_all_paintings/Collection/Metropolitan_Museum_of_Art
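To illustrate which records the request would apply to, here is a sketch that reads rows like the one above and keeps the Object IDs where "Is Public Domain" is True. The column positions are an assumption read off the sample row (third column for "Is Public Domain", fourth for the Object ID), and the sample is shortened to its first seven columns. The function names are made up.

```python
import csv
import io

# Column positions assumed from the sample row quoted above.
IS_PUBLIC_DOMAIN_COL = 2
OBJECT_ID_COL = 3

# First seven columns of the sample row for "The Harvesters".
SAMPLE = "19.164,True,True,435809,European Paintings,Painting,The Harvesters"


def public_domain_object_ids(csv_text):
    """Return Object IDs of rows whose 'Is Public Domain' column is exactly 'True'."""
    ids = []
    for row in csv.reader(io.StringIO(csv_text)):
        # Short rows and the header row (where the column is not 'True') are skipped.
        if len(row) > OBJECT_ID_COL and row[IS_PUBLIC_DOMAIN_COL] == "True":
            ids.append(int(row[OBJECT_ID_COL]))
    return ids


def collection_page_url(object_id):
    """Build the collection page URL in the form used in the example above."""
    return f"http://www.metmuseum.org/art/collection/search/{object_id}"


if __name__ == "__main__":
    for object_id in public_domain_object_ids(SAMPLE):
        print(object_id, collection_page_url(object_id))
```

Using `csv.reader` rather than splitting on commas matters here, because fields such as the dimensions are quoted and contain commas.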