metmuseum / openaccess

The Metropolitan Museum of Art's Open Access Initiative
Creative Commons Zero v1.0 Universal

Include image url (for cc0 images?) #2

Open multichill opened 7 years ago

multichill commented 7 years ago

Awesome that you released the dataset. Had to use scrapi in the past and this is much better!

Taking http://metmuseum.org/art/collection/search/435809 as an example, this is the corresponding line in the CSV:

19.164,True,True,435809,European Paintings,Painting,The Harvesters,,,,,,Artist,,Pieter Bruegel the Elder,"Netherlandish, Breda (?) ca. 1525–1569 Brussels",,"Bruegel, Pieter, the Elder",Netherlandish,"1525 ","1569 ",1565,1565,1565,Oil on wood,"Overall, including added strips at top, bottom, and right, 46 7/8 x 63 3/4 in. (119 x 162 cm); original painted surface 45 7/8 x 62 7/8 in. (116.5 x 159.5 cm)","Rogers Fund, 1919",,,,,,,,,,,,Paintings,,http://www.metmuseum.org/art/collection/search/435809,2/6/2017 8:00:16 AM,"Metropolitan Museum of Art, New York, NY"

This doesn't include a link to the high resolution cc0 image, in this case http://images.metmuseum.org/CRDImages/ep/original/DP119115.jpg . Would it be possible to include this url as an additional field? Probably best to only do it for records where "Is Public Domain" is set to True.

This makes it easier to share the images on Wikimedia Commons and illustrate things like https://www.wikidata.org/wiki/Wikidata:WikiProject_sum_of_all_paintings/Collection/Metropolitan_Museum_of_Art
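As a rough illustration of the "only for public domain records" idea, here is a minimal Python sketch that reads CSV rows and keeps the Object IDs flagged as public domain. The column positions are assumptions read off the example row (third field for "Is Public Domain", fourth for the Object ID) and should be checked against the real header; the sample string is just the leading fields of the row quoted above, and the function name is made up for this example.

```python
import csv
import io

# Leading fields of the example row quoted above (the rest of the row is omitted).
SAMPLE = "19.164,True,True,435809,European Paintings,Painting,The Harvesters\n"

# Assumed zero-based column positions; verify them against the CSV header.
IS_PUBLIC_DOMAIN_COL = 2
OBJECT_ID_COL = 3


def public_domain_object_ids(csv_text):
    """Return the Object IDs of rows whose public-domain flag is 'True'."""
    ids = []
    for row in csv.reader(io.StringIO(csv_text)):
        # Skip short or malformed rows instead of raising IndexError.
        if len(row) <= OBJECT_ID_COL:
            continue
        if row[IS_PUBLIC_DOMAIN_COL].strip() == "True":
            ids.append(row[OBJECT_ID_COL])
    return ids


print(public_domain_object_ids(SAMPLE))
```

Using the csv module rather than splitting on commas matters here, because fields such as the dimensions and the credit line contain quoted commas.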

andreitr commented 7 years ago

Or at least provide a way to cross-reference a product id with an image id, e.g. item id 802 > DP104414(.jpg).

sotojuan commented 7 years ago

Agreed—this would make it easier to build cool apps on top of this file!

ewg118 commented 7 years ago

I'd like to reiterate the need to make the images more accessible and harvestable. I can't even screen-scrape the associated thumbnail and high-res image files out of the HTML page, given the object URI. Could you at least provide documentation on how to use your internal additionalImages service?

avitalp commented 7 years ago

I've done some work with this dataset that may be of use. It includes a simple scraper that will grab the full sized image links as well as collection details. I've also converted the CSV file into a simple MySQL table and imported the data into it. The links and collection details are available for items classified as paintings but you can easily modify it for your needs. I've put it up at https://github.com/avitalp/metmuseum-oa-explore

sotojuan commented 7 years ago

For those in the JS ecosystem: I've created a module that will take an Object ID (fourth column in the CSV) and give you the image URL, if available:

https://github.com/sotojuan/get-met-url

sotojuan commented 7 years ago

I've created a small service that will give you the URL for an Object ID:

You can use it like:

https://met.juanbox.co/437853

to be redirected to the image link or

https://met.juanbox.co/api/437853

to get JSON data instead.

More info in the repo.

multichill commented 7 years ago

Nice @sotojuan , but you don't seem to be able to handle pages like http://www.metmuseum.org/art/collection/search/459119 , see http://met.juanbox.co/api/459119 . I would expect one default url and a list with all the urls.

sotojuan commented 7 years ago

Ah I probably messed something up. Thanks—I needed more user testing :-) I'll see what I can do after work.

sotojuan commented 7 years ago

@multichill I actually just had my regex wrong—that link should be working now. Thanks :-)

multichill commented 7 years ago

@sotojuan Just clicked around a bit and noticed http://www.metmuseum.org/api/Collection/additionalImages?crdId=437853 and http://www.metmuseum.org/api/Collection/additionalImages?crdId=459119 . That might be even easier!

CSV is fun, but an api that returns JSON is more fun. See http://www.metmuseum.org/api/collection/collectionlisting?artist=&department=&era=&geolocation=&material=Paintings&offset=0&pageSize=0&perPage=20&showOnly=&sortBy=Relevance&sortOrder=asc

It takes a bit of reverse engineering, but it is quite easy to figure out.
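For anyone who wants to build these request URLs from code, a small Python sketch follows. It only reproduces the two query strings quoted in this comment; the endpoints are undocumented and may change or disappear, and the function names and defaults are made up for this example.

```python
from urllib.parse import urlencode

# Base path taken from the URLs quoted above (undocumented, may change).
BASE = "http://www.metmuseum.org/api"


def additional_images_url(object_id):
    """Build the additionalImages URL for one object ID."""
    return f"{BASE}/Collection/additionalImages?" + urlencode({"crdId": object_id})


def collection_listing_url(material="Paintings", offset=0, per_page=20):
    """Build a collectionlisting URL mirroring the query string quoted above."""
    params = {
        "artist": "",
        "department": "",
        "era": "",
        "geolocation": "",
        "material": material,
        "offset": offset,
        "pageSize": 0,
        "perPage": per_page,
        "showOnly": "",
        "sortBy": "Relevance",
        "sortOrder": "asc",
    }
    return f"{BASE}/collection/collectionlisting?" + urlencode(params)


print(additional_images_url(437853))
print(collection_listing_url())
```

Paging through results would presumably mean increasing offset in steps of perPage, but that is a guess from the parameter names and not something the thread confirms.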

sotojuan commented 7 years ago

Cool, and that's also faster—I'll update both the site and the npm module to use this. Thanks again.

gregsadetsky commented 6 years ago

I just made available a CSV file which links Met Object IDs to Open Access image URLs. It's available here:

https://github.com/gregsadetsky/met-openaccess-images

hamidzr commented 5 years ago

Looks like scraping the page is not an option: https://www.incapsula.com/website-security/access-control.html

gregsadetsky commented 5 years ago

@hamidzr I'm not sure that I follow. Have you seen anti-scraping measures when downloading CC0 images from the Met?

hamidzr commented 5 years ago

@gregsadetsky If I'm getting it manually, no, but if I have a program that tries to automate the process, it gets blocked after fetching one or two images. I'm developing an "art service" for a nonprofit educational platform, and I was hoping to integrate the open-access data from metmuseum, but the metadata without a way of getting image URLs would not be as interesting for students' art projects. Here is part of what I was working on: https://github.com/NetsBlox/NetsBlox/tree/1451-metmuseum/src/server/rpc/procedures/met-museum

For example, if I try the collection listing API mentioned here through curl or Postman, I get:

<html>
    <head>
        <META NAME="robots" CONTENT="noindex,nofollow">
        <script src="/_Incapsula_Resource?SWJIYLWA=5074a744e2e3d891814e9a2dace20bd4,719d34d31c8e3a6e6fffd425f7e032f3">
</script>
        <body>
</body>
    </html>

"Have you seen anti-scraping measures when downloading CC0 images from the Met?"

So you could say I haven't seen it when downloading the images, because I can't programmatically get to the image URLs, at least not without jumping through hoops. Since the museum promotes open access, I was hoping they would provide a clear-cut, transparent way of getting those images programmatically.
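A client can at least tell this kind of block page apart from a real response. The Python sketch below simply looks for the `_Incapsula_Resource` script reference seen in the HTML above; that marker is taken from this one response, so treating it as a reliable signal is an assumption, and the function name and the JSON sample are hypothetical.

```python
# Marker copied from the blocked response shown above.
INCAPSULA_MARKER = "_Incapsula_Resource"

# Abbreviated copy of the HTML returned to curl/Postman in the comment above.
BLOCKED_BODY = """<html>
    <head>
        <META NAME="robots" CONTENT="noindex,nofollow">
        <script src="/_Incapsula_Resource?SWJIYLWA=5074a744e2e3d891814e9a2dace20bd4,719d34d31c8e3a6e6fffd425f7e032f3">
</script>
        <body>
</body>
    </html>"""


def looks_like_block_page(body):
    """Return True if the response body contains the Incapsula marker."""
    return INCAPSULA_MARKER in body


print(looks_like_block_page(BLOCKED_BODY))
# Hypothetical JSON body, only here to show the negative case.
print(looks_like_block_page('{"results": []}'))
```

This only detects the block so a script can stop or back off; it does nothing to get around it.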

gregsadetsky commented 5 years ago

@hamidzr I agree that it's strange that the Met provides this open CSV with no references to the image URLs..!

You could use this repo which I created about a year ago which links the Met object IDs with image URLs. I haven't updated it since, so there are probably new object IDs in the latest CSV for which the repo doesn't provide Image URLs (it should still be useful, as it contains ~380k URLs)

hamidzr commented 5 years ago

Thanks, that will certainly cover a large portion of the objects. I assume you didn't manually pull all this information. Given the current situation, do you think you could rerun your solution to update your dataset?

gregsadetsky commented 5 years ago

Just opened an issue on the repo to track this. I'll try to find the time! Cheers

multichill commented 5 years ago

FYI the met has an API now, see https://metmuseum.github.io/ . So for example https://collectionapi.metmuseum.org/public/collection/v1/objects/436535 includes links to images.
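A minimal Python sketch of reading image links from that endpoint is below. It assumes the object record exposes `primaryImage` (a single URL, empty when there is no image) and `additionalImages` (a list of URLs), which is how I understand the API documentation; `fetch_object` and `image_urls` are names made up for this example, and `SAMPLE_RECORD` is a hand-written record reusing the image URL from the first comment, not an actual API response.

```python
import json
from urllib.request import urlopen

API_ROOT = "https://collectionapi.metmuseum.org/public/collection/v1/objects"


def fetch_object(object_id):
    """Download and parse the JSON record for one object ID (needs network)."""
    with urlopen(f"{API_ROOT}/{object_id}") as response:
        return json.load(response)


def image_urls(record):
    """Return (default_url, all_urls) for a parsed object record.

    Assumes 'primaryImage' is a URL string (possibly empty) and
    'additionalImages' is a list of URL strings.
    """
    primary = record.get("primaryImage") or None
    extras = record.get("additionalImages") or []
    all_urls = ([primary] if primary else []) + list(extras)
    return primary, all_urls


# Hand-written sample record for illustration only.
SAMPLE_RECORD = {
    "objectID": 435809,
    "isPublicDomain": True,
    "primaryImage": "http://images.metmuseum.org/CRDImages/ep/original/DP119115.jpg",
    "additionalImages": [],
}

print(image_urls(SAMPLE_RECORD))
```

Returning one default URL plus the full list matches what was asked for earlier in the thread. For a live lookup, something like `image_urls(fetch_object(436535))` should work as long as the endpoint stays open.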

hamidzr commented 5 years ago

@multichill Thanks! I had seen their search API, but that is restricted from bot access, so I didn't look further. This one looks open.

gregsadetsky commented 3 years ago

Hey all, I've gone through the most recent Met CSV file and extracted all image URLs. I've also created a new repository for this, as I did something similar for the Art Institute of Chicago, which also publishes data through Open Access, but similarly does not make it easy to find the image urls.

You can find all of this here -- https://github.com/gregsadetsky/open-access-is-great-but-where-are-the-images

iplanwebsites commented 1 year ago

+1 Although @gregsadetsky's solution works, I can't believe we need to scrape 500,000 pages to get image URLs.