lucasdavid / wikiart

Full retriever for art and metadata in http://wikiart.org/
MIT License

"media" field not being scraped #1

Closed: stevedipaola closed this issue 7 years ago

stevedipaola commented 7 years ago

Many artists, such as van Gogh, have a "media" attribute that is useful for the work we want to do (an academic ML corpus), as it states whether a work is a painting or a drawing (oil, ...). Yet the scraper does not save the media field in the JSON files. I tried adding 'media' to the attributes in settings.py, but I am still not seeing it in the output. Is there a way to fix this? "media" has very useful metadata.

lucasdavid commented 7 years ago

Sorry for taking so long to answer this. For some reason I didn't get notified.

The list of attributes in settings.py is used by the WikiArtMetadataConverter class. It defines which of those will be considered when converting the .json files into the dataset list. This is useful if you don't need all information and want to make the dataset smaller. Therefore, it does not affect the info fetched. In fact, the WikiArtFetcher class downloads the paintings' metadata as they were retrieved from the API. So all attributes should be inside the saved file. For example:

GET https://www.wikiart.org/en/App/Painting/ImageJson/207241

{
    "artistUrl": "vincent-van-gogh",
    "url": "town-d-avray-l-etang-au-batelier-1875",
    "dictionaries": [
        465,
        1192
    ],
    "location": "LondonUnited Kingdom",
    "period": null,
    "serie": null,
    "genre": "sketch and study",
    "material": null,
    "style": "Realism",
    "technique": null,
    "sizeX": null,
    "sizeY": null,
    "diameter": null,
    "auction": null,
    "yearOfTrade": null,
    "lastPrice": null,
    "galleryName": "Van Gogh Museum, Amsterdam, Netherlands",
    "tags": "forests-and-trees",
    "description": null,
    "title": "Town d'Avray: L'Etang au Batelier",
    "contentId": 207241,
    "artistContentId": 204915,
    "artistName": "van Gogh Vincent ",
    "completitionYear": 1875,
    "yearAsString": "1875",
    "width": 801,
    "image": "https://uploads7.wikiart.org/images/vincent-van-gogh/town-d-avray-l-etang-au-batelier-1875.jpg!Large.jpg",
    "height": 1024
}

Notice that the media attribute doesn't appear for "Town d'Avray: L'Etang au Batelier". Can you give me the contentId of a painting for which this attribute exists?
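
To illustrate the distinction drawn above between fetching and converting, here is a minimal sketch. The names `ATTRIBUTES` and `convert` are hypothetical stand-ins, not the actual code of settings.py or WikiArtMetadataConverter, and filling absent attributes with `None` is an assumption about how a converter might behave. The point is only that a whitelist can narrow what was fetched but cannot add a key the API never returned.

```python
# Hypothetical attribute whitelist, standing in for the list in settings.py.
ATTRIBUTES = ["contentId", "title", "style", "genre", "media"]


def convert(record, attributes):
    """Keep only the listed attributes; absent ones become None (assumption)."""
    return {name: record.get(name) for name in attributes}


# A subset of the metadata shown above for contentId 207241.
fetched = {
    "contentId": 207241,
    "title": "Town d'Avray: L'Etang au Batelier",
    "style": "Realism",
    "genre": "sketch and study",
    "material": None,
    "technique": None,
}

# The fetched record has no "media" key, so listing it cannot create a value.
print("media" in fetched)
print(convert(fetched, ATTRIBUTES))
```

Adding 'media' to the whitelist therefore changes the shape of the converted dataset, not the information available in the saved files.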

stevedipaola commented 7 years ago

Yes, there is a "media" label on the pages of many artists, such as van Gogh.

For instance:

https://www.wikiart.org/en/vincent-van-gogh/the-potato-eaters-1885-1 has

Media: oil (https://www.wikiart.org/en/paintings-by-media/oil), canvas (https://www.wikiart.org/en/paintings-by-media/canvas)

https://www.wikiart.org/en/vincent-van-gogh/a-man-and-a-woman-seen-from-the-back-1886 has

Media: chalk (https://www.wikiart.org/en/paintings-by-media/chalk), paper (https://www.wikiart.org/en/paintings-by-media/paper)

and so on.

Not sure why the media field is in the output but not in the DB. Any thoughts? Admittedly, the media label is not used everywhere. It is tricky for the AI work we are doing, because it would be nice to sort out, say, drawings and use only paintings when filtering for a corpus.

lucasdavid commented 7 years ago

Oh, you were talking about the website... Okay, but you see, we have a problem here: although this information is available on the site, it is not exposed through the API! Take "The Potato Eaters" as an example and navigate to https://www.wikiart.org/en/App/Painting/ImageJson/205983 (this painting's endpoint). You won't see the media property there.

A possible workaround is to crawl each painting's WikiArt webpage and parse that information out after the fetcher has run. Once all the information has been retrieved, just update the JSON entries with this new property. The crawling and parsing part can be achieved using BeautifulSoup. If you do get to that, I'd appreciate it if you opened a pull request here. :-)
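
A rough sketch of the parse-and-update step might look like the following. To stay dependency-free it uses the standard library's `html.parser` rather than BeautifulSoup; the same idea carries over. It assumes that media values appear on the page as links whose href contains `/paintings-by-media/`, as the URLs quoted earlier in this thread suggest. The HTML fragment, `MediaLinkParser`, `extract_media`, and `add_media` are all hypothetical, and the real page markup may differ.

```python
import json
from html.parser import HTMLParser

# Assumed URL pattern, taken from the media links quoted in this thread.
MEDIA_PATH = "/paintings-by-media/"


class MediaLinkParser(HTMLParser):
    """Collect media slugs from links such as /en/paintings-by-media/oil."""

    def __init__(self):
        super().__init__()
        self.media = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if MEDIA_PATH in href:
            # Take the part after the marker, dropping any query string.
            slug = href.split(MEDIA_PATH, 1)[1].split("?")[0].strip("/")
            if slug and slug not in self.media:
                self.media.append(slug)


def extract_media(html):
    """Return the media slugs found in a painting page's HTML."""
    parser = MediaLinkParser()
    parser.feed(html)
    return parser.media


def add_media(record, html):
    """Return a copy of a painting record with a 'media' list added."""
    updated = dict(record)
    updated["media"] = extract_media(html)
    return updated


# Hypothetical page fragment; the real markup may differ.
sample_html = """
<li>Media:
  <a href="/en/paintings-by-media/oil">oil</a>,
  <a href="/en/paintings-by-media/canvas">canvas</a>
</li>
"""
record = {"contentId": 205983, "url": "the-potato-eaters-1885-1"}
print(json.dumps(add_media(record, sample_html)))
```

In a real run, the HTML would come from downloading each painting's page, and the updated records would be written back to the JSON files saved by the fetcher.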

Note: If people at wikiart eventually decide to make the "media" property available through the API, the fetcher will automatically retrieve it as well, and the workaround above can be simply removed.

Let me know if I can help you with more information.