silknow / crawler

SILKNOW crawler that collects metadata records describing silk material from various museums
Apache License 2.0
2 stars 1 forks source link

crawler

SILKNOW crawler that collects metadata records describing silk material from various museums.

Prerequisites

How to install

You first need to install dependencies, by using npm:

npm install

How to run

The crawler takes one paramater: the name of the museum to be crawled. For example:

npm start -- mfa-boston

Available parameters:

Parameter Description
--no-files Do not download files such as photos
--no-records Do not write the JSON records
--list-fields Returns a list of unique fields from JSON records. Also takes a --format parameter (values: "md" or "markdown" for Markdown, "json" for JSON, defaults to Markdown)
--check-images Re-download images marked with the hasError flag

List of museums

Crawled JSON structure of each museum can be found here

Notes about UNIPA

The UNIPA crawler parses local files only. It requires a database.json along with an images folder. The data has to be stored in data/unipa/resources.

Link to the dataset: https://www.dropbox.com/sh/a8zzv22r59q67eq/AAB4SOAGf1byLFwakYkzbcYFa?dl=0

Notes about Paris Musées

The Paris Musées API requires to generate a token by following the Paris Musées API documentation.

Once a token has bene obtained, add the environment variable PARIS_MUSEES_TOKEN=<token> (replace <token> with the token) before running the crawler.

Notes about MET Museum

MET Museum implements an anti-scrapping strategy which requires to first open this page into a web browser, then open the browser's inspector and type in the console: document.cookie to get the cookies. It should look like this: "incap_ses_XXX_XXXXXXX=abcDEFgHIjkLmNoPQrSTUvWxyZABCDEFGHijklMNOPqrSTUVwXYZAb==".

Finally, add the environment variable MET_MUSEUM_COOKIE="incap_ses_XXX_XXXXXXX=abcDEFgHIjkLmNoPQrSTUvWxyZABCDEFGHijklMNOPqrSTUVwXYZAb==" (replace with your own cookie) before running the crawler.

This cookie is only valid for a limited amount of time, but it should be enough to crawl the entire collection.

Notes about Musée d'Art et d'Industrie (St Etienne)

The Musée d'Art et d'Industrie (St Etienne) crawler parses local files only. It requires a export silknow.tsv file along with an media folder. The data has to be stored in data/musee-st-etienne/resources.

Link to the dataset: https://drive.google.com/drive/folders/1V-p9cJ-lNtUtGHW1ePv_k4rLsd_xbbyb

Development

Add the environment variable DEBUG=silknow:* to also output the debug logs.