silknow / crawler

SILKNOW crawler that collects metadata records describing silk material from various museums
Apache License 2.0

Update RISD Museum crawler #3

Closed ehrhart closed 5 years ago

ehrhart commented 5 years ago

The RISD Museum website has been redesigned.

Major changes:

Note: prior to the new changes, we had 689 records and 1958 photos.

ehrhart commented 5 years ago

We have two options for limiting the results:

Option 1. Search for the keyword "silk" with the type set to Textiles. The search results are slow to load and the request sometimes times out. When it works, we get 3381 results, some of which may be false positives because of the keyword search.

Option 2. Set only the type to Textiles. We get 7410 results, which can be filtered during scraping by checking whether the "Materials" field contains the value "silk".

Due to the instability of Option 1, we will use Option 2 for now.
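For illustration, the post-scrape filter described in Option 2 could look roughly like the sketch below. This is not the crawler's actual code: the function names, the record shape (a dict with a "Materials" key), the case-insensitive match, and the handling of list-valued fields are all assumptions.

```python
def mentions_silk(record: dict) -> bool:
    """Return True if the record's "Materials" field contains "silk".

    Hypothetical helper: the record layout is assumed, not taken from the crawler.
    """
    materials = record.get("Materials") or ""
    # Assumption: the field may be a single string or a list of strings.
    if isinstance(materials, (list, tuple)):
        materials = " ".join(str(m) for m in materials)
    # Assumption: match case-insensitively on the substring "silk".
    return "silk" in str(materials).lower()


def filter_silk(records):
    """Keep only the records whose materials mention silk."""
    return [r for r in records if mentions_silk(r)]


# Made-up sample records, for demonstration only.
sample = [
    {"id": 1, "Materials": "Silk; metallic thread"},
    {"id": 2, "Materials": "Cotton"},
    {"id": 3, "Materials": ["wool", "silk velvet"]},
    {"id": 4},  # no "Materials" field at all
]

print([r["id"] for r in filter_silk(sample)])  # [1, 3]
```

A plain substring check also matches words such as "silkscreen", so a stricter word-boundary match may be preferable if those turn out to be a problem.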

The crawl is in progress. Once it is done, I will update this issue with the actual number of results (post-filtering).

ehrhart commented 5 years ago

With Option 2, after filtering (by checking whether the "Materials" field contains the value "silk"), we get 3312 results. That is close to the 3381 results from Option 1: a difference of 69, or about 2%.

rtroncy commented 5 years ago

Good. Can you push this new metadata dump to the owncloud server?

ehrhart commented 5 years ago

The owncloud server contains the latest version of the JSON files.