Centraal Museum Utrecht Modecollectie

bennokr commented 10 years ago


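The Centraal Museum Modecollectie (http://www.opencultuurdata.nl/wiki/centraal-museum/) consists of fashion items in the Centraal Museum Utrecht. The data is in a static XML file on Dropbox (http://bit.ly/CM18062012), which I mirrored as a gist (8 MB, https://gist.githubusercontent.com/bennokr/8cdc528fde1d2a3358d5/raw/308f84452611a77671081e9c7df1bc8139ec3bbd/cmu.xml) so that it can be fetched over HTTP. This should probably be mirrored on OCD servers instead, as described in issue #12 (https://github.com/openstate/open-cultuur-data/issues/12)?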

Notebook

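To explore the data, @Gijs-Koot and I used this IPython notebook (http://nbviewer.ipython.org/github/bennokr/open-cultuur-data/blob/ed4721d16ae4fe2829a54a7cbf9e87204a6613f8/parse_centraal_museum_notebook.ipynb) for easy debugging and visualising. Have a look to see what the dataset is like. It makes things a lot easier!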

Date granularity parsing

In addition to the static XML extractor and the items, this pull request contains a general way of dealing with messy date strings. The function ocd_backend.utils.misc.parse_date parses a date string using a collection of regexes, which look for question marks and similar uncertainty markers and reconstruct a date together with its granularity. The function ocd_backend.utils.misc.parse_date_span takes two messy date strings, representing an estimated start and end date, and reconstructs a date and granularity based on the first date and the difference between the two. For example, with the CMU regexes, the dates 196? and None are transformed into 1960-01-01T00:00:00 with granularity 3. Input on this approach is very welcome!
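To make the idea concrete, here is a minimal sketch of regex-based date parsing with a granularity. It is not the code from ocd_backend.utils.misc: the name parse_messy_date and the pattern list are hypothetical, and it assumes that granularity counts the known leading digits of YYYYMMDD (4 = year, 3 = decade, 2 = century), which matches the 196? example above.

```python
import re
from datetime import datetime

# Hypothetical patterns for illustration only; the regexes in the pull
# request may differ. Each entry pairs a pattern with the granularity
# (assumed here to be the number of known digits in YYYYMMDD).
_PATTERNS = [
    (re.compile(r'^(\d{4})-(\d{2})-(\d{2})$'), 8),  # 1965-07-14
    (re.compile(r'^(\d{4})-(\d{2})$'), 6),          # 1965-07
    (re.compile(r'^(\d{4})$'), 4),                  # 1965
    (re.compile(r'^(\d{3})\?$'), 3),                # 196?
    (re.compile(r'^(\d{2})\?\?$'), 2),              # 19??
]


def parse_messy_date(text):
    """Return (datetime, granularity), or (None, None) if nothing matches."""
    if text is None:
        return None, None
    text = text.strip()
    for pattern, granularity in _PATTERNS:
        match = pattern.match(text)
        if not match:
            continue
        groups = match.groups()
        # Pad a partially known year with zeros: '196' -> 1960, '19' -> 1900.
        year = int(groups[0].ljust(4, '0'))
        month = int(groups[1]) if len(groups) > 1 else 1
        day = int(groups[2]) if len(groups) > 2 else 1
        try:
            return datetime(year, month, day), granularity
        except ValueError:
            # Impossible dates (month 13, year 0, ...) count as unparseable.
            return None, None
    return None, None
```

With these assumed patterns, parse_messy_date('196?') gives datetime(1960, 1, 1) with granularity 3, and any string that matches no pattern yields (None, None) rather than a guess.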

ajslaghu commented 10 years ago

http://www.opencultuurdata.nl/wp-content/uploads/2014/06/kostuums-CM-totaal.zip

is the link to the content on WordPress (I assumed unzipping is not an issue).
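For reference, unzipping can be done in memory with the Python standard library. This is only a sketch: extract_xml_members is a hypothetical helper, not part of ocd_backend, and it assumes the archive bytes have already been downloaded.

```python
import io
import zipfile


def extract_xml_members(zip_bytes):
    """Return {member name: raw bytes} for every .xml file in a ZIP archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return {
            name: archive.read(name)
            for name in archive.namelist()
            if name.lower().endswith('.xml')
        }
```

The returned bytes can then be handed to whatever XML parser the extractor already uses.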

Also, there is a debate about where questions on the datasets can be asked. Some suggest the data wiki on Open Cultuur Data; perhaps there are alternative options. I have the feeling this topic coincides with the hosting of static files.

Anyone?

Lex Slaghuis

From: Benno Kruit [mailto:notifications@github.com] Sent: Sunday 1 June 2014 21:33 To: openstate/open-cultuur-data Subject: [open-cultuur-data] Centraal Museum Utrecht Modecollectie (#23)

You can merge this Pull Request by running

git pull https://github.com/bennokr/open-cultuur-data cmutrecht

Or view, comment on, or merge it at:

https://github.com/openstate/open-cultuur-data/pull/23

Commit Summary

Added Centraal Museum Utrecht Modecollectie stump
add content fields from xml mapping
hardcoded rights and name and fixed issues
try a robust utf8 xml serialize ; public domain rights
add notebook file
Merge branch 'cmutrecht' of https://github.com/bennokr/open-cultuur-data into cmutrecht
notebook: afmetingen en afbeeldingen (dimensions and images)
index: afmetingen en afbeeldingen (dimensions and images)
index_data: acquisition, collections, creator roles, materials, tags, technique
notebook: date granularities
date granularities, fixes
acquisition, attributes, fixes
static file url as get_original_object_urls
documentation
notebook
notebook shorter
remove notebook

File Changes

M docs/user/datasets.rst (57): https://github.com/openstate/open-cultuur-data/pull/23/files#diff-0
A ocd_backend/extractors/cmutrecht.py (30): https://github.com/openstate/open-cultuur-data/pull/23/files#diff-1
A ocd_backend/items/cmutrecht.py (144): https://github.com/openstate/open-cultuur-data/pull/23/files#diff-2
M ocd_backend/sources.json (8): https://github.com/openstate/open-cultuur-data/pull/23/files#diff-3
M ocd_backend/utils/misc.py (47): https://github.com/openstate/open-cultuur-data/pull/23/files#diff-4

coret commented 10 years ago

The date granularity parsing functions (and example regexes) are really useful; there are a lot of messy dates out there!

Please merge this pull request!!

mbrinkerink commented 10 years ago

I’m also fine with using the issue tracker for questions about the datasets, since I’m monitoring it closely. But I can imagine that people reading the datablogs on the Open Cultuur Data website also want to leave their enquiries there. Is there some kind of GitHub plugin maybe?

Best,

Maarten

breyten commented 10 years ago

1. I'm having doubts about whether it's a good idea to parse messy dates, since as a by-product you lose the original information. This becomes somewhat of an issue when dealing with multiple cultural institutions that each have their own way of specifying things, so you might want to get at the original data, as it may contain more information. Granted, the regexes are currently cautious in nature, but this only becomes more important as we try to normalise data from a growing number of institutions.
2. I can't for the life of me figure out what parse_date_span is supposed to do and why it outputs these granularities. I would just go with the granularity of the start date.

bennokr commented 10 years ago

Lex, thanks, I'll add an unzipper and point it there.

Breyten,

1. You're right. Do you think just adding a plaintext date field to get_index_data() is a good solution? Arguably, the date is still in the source field.
2. The modecollectie items have a production.date.start and a production.date.end, which are when the production of the item started and when it finished. If an item took 10 years to make, we assumed the date granularity for the item should be at most 3. And because the date spans are often not exactly 1, 10 or 100 years, we rounded each span to the nearest of those values to calculate the granularity. This is something we made up ourselves, so it's very probable that there's a better way ;) @mbrinkerink, what do you think?
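The rounding rule described above could be sketched roughly as follows. This is an illustration, not the parse_date_span code in this pull request: span_granularity and its lookup table are hypothetical, granularity is assumed to mean 4 = year, 3 = decade, 2 = century, and "nearest" is assumed to mean nearest in plain year distance.

```python
# Hypothetical mapping from a rounded production span (in years) to the
# coarsest granularity that span still supports.
_SPAN_TO_GRANULARITY = {1: 4, 10: 3, 100: 2, 1000: 1}


def span_granularity(start_year, end_year, start_granularity=4):
    """Cap the start date's granularity by the length of the production span."""
    span = abs(end_year - start_year)
    if span == 0:
        # Start and end fall in the same year: nothing to coarsen.
        return start_granularity
    # Round the span to the nearest of 1, 10, 100 or 1000 years
    # (ties go to the smaller value because it comes first in the dict).
    nearest = min(_SPAN_TO_GRANULARITY, key=lambda years: abs(span - years))
    # The result is never finer than what the start date itself supports.
    return min(start_granularity, _SPAN_TO_GRANULARITY[nearest])
```

Under these assumptions an item produced between 1950 and 1960 gets granularity 3, and a start date that is already uncertain (such as 196?, granularity 3) is never made more precise by a short span.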

breyten commented 10 years ago

1. Ah, I forgot about that API call. Consider my objection removed.
2. This still does not make sense to me ;) What's the use case? Because you're throwing information away (assuming that the production values have a granularity of 4).

breyten commented 10 years ago

Merged. I left out the date range stuff (sorry benno ;)) and only did a slight modification of get_original_object_urls which was not returning data in the correct format. Closing.

justinvw commented 10 years ago

Hi @bennokr and @breyten, I made some modifications to the merged code:

Removed some unused imports in the CentraalMuseumUtrechtExtractor (138432e7948197072b18bd2abdaf95d7b6f38326)
The XML file is moved to static.opencultuurdata.nl (73c94bb10a26c6d0a6fa689832799ad53965236f).
Ditched parse_oai_response since the XML seems to be well-formed (07cfdab3e40139a8d3f6b54eee1f8ff375d372fb).
We now use the HttpRequestMixin to fetch the XML file (41aed00341c2291b03d52cfb1079ae9264267217)
The URL to the XML file is removed from the original_object_urls since the purpose of these URLs is that an API user can use them to fetch the single object from a remote source. Also, the Centraal Museum Utrecht appears to have unique pages for each item in their collection. The URL of this HTML view is now added to each object (93af48da1011b6fea02873a945bc0d4088cf21ad).
The 500x500px restriction set on the media_urls was removed (a600614417221e1c354901d69e57f733cc55e4bf). It appears that Adlib now serves larger images in some cases. It is unclear to me if we now get the highest available resolution...

I'm still looking into the date parsing and rights stuff.

Feel free to leave a comment if you have any questions or remarks.
