openstate / open-cultuur-data

The back- and front-end code that powers the Open Cultuur Data API
http://opencultuurdata.nl/

Centraal Museum Utrecht Modecollectie #23

Closed bennokr closed 10 years ago

bennokr commented 10 years ago

The Centraal Museum Modecollectie consists of fashion items in the Centraal Museum Utrecht. The data is in a static XML file on Dropbox, which I mirrored as a gist (8 MB) so that it can be requested over HTTP. Should it be mirrored on the OCD servers instead, as described in issue #12?

Notebook

In order to explore the data, @Gijs-Koot and I used this IPython notebook for easy debugging and visualising. Have a look to see what the dataset is like. It makes things a lot easier!

Date granularity parsing

In addition to the static XML extractor and the items, this pull request contains a general way of dealing with messy date strings. The function ocd_backend.utils.misc.parse_date parses a date string using a collection of regexen, which look for question marks and similar uncertainty markers and reconstruct a date together with its granularity. The function ocd_backend.utils.misc.parse_date_span takes two messy date strings, representing an estimated start and end date, and reconstructs a date and granularity based on the first date and the difference between the two. For example, with the CMU regexen, the dates 196? and None are transformed into 1960-01-01T00:00:00 with granularity 3. Input on this approach is very welcome!
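The question-mark case described above can be sketched as follows. This is a minimal illustration, not the code in ocd_backend.utils.misc: the name parse_date_sketch, the single regex, the return order, and the reading of granularity as the number of known year digits are all assumptions, chosen so that 196? comes out as 1960-01-01T00:00:00 with granularity 3.

```python
import re
from datetime import datetime

# Hypothetical pattern, not one of the CMU regexen from the pull request:
# one to four known year digits followed by up to three question marks.
YEAR_WITH_UNKNOWNS = re.compile(r'^(\d{1,4})(\?{0,3})$')


def parse_date_sketch(raw):
    """Return (granularity, datetime), or (None, None) if the string is not understood.

    Assumption: granularity is the number of known digits in the year.
    """
    if raw is None:
        return None, None
    match = YEAR_WITH_UNKNOWNS.match(raw.strip())
    if not match:
        return None, None
    known, unknown = match.groups()
    if len(known) + len(unknown) != 4:
        # Only four-position years such as 1965, 196? or 19?? are handled.
        return None, None
    year = int(known + '0' * len(unknown))  # unknown digits are filled with zeros
    if year < 1:
        return None, None
    return len(known), datetime(year, 1, 1)


granularity, date = parse_date_sketch('196?')
print(granularity, date.isoformat())
```

Anything the pattern does not recognise is returned as (None, None) rather than guessed at, which keeps the parsing cautious.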

ajslaghu commented 10 years ago

http://www.opencultuurdata.nl/wp-content/uploads/2014/06/kostuums-CM-totaal.zip

is the link to the content on WordPress (I assumed unzipping is not an issue).
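Unzipping can be handled with Python's standard library. A minimal sketch, assuming the archive has already been downloaded into memory as bytes; read_xml_members is a made-up name, not something in ocd_backend, and the file names inside the archive are not known here:

```python
import io
import zipfile


def read_xml_members(zip_bytes):
    """Return {member name: raw bytes} for every .xml file in the archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return {
            name: archive.read(name)
            for name in archive.namelist()
            if name.lower().endswith('.xml')
        }
```

The returned bytes can be handed to whatever XML parser the extractor already uses.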

Also, there is a debate about where questions about the datasets can be asked. Some suggest the datawiki on Open Cultuur Data; perhaps there are alternative options. I have the feeling this topic coincides with the hosting of static files.

Anyone?

Lex Slaghuis

From: Benno Kruit [mailto:notifications@github.com] Sent: Sunday 1 June 2014 21:33 To: openstate/open-cultuur-data Subject: [open-cultuur-data] Centraal Museum Utrecht Modecollectie (#23)

[Image removed by sender: logo] https://cloud.githubusercontent.com/assets/994319/3142462/ff798ffe-e9bf-11e3-93f8-0b80a7336b90.png

The Centraal Museum Modecollectie (http://www.opencultuurdata.nl/wiki/centraal-museum/) consists of fashion items in the Centraal Museum Utrecht. The data is in a static XML file on dropbox (http://bit.ly/CM18062012), which I mirrored as a gist (https://gist.githubusercontent.com/bennokr/8cdc528fde1d2a3358d5/raw/308f84452611a77671081e9c7df1bc8139ec3bbd/cmu.xml, 8MB) to be able to HTTP request it. This should probably be mirrored instead on OCD servers, as described in issue #12 (https://github.com/openstate/open-cultuur-data/issues/12)?

Notebook

In order to explore the data, @Gijs-Koot (https://github.com/Gijs-Koot) and I used this iPython notebook (http://nbviewer.ipython.org/github/bennokr/open-cultuur-data/blob/ed4721d16ae4fe2829a54a7cbf9e87204a6613f8/parse_centraal_museum_notebook.ipynb) for easy debugging and visualising. Have a look to see what the dataset's like. It makes things a lot easier!

You can merge this Pull Request by running

git pull https://github.com/bennokr/open-cultuur-data cmutrecht

Or view, comment on, or merge it at:

https://github.com/openstate/open-cultuur-data/pull/23


coret commented 10 years ago

The date granularity parsing functions (and example regexen) are really useful; there are a lot of messy dates out there!

Please merge this pull request!!

mbrinkerink commented 10 years ago

I’m also fine with using the issue tracker for questions about the datasets, since I’m monitoring it closely. But I can imagine that people reading the datablogs on the Open Cultuur Data website also want to leave their enquiries there. Is there some kind of GitHub plugin maybe?

Best,

Maarten

On 2 Jun 2014, at 20:48, Lex Slaghuis notifications@github.com wrote:

From: Benno Kruit [mailto:notifications@github.com] Sent: Sunday 1 June 2014 21:33 To: openstate/open-cultuur-data Subject: [open-cultuur-data] Centraal Museum Utrecht Modecollectie (#23)

Commit Summary

  • Added Centraal Museum Utrecht Modecollectie stump
  • add content fields from xml mapping
  • hardcoded rights and name and fixed issues
  • try a robust utf8 xml serialize ; public domain rights
  • add notebook file
  • Merge branch 'cmutrecht' of https://github.com/bennokr/open-cultuur-data into cmutrecht
  • notebook: afmetingen en afbeeldingen
  • index: afmetingen en afbeeldingen
  • index_data: acquisition, collections, creator roles, materials, tags, technique
  • notebook: date granularities
  • date granularities, fixes
  • acquisition, attributes, fixes
  • static file url as get_original_object_urls
  • documentation
  • notebook
  • notebook shorter
  • remove notebook

File Changes

  • M docs/user/datasets.rst https://github.com/openstate/open-cultuur-data/pull/23/files#diff-0 (57)
  • A ocd_backend/extractors/cmutrecht.py https://github.com/openstate/open-cultuur-data/pull/23/files#diff-1 (30)
  • A ocd_backend/items/cmutrecht.py https://github.com/openstate/open-cultuur-data/pull/23/files#diff-2 (144)
  • M ocd_backend/sources.json https://github.com/openstate/open-cultuur-data/pull/23/files#diff-3 (8)
  • M ocd_backend/utils/misc.py https://github.com/openstate/open-cultuur-data/pull/23/files#diff-4 (47)


breyten commented 10 years ago
  1. I have doubts about whether it's a good idea to parse messy dates, since as a by-product you lose the original information. This becomes an issue when dealing with multiple cultural institutions that each have their own way of specifying things, so you might want access to the original data, as it may contain more information. I note that the regexes are currently cautious in nature, but this only becomes more important as we try to normalise data from a growing number of institutions.
  2. I can't for the life of me figure out what parse_date_span is supposed to do and why it's outputting these granularities. I would just go with the granularity of the start date.
bennokr commented 10 years ago

Lex, thanks, I'll add an unzipper and point it there. Breyten,

  1. You're right. Do you think just adding a plaintext date field to get_index_data() is a good solution? Arguably, the date is still in the source field.
  2. The modecollectie items have a production.date.start and a production.date.end, which are when the production of the item was started and when it was finished. If an item took 10 years to make, we assumed this means the date granularity for the item should be at most 3. And because the date spans are often not exactly 1, 10 or 100 years, we rounded each span to the nearest of those values to calculate the granularity. This is something we made up ourselves, so it's very probable that there's a better way ;) @mbrinkerink, what do you think?
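The rounding rule described in point 2 could look roughly like this. It is a sketch under assumptions: the function name is made up, "nearest" is taken to mean the smallest absolute difference in years, and spans of 1, 10 and 100 years are mapped to granularities 4, 3 and 2 (only the pairing of 10 years with granularity 3 is stated above; the other two are extrapolated).

```python
# Assumed mapping from a rounded production span (in years) to a granularity.
# Only "10 years means at most 3" comes from the discussion; the entries for
# 1 and 100 years are extrapolated for illustration.
SPAN_TO_GRANULARITY = {1: 4, 10: 3, 100: 2}


def span_granularity_sketch(start_year, end_year, start_granularity=4):
    """Cap the start date's granularity by the one implied by the span."""
    span = abs(end_year - start_year)
    # Round the span to the nearest of 1, 10 or 100 years.
    nearest = min(SPAN_TO_GRANULARITY, key=lambda years: abs(years - span))
    return min(start_granularity, SPAN_TO_GRANULARITY[nearest])


print(span_granularity_sketch(1960, 1968))
```

Ignoring the span and returning start_granularity unchanged gives the simpler alternative of using only the granularity of the start date.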
breyten commented 10 years ago
  1. Ah, I forgot about that API call. Consider my objection removed.
  2. This still does not make sense to me ;) What's the use case? Because you're throwing information away (assuming that the production values have a granularity of 4)
breyten commented 10 years ago

Merged. I left out the date range stuff (sorry benno ;)) and only made a slight modification to get_original_object_urls, which was not returning data in the correct format. Closing.

justinvw commented 10 years ago

Hi @bennokr and @breyten, I made some modifications to the merged code:

I'm still looking into the date parsing and rights stuff.

Feel free to leave a comment if you have any questions or remarks.