Closed bennokr closed 10 years ago
http://www.opencultuurdata.nl/wp-content/uploads/2014/06/kostuums-CM-totaal.zip
is the link to the content on WordPress (I assumed unzipping is not an issue).
There is also a debate about where questions on the datasets can be asked. Some suggest the datawiki on Open Cultuur Data; perhaps there are alternative options. I have the feeling this topic coincides with the hosting of static files.
Anyone?
Lex Slaghuis
From: Benno Kruit [mailto:notifications@github.com] Sent: Sunday 1 June 2014 21:33 To: openstate/open-cultuur-data Subject: [open-cultuur-data] Centraal Museum Utrecht Modecollectie (#23)
The Centraal Museum Modecollectie (http://www.opencultuurdata.nl/wiki/centraal-museum/) consists of fashion items in the Centraal Museum Utrecht. The data is in a static XML file on Dropbox (http://bit.ly/CM18062012), which I mirrored as a gist (https://gist.githubusercontent.com/bennokr/8cdc528fde1d2a3358d5/raw/308f84452611a77671081e9c7df1bc8139ec3bbd/cmu.xml, 8MB) to be able to HTTP request it. This should probably be mirrored on OCD servers instead, as described in issue #12 (https://github.com/openstate/open-cultuur-data/issues/12)?
Notebook
In order to explore the data, @Gijs-Koot (https://github.com/Gijs-Koot) and I used this IPython notebook (http://nbviewer.ipython.org/github/bennokr/open-cultuur-data/blob/ed4721d16ae4fe2829a54a7cbf9e87204a6613f8/parse_centraal_museum_notebook.ipynb) for easy debugging and visualising. Have a look to see what the dataset's like. It makes things a lot easier!
Date granularity parsing
In addition to the static XML extractor and the items, this pull request contains a general way of dealing with messy date strings. The function ocd_backend.utils.misc.parse_date parses a date string using a collection of regexes that look for question marks and similar uncertainty markers, and reconstructs a date together with its granularity. The function ocd_backend.utils.misc.parse_date_span takes two messy date strings, representing an estimated start and end date, and reconstructs a granular date based on the first date and the difference between the two. For example, with the CMU regexes, the dates 196? and None are transformed into 1960-01-01T00:00:00 with granularity 3. Input on this approach is very welcome!
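To make the idea concrete, here is a minimal sketch of regex-based date parsing with a granularity. It is not the code from this pull request: the function name parse_date_sketch and the patterns are hypothetical, and the only mapping taken from the description above is 196? becoming 1960-01-01T00:00:00 with granularity 3. The values 4 (exact year) and 2 (century) are assumptions extrapolated from that one example.

```python
import re
from datetime import datetime

# Hypothetical patterns, NOT the actual CMU regexen from the pull request.
# Each entry is (regex, granularity, multiplier to turn the digits into a year).
# Only "196?" -> granularity 3 comes from the description; 4 and 2 are assumed.
PATTERNS = [
    (re.compile(r'^(\d{4})$'), 4, 1),        # "1965" -> exact year
    (re.compile(r'^(\d{3})\?$'), 3, 10),     # "196?" -> decade
    (re.compile(r'^(\d{2})\?\?$'), 2, 100),  # "19??" -> century
]


def parse_date_sketch(text):
    """Return (datetime, granularity), or (None, None) if nothing matches."""
    if text is None:
        return None, None
    text = text.strip()
    for regex, granularity, multiplier in PATTERNS:
        match = regex.match(text)
        if match is None:
            continue
        try:
            # Unknown digits are filled with zeros, the rest with 1 January.
            return datetime(int(match.group(1)) * multiplier, 1, 1), granularity
        except ValueError:
            # For example year 0, which datetime cannot represent.
            return None, None
    return None, None
```

Calling parse_date_sketch('196?') yields a datetime whose isoformat() is 1960-01-01T00:00:00 together with granularity 3. Trying the patterns in order keeps the most specific match first, which is one simple way to let each dataset supply its own list.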
You can merge this Pull Request by running
git pull https://github.com/bennokr/open-cultuur-data cmutrecht
Or view, comment on, or merge it at:
https://github.com/openstate/open-cultuur-data/pull/23
Reply to this email directly or view it on GitHub (https://github.com/openstate/open-cultuur-data/pull/23).
The date granularity parsing functions (and example regexes) are really useful; there are a lot of messy dates out there!
Please merge this pull request!!
I’m also fine with using the issue tracker for questions about the datasets, since I’m monitoring it closely. But I can imagine that people reading the datablogs on the Open Cultuur Data website also want to leave their enquiries there. Is there some kind of GitHub plugin maybe?
Best,
Maarten
On 2 Jun 2014, at 20:48, Lex Slaghuis notifications@github.com wrote:
Commit Summary
- Added Centraal Museum Utrecht Modecollectie stump
- add content fields from xml mapping
- hardcoded rights and name and fixed issues
- try a robust utf8 xml serialize ; public domain rights
- add notebook file
- Merge branch 'cmutrecht' of https://github.com/bennokr/open-cultuur-data into cmutrecht
- notebook: afmetingen en afbeeldingen (dimensions and images)
- index: afmetingen en afbeeldingen (dimensions and images)
- index_data: acquisition, collections, creator roles, materials, tags, technique
- notebook: date granularities
- date granularities, fixes
- acquisition, attributes, fixes
- static file url as get_original_object_urls
- documentation
- notebook
- notebook shorter
- remove notebook
File Changes
- M docs/user/datasets.rst (https://github.com/openstate/open-cultuur-data/pull/23/files#diff-0) (57)
- A ocd_backend/extractors/cmutrecht.py (https://github.com/openstate/open-cultuur-data/pull/23/files#diff-1) (30)
- A ocd_backend/items/cmutrecht.py (https://github.com/openstate/open-cultuur-data/pull/23/files#diff-2) (144)
- M ocd_backend/sources.json (https://github.com/openstate/open-cultuur-data/pull/23/files#diff-3) (8)
- M ocd_backend/utils/misc.py (https://github.com/openstate/open-cultuur-data/pull/23/files#diff-4) (47)
Patch Links:
- https://github.com/openstate/open-cultuur-data/pull/23.patch
- https://github.com/openstate/open-cultuur-data/pull/23.diff
Lex, thanks, I'll add an unzipper and point it there.
Breyten, get_index_data() is a good solution? Arguably, the date is still in the source field .production.date.start and production.date.start, which are when the production of the item was started and when it was finished. If an item took 10 years to make, we assumed this means the date granularity for the item should be at most 3. And because the date spans are often not exactly 1, 10 or 100 years, we rounded them to the nearest of those to calculate the granularity. This is something we made up ourselves, so it's very probable that there's a better way ;) @mbrinkerink, what do you think?
Merged. I left out the date range stuff (sorry Benno ;)) and only made a slight modification to get_original_object_urls, which was not returning data in the correct format. Closing.
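The span rounding described in this comment can be sketched roughly as follows. This is an illustration under assumptions, not the code that was left out of the merge: the function name span_granularity is hypothetical, granularity 4 is assumed to mean an exact year, and "nearest" is interpreted on a logarithmic scale, which the thread does not specify.

```python
import math


def span_granularity(start_year, end_year):
    """Map a production span in years to a date granularity.

    Assumed scale: a span of about 1 year gives 4, about 10 years gives 3,
    about 100 years gives 2. Only "10 years means at most 3" is taken from
    the discussion; the rest is extrapolated.
    """
    # Treat a zero or negative span as a single year.
    span = max(end_year - start_year, 1)
    # Round the span to the nearest power of ten on a log scale (assumption).
    exponent = int(round(math.log10(span)))
    # Never go below granularity 1 (a millennium).
    exponent = min(exponent, 3)
    return 4 - exponent
```

For instance, span_granularity(1960, 1970) gives 3, matching the ten-year case mentioned above. Rounding on a linear scale instead would move the cut-off points between the levels, so the choice is worth discussing before adopting either.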
Hi @bennokr and @breyten, I made some modifications to the merged code:
- parse_oai_response since the XML seems to be well-formed (07cfdab3e40139a8d3f6b54eee1f8ff375d372fb).
- original_object_urls since the purpose of these URLs is that an API user can use them to fetch the single object from a remote source. Also, the Centraal Museum Utrecht appears to have unique pages for each item in their collection. The URL of this HTML view is now added to each object (93af48da1011b6fea02873a945bc0d4088cf21ad).
- media_urls was removed (a600614417221e1c354901d69e57f733cc55e4bf). It appears that Adlib now serves larger images in some cases. It is unclear to me whether we now get the highest available resolution...
I'm still looking into the date parsing and rights stuff.
Feel free to leave a comment if you have any questions or remarks.