ropensci / auunconf

repository for the Australian rOpenSci unconference 2016!

Creating high quality metadata using ropensci/EML (and ingesting these into workflows) #11

Open ivanhanigan opened 8 years ago

ivanhanigan commented 8 years ago

Ecological Metadata Language interface for R: synthesis and integration of heterogeneous data

The EML metadata standard allows highly detailed information to be recorded about methods, classificatory protocols, spatial and temporal coverage, and ownership/intellectual rights. It is also the language used by the open-source metacat portal for publishing data.

The ropensci EML package https://github.com/ropensci/EML has been in development for years and is now approaching a release to CRAN.

I suggest we consider the idea raised in the issue below to "have a little mini-hackathon where EML/R users could propose use cases, and we could review and code solutions to make those use cases for metadata creation straightforward": https://github.com/ropensci/EML/issues/144#issuecomment-194010572

ivanhanigan commented 8 years ago

Hi @tierneyn and @jonocarroll (CC @cboettig), I'm sorry to miss out on visiting you at the ropensci/auunconf, but I've set aside the time to work on this issue.
I'll be spending today and tomorrow implementing a workflow to munge data and create high quality metadata for one of our projects called APTEMA: "Air Pollution, Traffic Exposures and Mortality and Morbidity in Older Australians". This is a sub-study of the 45 and Up survey, and we aim to understand the impact of social, economic and environmental factors on the health of Australians in mid to later life, focussing on the opportunities for prevention. We need tools for synthesis and integration of heterogeneous data, so I am testing the https://github.com/ropensci/EML package.

A key component of this project will be air pollution estimates for small areas. This sub-sub-study is a complex set of linked datasets/workflows that are all combined at the main github repo https://github.com/swish-climate-impact-assessment/AirPollutionNeighbourhoodExposures. The work package aims to use the Bayesian Maximum Entropy geostatistical method (Christakos et al. 2002) to blend data with different levels of uncertainty for long-term pollution exposure estimates in Sydney and Perth.

In this discrete phase of work I will be combining the following datasets and creating metadata for each source dataset, and the combined output dataset.

I'll stop by the slack channel to see what other teams are doing too, but I'm going to dig in to this air pollution data now. Seeya!

ivanhanigan commented 8 years ago

I've got a complete (but invalid) minimal XML document for a dataset now. The R script is in the README.md at https://github.com/swish-climate-impact-assessment/CTM_CSIRO_Sydney_Shipping_2010_2011/. Any comments would be great. I might go through the same steps with a dataTable format and a spatialVector format tomorrow before coming back to see why these XML validation rules are being broken.
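
For anyone following along, here is a rough sketch of how the schema validation errors for a "complete but invalid" document can be listed from R. It assumes the xml2 package is installed, and it uses a tiny made-up schema and document as stand-ins (they are NOT the EML schema or my actual file), so treat the element names as hypothetical.

```r
library(xml2)

# Toy schema (hypothetical, NOT the EML schema): a dataset must contain
# a title followed by a creator.
schema <- read_xml('
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="dataset">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="title" type="xs:string"/>
        <xs:element name="creator" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>')

# Well-formed XML that is missing the required creator element.
doc_invalid <- read_xml("<dataset><title>Toy dataset</title></dataset>")

# The same document with the missing element added.
doc_valid <- read_xml(
  "<dataset><title>Toy dataset</title><creator>Someone</creator></dataset>"
)

# xml_validate() returns TRUE/FALSE; messages are in the "errors" attribute.
res_invalid <- xml_validate(doc_invalid, schema)
res_valid   <- xml_validate(doc_valid, schema)

# Print the reasons the first document fails validation.
print(attr(res_invalid, "errors"))
```

The error messages name the element where validation failed, which is usually enough to find the slot that is missing or in the wrong place.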

ivanhanigan commented 8 years ago

@cboettig my tested dataTable version is now working. The methods element seems to be attached to an invalid slot... I like this attributes/factors combination of data.frames: https://github.com/swish-climate-impact-assessment/OEH_monitor_Sydney_2007_2014
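
To illustrate what I mean by the attributes/factors combination, here is a minimal base R sketch. The data values, site codes and the unit string are invented for illustration, and the column names (attributeName, attributeDefinition, code, definition, ...) follow what I understand the EML package's `set_attributes()` to expect, so treat them as an assumption rather than a spec. The helper `undocumented_codes()` is hypothetical and not part of any package.

```r
# Hypothetical monitoring data (values invented for illustration only).
dat <- data.frame(
  site = c("A", "B", "A"),
  date = c("2010-01-01", "2010-01-01", "2010-01-02"),
  pm25 = c(7.2, 9.8, 6.5),
  stringsAsFactors = FALSE
)

# One row per column of the data, describing what the column means.
attributes <- data.frame(
  attributeName = c("site", "date", "pm25"),
  attributeDefinition = c("Monitoring site code",
                          "Date of observation",
                          "Daily mean PM2.5 concentration"),
  formatString = c(NA, "YYYY-MM-DD", NA),
  unit = c(NA, NA, "placeholderUnit"),  # placeholder, not a checked EML unit
  stringsAsFactors = FALSE
)

# One row per code of each categorical column.
factors <- data.frame(
  attributeName = c("site", "site"),
  code = c("A", "B"),
  definition = c("Hypothetical site A", "Hypothetical site B"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each coded column, list the values that occur
# in the data but have no definition in the factors table.
undocumented_codes <- function(dat, factors) {
  cols <- unique(factors$attributeName)
  out <- lapply(cols, function(col) {
    setdiff(unique(dat[[col]]), factors$code[factors$attributeName == col])
  })
  names(out) <- cols
  out
}

# Every data column should be described, and every code defined.
all_described <- setequal(attributes$attributeName, names(dat))
missing_codes <- undocumented_codes(dat, factors)
```

What I like about this layout is that the metadata stays in plain data.frames, so it can be checked against the data (as above) before any XML is generated.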