project-open-data / project-open-data.github.io

Open Data Policy — Managing Information as an Asset
https://project-open-data.cio.gov/

Document examples of data dictionary specifications and standards #465

Open peterbenmeyer opened 9 years ago

peterbenmeyer commented 9 years ago

An interagency team from federal statistical agencies is working on a data dictionary specification for the statistical agencies to use for Web-API accessible data. Here's our 1.01 draft:

https://github.com/USG-SCOPE/data-dictionary/blob/gh-pages/Metadata-Scheme-for-Data-Dictionaries.md

We ask for input from Project Open Data experts. We are interested in merging this with recommendations by POD. We'd welcome pointers to good practices to incorporate and/or recommend to statistical agencies. Please comment below and when we have useful feedback to chew on I (or somebody) can close this issue.

philipashlock commented 9 years ago

Thanks for developing this! It looks like there's some great work here. I wonder if we could also agree on a consistent serialization format, e.g. JSON, or ideally incorporate this as an extension to an existing standard, so that it can be represented in an interoperable, machine-readable way.
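To make the serialization idea concrete, here is a minimal sketch of a data dictionary round-tripped through JSON. It is not taken from the draft spec: the `FieldEntry` attributes (`name`, `label`, `data_type`, `description`), the top-level `dataset` and `fields` keys, and the sample entries are all hypothetical.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class FieldEntry:
    """One data dictionary entry. Attribute names are hypothetical."""
    name: str
    label: str
    data_type: str
    description: str


def to_json(dataset_title, entries):
    """Serialize a data dictionary to a JSON string with a stable key order."""
    doc = {"dataset": dataset_title, "fields": [asdict(e) for e in entries]}
    return json.dumps(doc, indent=2, sort_keys=True)


def from_json(text):
    """Parse the JSON string back into a title and a list of FieldEntry objects."""
    doc = json.loads(text)
    return doc["dataset"], [FieldEntry(**f) for f in doc["fields"]]


# Invented sample entries, for illustration only.
entries = [
    FieldEntry("area_code", "Area code", "string", "Code for the geographic area."),
    FieldEntry("value", "Value", "number", "Estimate for the area and period."),
]
serialized = to_json("Example series", entries)
title, restored = from_json(serialized)
```

Sorting the keys keeps the output stable across runs, which makes the files easier to diff when a dictionary is revised.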

For APIs, there are a number of emerging standards we reference on the API guidance page (Swagger, RAML, API Blueprint, HAL, Hydra, etc.), with an explanation of how to cite them in the metadata as a data dictionary. In my experience, Swagger has the most community-driven momentum and the most use within government.

For raw tabular bulk data (rather than an API), there's the emerging W3C Tabular Data Model standard. The current W3C draft builds on the Tabular Data Package and JSON Table Schema work (which also incorporates JSON Schema, which Swagger uses too), and those earlier specs may be easier to read than the W3C draft itself.
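As a rough sketch of how a descriptor in the style of JSON Table Schema could act as a machine-readable data dictionary for a CSV file: the `fields`, `name`, `type`, and `description` keys reflect my understanding of that draft and should be checked against it, while the column names, sample rows, and the `check_rows` helper are invented for illustration.

```python
import csv
import io

# Simplified descriptor: one entry per column, in column order.
SCHEMA = {
    "fields": [
        {"name": "year", "type": "integer", "description": "Calendar year."},
        {"name": "name", "type": "string", "description": "Given name."},
        {"name": "count", "type": "integer", "description": "Number of records."},
    ]
}

# Only a few types are handled in this sketch.
CASTERS = {"integer": int, "number": float, "string": str}


def check_rows(csv_text, schema):
    """Return (row_number, field_name, raw_value) for each cell that fails to cast."""
    reader = csv.DictReader(io.StringIO(csv_text))
    expected = [f["name"] for f in schema["fields"]]
    if reader.fieldnames != expected:
        raise ValueError(f"header {reader.fieldnames} does not match schema {expected}")
    problems = []
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        for field in schema["fields"]:
            raw = row[field["name"]]
            try:
                CASTERS[field["type"]](raw)
            except (TypeError, ValueError):
                problems.append((row_number, field["name"], raw))
    return problems


# Made-up rows; the second data row has a non-integer count.
SAMPLE = "year,name,count\n2014,Alex,12\n2014,Sam,n/a\n"
problems = check_rows(SAMPLE, SCHEMA)
```

The same descriptor doubles as human-readable documentation, since each field carries its own description.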

peterbenmeyer commented 9 years ago

That's an enriched reply! Thank you. We'll study and identify next steps. I'm leaving this open for now as we may want more back and forth.

rebeccawilliams commented 9 years ago

@philipashlock @gbinal etc, thoughts on including the bulk data standards somewhere on POD? Perhaps including a new page linked from the datasets section of What to Document – Datasets and Web APIs?

Thoughts on ways to highlight this data dictionary spec work?

gbinal commented 9 years ago

@rebeccawilliams - Sure - I think they could be linked from any number of places. The hard part is framing them with enough context that agency staff reading the site understand how they fit into all this.

peterbenmeyer commented 9 years ago

Thanks, all. We'll revise and expand to include a JSON example, and to make reference to some of the emerging standards that Philip refers to, either to build from, or at least to give context.
We'll get back to Project Open Data with another version and ask for further comments.

More comments would be welcome. I am especially interested in admirable examples of data dictionaries. I would like to refer to them in our spec and/or design our recommendations to match them.

rebeccawilliams commented 8 years ago

@peterbenmeyer and/or @philipashlock want to take a pass at this guidance?

rebeccawilliams commented 8 years ago

Adding notes from @chachasikes (#419) here:

I'm researching the use and formatting of open data dictionaries.

The Project Open Data schema defines a dataDictionary as: "URL to the data dictionary for the dataset or API. Note that documentation other than a data dictionary can be referenced using Related Documents as shown in the expanded fields."
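Since the field is just a URL, a catalog consumer can at least check that it is present and well formed. A small sketch, assuming a dataset record is a plain dict and using the key name `dataDictionary` as quoted above (the actual key in a given schema version should be checked); the `data_dictionary_url` helper and the example.gov URL are hypothetical.

```python
from urllib.parse import urlparse


def data_dictionary_url(dataset, key="dataDictionary"):
    """Return the data dictionary URL, or None if it is absent or not an http(s) URL."""
    value = dataset.get(key)
    if not isinstance(value, str):
        return None
    parsed = urlparse(value)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return value
    return None


# Invented records, for illustration only.
with_dictionary = {"title": "Example", "dataDictionary": "https://example.gov/dictionary.csv"}
without_dictionary = {"title": "Example"}
malformed = {"title": "Example", "dataDictionary": "see attached PDF"}
```

This only checks the shape of the URL; it says nothing about whether the linked document is actually a usable data dictionary.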

It could be helpful to provide a definition of what a data dictionary is actually supposed to be, and contribute some best practices towards formatting data dictionaries. Schemas help people put information out there in formats that are more useful, and I would like to see data dictionaries improve in the next couple of years.

In my survey of 10 data dictionaries, I found at least one incorrect file on data.gov (the SSA Baby Names data dictionary). I will keep track of other errors I find as I look at the data.

The formats of the dictionaries vary widely and many are hard to read: PDFs, search engines, Excel files, and Word docs.

It would be nice to see the communication aspects of data dictionaries improved. Gov.UK has some style guidelines on plain-language writing which might help anyone who is creating new data dictionaries or updating older ones.

(My initial survey)

At its simplest, a data dictionary can be a list of the field names and their definitions (see City of Chicago Data Dictionary). It's helpful if the field names also come with a table of the attributes for each field, since field values often use reference codes. I looked at Socrata (at the raw JSON data output), and they currently associate the following attributes with each field:

- id
- name (display name)
- fieldName (machine readable name)
- dataTypeName (checkbox, text, integer, etc)
- description
- comments
- renderTypeName (for machine display of field content)
- cachedContents (a summary)

From Wikipedia: http://en.wikipedia.org/wiki/Data_dictionary

A data dictionary, or metadata repository, as defined in the IBM Dictionary of Computing, is a "centralized repository of information about data such as meaning, relationships to other data, origin, usage, and format."[1] The term can have one of several closely related meanings pertaining to databases and database management systems (DBMS):

- A document describing a database or collection of databases
- An integral component of a DBMS that is required to determine its structure
- A piece of middleware that extends or supplants the native data dictionary of a DBMS

My Notes on Data Dictionaries
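As an illustration of the "list of field names and their definitions" idea, here is a minimal sketch that turns column metadata shaped like the Socrata attributes mentioned above into a plain-text table. The sample records are invented and `render_dictionary` is a hypothetical helper, not part of any Socrata tooling.

```python
def render_dictionary(columns):
    """Render column metadata as an aligned plain-text table, one row per field."""
    keys = ["fieldName", "name", "dataTypeName", "description"]
    # Header row first, then one row per column; missing attributes become "".
    rows = [keys] + [[str(c.get(k, "")) for k in keys] for c in columns]
    widths = [max(len(r[i]) for r in rows) for i in range(len(keys))]
    lines = [
        "  ".join(cell.ljust(w) for cell, w in zip(r, widths)).rstrip()
        for r in rows
    ]
    return "\n".join(lines)


# Invented column metadata, for illustration only.
columns = [
    {
        "id": 1,
        "fieldName": "case_number",
        "name": "Case Number",
        "dataTypeName": "text",
        "description": "Identifier assigned to the case.",
    },
    {"id": 2, "fieldName": "ward", "name": "Ward", "dataTypeName": "number"},
]
table = render_dictionary(columns)
```

A column with no description still gets a row, which makes gaps in the documentation easy to spot.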

I very much appreciate your schema and look forward to getting to know it better. If you have any responses or feedback, I would like to hear it, or even help out with this issue.

Chach Sikes

p.s. I also think a schema for JSON compatible data dictionary content would be helpful, but am researching that and will follow up with a separate issue if I find anything useful.