project-open-data / project-open-data.github.io

Open Data Policy — Managing Information as an Asset
https://project-open-data.cio.gov/

Better definition for DataDictionary #419

Closed chachasikes closed 9 years ago

chachasikes commented 9 years ago

I'm researching the use and formatting of open data dictionaries.

The Project Open Data schema defines a dataDictionary as: "URL to the data dictionary for the dataset or API. Note that documentation other than a data dictionary can be referenced using Related Documents as shown in the expanded fields."
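To make that definition concrete, here is a minimal sketch of a dataset entry that carries a `dataDictionary` URL, plus a small check that the value really is a URL. Only the `dataDictionary` key comes from the schema text quoted above; the `title` value, the example.gov address, and the helper name `has_data_dictionary_url` are assumptions made up for illustration.

```python
import json
from urllib.parse import urlparse

# Hypothetical dataset entry. Only the "dataDictionary" key is taken from the
# schema definition quoted above; the values are invented for illustration.
dataset = {
    "title": "Example Permits Dataset",
    "dataDictionary": "https://example.gov/data/permits/data-dictionary.html",
}


def has_data_dictionary_url(entry):
    """Return True if the entry has a dataDictionary value that is an http(s) URL."""
    value = entry.get("dataDictionary")
    if not isinstance(value, str):
        return False
    parsed = urlparse(value)
    # Require both a web scheme and a host, so plain text does not pass.
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


print(json.dumps(dataset, indent=2))
print(has_data_dictionary_url(dataset))
```

Note that this only confirms a link is present. It says nothing about what the linked document contains, which is the gap this issue is about.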

It could be helpful to provide a definition of what a data dictionary is actually supposed to be, and contribute some best practices towards formatting data dictionaries. Schemas help people put information out there in formats that are more useful, and I would like to see data dictionaries improve in the next couple of years.

(My initial survey)

At its simplest, a data dictionary can be

I looked at Socrata (at the raw JSON data output), and they currently store the following properties for each field:


From Wikipedia (http://en.wikipedia.org/wiki/Data_dictionary): A data dictionary, or metadata repository, as defined in the IBM Dictionary of Computing, is a "centralized repository of information about data such as meaning, relationships to other data, origin, usage, and format." The term can have one of several closely related meanings pertaining to databases and database management systems (DBMS):


My Notes on Data Dictionaries

I very much appreciate your schema and look forward to getting to know it better. If you have any responses or feedback, I would like to hear them, and I would be glad to help out with this issue.

P.S. I also think a schema for JSON-compatible data dictionary content would be helpful, but I am still researching that and will follow up with a separate issue if I find anything useful.

philipashlock commented 9 years ago

Thanks for the feedback! On your last point about a schema for data dictionary content: we've already included some guidance on using established schema standards for machine-readable API documentation in our API guidance, but we haven't done this for any other kinds of datasets. For simple tabular data like CSVs, the JSON-based Data Package specification, specifically the JSON Table Schema, looks like it has potential (that page also links to related work). That spec work now appears to be getting formalized as part of the W3C Model for Tabular Data and Metadata on the Web.
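As a rough sketch of what a JSON Table Schema style data dictionary for a CSV could look like, the snippet below describes three columns with a `fields` list of `name`, `type`, and `description` entries, then reports any CSV header columns the dictionary does not document. The column names, the sample CSV, and the helper `undocumented_columns` are hypothetical and not taken from any published dataset.

```python
import csv
import io

# Hypothetical data dictionary in the style of JSON Table Schema:
# a "fields" list in which each entry describes one column.
schema = {
    "fields": [
        {"name": "permit_id", "type": "integer",
         "description": "Unique identifier for the permit."},
        {"name": "issued", "type": "date",
         "description": "Date the permit was issued."},
        {"name": "address", "type": "string",
         "description": "Street address the permit applies to."},
    ]
}


def undocumented_columns(csv_text, table_schema):
    """Return CSV header columns that have no entry in the schema's fields list."""
    header = next(csv.reader(io.StringIO(csv_text)))
    documented = {field["name"] for field in table_schema["fields"]}
    return [column for column in header if column not in documented]


# Invented sample data with one column ("status") missing from the dictionary.
sample = "permit_id,issued,address,status\n1,2014-01-02,1 Main St,open\n"
print(undocumented_columns(sample, schema))
```

A check like this is one reason a machine-readable dictionary is attractive: a publisher could flag columns that were added to a dataset without being documented.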

For something more complex like XML, or for anything that's just free-form HTML, it would be good to have a recommended way of formatting and presenting that (using XSL if needed) with a basic HTML5/CSS template.

rebeccawilliams commented 9 years ago

Closing this issue, but I have ported over the notes to #465