vega / vega-datasets

Common repository for example datasets used by Vega-related projects

Conversion of SOURCES.md to SOURCES.yaml #634

Open dsmedia opened 1 day ago

dsmedia commented 1 day ago

Before migrating the extrinsic metadata of each dataset from Markdown to a machine-readable format, should we agree on a YAML template that would work well with the new Frictionless tooling? Should we make any fields required (such as sourcing) to ensure future datasets are properly documented before release? Not all datasets have sources now, but we can get those added. What should the YAML file be named?

dangotbanned commented 1 day ago

@dsmedia just to echo https://github.com/vega/vega-datasets/pull/631#issuecomment-2504151082

Is there a big benefit to including the yaml in addition to json? json is much more common (and the only one of the two natively supported in python/js), and the readability difference is small enough that I would say let's only have json.

@domoritz having yaml doesn't benefit me personally, just thought I'd provide the options @dsmedia mentioned in #629 (comment):

Just thinking out loud, but instead of directly maintaining the sources.md file, we could keep the dataset metadata in a json or yaml file, and generate the sources.md file from this machine-readable format.

I'm happy with just json

If we wanted a non-json format, I'd suggest .toml since it is natively supported in python. For the extrinsic fields you mentioned in (https://github.com/vega/vega-datasets/pull/631#issuecomment-2503760452), I imagine the toml-array-of-tables syntax would be handy.

https://github.com/vega/vega-datasets/blob/719c388cc844392cda24517e4e0cda976b1d8519/scripts/build_datapackage.py#L231-L234

I'm not sure how familiar you are with TypedDict(s), but you can enforce any required-and-notrequired constraints you like on the hierarchy I started in build_datapackage.py

domoritz commented 1 day ago

Sounds good to me. I don't mind either format and having automated checks sounds great.

dsmedia commented 45 minutes ago

Might something like this work for a TOML format, containing resource-level (i.e. dataset level) description, source and license information? This is just a proof-of-concept that includes three of the datasets: budget.json, countries.json, and gapminder.json. (I've also pulled into this file the package-level license information now hard-coded into the generation script file, to separate configuration from code.) I assume that the generation script will be able to match these to the resources with their resource name (i.e. the filename without the extension). At a later stage, resource-level column descriptions (for tabular data) could be incorporated into the TOML file (to supplement the column names and types identified by the script).

SOURCES.toml

```toml
# Package-level license information
[package.license]
name = "BSD-3-Clause"
path = "https://opensource.org/license/bsd-3-clause"
title = "The 3-Clause BSD License"

# Resource metadata
budget.description = "Budget FY 2016 - Receipts data from the Office of Management and Budget (U.S.)"
[budget.sources]
title = "Office of Management and Budget (U.S.)"
path = "https://www.govinfo.gov/app/details/BUDGET-2016-DB/BUDGET-2016-DB-3"

countries.description = """
This dataset combines key demographic indicators (life expectancy at birth and fertility rate measured as babies per woman) for various countries from 1955 to 2000 at 5-year intervals. It includes both current values and adjacent time period values (previous and next) for each indicator. Gapminder's [data documentation](https://www.gapminder.org/data/documentation/) notes that its philosophy is to fill data gaps with estimates and use current geographic boundaries for historical data. Gapminder states that it aims to "show people the big picture" rather than support detailed numeric analysis.
"""
[[countries.sources]]
title = "Gapminder Foundation - Life Expectancy"
path = "https://docs.google.com/spreadsheets/d/1RehxZjXd7_rG8v2pJYV6aY0J3LAsgUPDQnbY4dRdiSs/edit?gid=176703676#gid=176703676"
version = "v14"

[[countries.sources]]
title = "Gapminder Foundation - Fertility"
path = "https://docs.google.com/spreadsheets/d/1aLtIpAWvDGGa9k2XXEz6hZugWn0wCd5nmzaRPPjbYNA/edit?gid=176703676#gid=176703676"
version = "v14"

[countries.licenses]
name = "CC-BY-4.0"
path = "https://www.gapminder.org/free-material/"
title = "Creative Commons Attribution 4.0 International"

gapminder.description = """
This dataset combines key demographic indicators (life expectancy at birth, population, and fertility rate measured as babies per woman) for various countries from 1955 to 2005 at 5-year intervals. It also includes a 'cluster' column, a categorical variable grouping countries.
Gapminder's [data documentation](https://www.gapminder.org/data/documentation/) notes that its philosophy is to fill data gaps with estimates and use current geographic boundaries for historical data. Gapminder states that it aims to "show people the big picture" rather than support detailed numeric analysis.
"""
[[gapminder.sources]]
title = "Gapminder Foundation - Life Expectancy"
path = "https://docs.google.com/spreadsheets/d/1RehxZjXd7_rG8v2pJYV6aY0J3LAsgUPDQnbY4dRdiSs/edit?gid=176703676#gid=176703676"
version = "v14"

[[gapminder.sources]]
title = "Gapminder Foundation - Population"
path = "https://docs.google.com/spreadsheets/d/1c1luQNdpH90tNbMIeU7jD__59wQ0bdIGRFpbMm8ZBTk/edit?gid=176703676#gid=176703676"
version = "v7"

[[gapminder.sources]]
title = "Gapminder Foundation - Fertility"
path = "https://docs.google.com/spreadsheets/d/1aLtIpAWvDGGa9k2XXEz6hZugWn0wCd5nmzaRPPjbYNA/edit?gid=176703676#gid=176703676"
version = "v14"

[[gapminder.sources]]
title = "Gapminder Foundation - Data Geographies"
path = "https://docs.google.com/spreadsheets/d/1qHalit8sXC0R8oVXibc2wa2gY7bkwGzOybEMTWp-08o/edit?gid=1597424158#gid=1597424158"
version = "v2"

[gapminder.licenses]
name = "CC-BY-4.0"
path = "https://www.gapminder.org/free-material/"
title = "Creative Commons Attribution 4.0 International"
```

Here are the relevant excerpts from the Data Package and Data Resource definitions:

sources

[sources (resource-level)](https://datapackage.org/standard/data-resource/#sources)

> List of data sources as for [Data Package](https://datapackage.org/standard/data-package/#sources). If not specified the resource inherits from the data package.

[sources (package-level)](https://datapackage.org/standard/data-package/#sources)

> The raw sources for this data package. It MUST be an array of Source objects. A Source object MUST have at least one property. A Source object is RECOMMENDED to have title property and MAY have path, email, and version properties:
>
> - title: A string containing a title of the source (e.g. document or organization name).
> - path: A [URL or Path](https://datapackage.org/standard/glossary/#url-or-path), that is a fully qualified HTTP address, or a relative POSIX path.
> - email: A string containing an email address.
> - version: A string containing a version of the source.
>
> An example of the object structure is as follows:
>
>> "sources": [{
>>   "title": "World Bank and OECD",
>>   "path": "http://data.worldbank.org/indicator/NY.GDP.MKTP.CD"
>> }]
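The `sources` rules quoted above could be turned into an automated check along these lines. This is a sketch only: `validate_sources` and its messages are hypothetical, and it treats the MUST rules as errors and the RECOMMENDED `title` as a warning.

```python
from typing import Any

# Property names listed in the quoted Source object definition.
SOURCE_PROPERTIES = ("title", "path", "email", "version")


def validate_sources(sources: Any) -> tuple[list[str], list[str]]:
    """Return (errors, warnings) for a parsed `sources` value."""
    errors: list[str] = []
    warnings: list[str] = []
    if not isinstance(sources, list):
        # For example, a single [name.sources] TOML table parses to a dict.
        return ["sources MUST be an array"], warnings
    for index, source in enumerate(sources):
        if not isinstance(source, dict) or not source:
            errors.append(f"sources[{index}] MUST be an object with at least one property")
            continue
        for key in SOURCE_PROPERTIES:
            if key in source and not isinstance(source[key], str):
                errors.append(f"sources[{index}].{key} should be a string")
        if "title" not in source:
            warnings.append(f"sources[{index}] has no title (RECOMMENDED)")
    return errors, warnings


example = [{
    "title": "World Bank and OECD",
    "path": "http://data.worldbank.org/indicator/NY.GDP.MKTP.CD",
}]
print(validate_sources(example))
```

A stricter project policy, such as requiring `title` or `path` for every dataset, could promote the warning to an error.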
licenses

[licenses (resource-level)](https://datapackage.org/standard/data-resource/#licenses)

> List of licenses as for [Data Package](https://datapackage.org/standard/data-package/#licenses). If not specified the resource inherits from the data package.

[licenses (package-level)](https://datapackage.org/standard/data-package/#licenses)

> The license(s) under which the package is provided.
>
>> Caution
>> This property is not legally binding and does not guarantee the package is licensed under the terms defined in this property.
>
> licenses MUST be an array. Each item in the array is a License. Each MUST be an object. The object MUST contain a name property and/or a path property, and it MAY contain a title property:
>
> - name: A string containing an [Open Definition license ID](http://licenses.opendefinition.org/)
> - path: A [URL or Path](https://datapackage.org/standard/glossary/#url-or-path), that is a fully qualified HTTP address, or a relative POSIX path.
> - title: A string containing human-readable title.
>
> An example of using the licenses property:
>
>> "licenses": [{
>>   "name": "ODC-PDDL-1.0",
>>   "path": "http://opendatacommons.org/licenses/pddl/",
>>   "title": "Open Data Commons Public Domain Dedication and License v1.0"
>> }]
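The `licenses` rules and the inheritance note quoted above can be sketched in the same way. Both helper names are hypothetical, and the sketch assumes the package and resource descriptors are plain dicts.

```python
from typing import Any


def validate_licenses(licenses: Any) -> list[str]:
    """Return a list of violations of the quoted MUST rules (empty if none)."""
    if not isinstance(licenses, list):
        return ["licenses MUST be an array"]
    errors: list[str] = []
    for index, item in enumerate(licenses):
        if not isinstance(item, dict):
            errors.append(f"licenses[{index}] MUST be an object")
        elif "name" not in item and "path" not in item:
            errors.append(f"licenses[{index}] MUST contain name and/or path")
    return errors


def effective_licenses(resource: dict, package: dict) -> list[dict]:
    """Use resource-level licenses if specified, else inherit from the package."""
    if "licenses" in resource:
        return resource["licenses"]
    return package.get("licenses", [])


package = {"licenses": [{"name": "BSD-3-Clause"}]}
with_own = {"name": "gapminder", "licenses": [{"name": "CC-BY-4.0"}]}
without = {"name": "budget"}

print(effective_licenses(with_own, package))
print(effective_licenses(without, package))
print(validate_licenses([{"title": "No name or path"}]))
```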