Open dsmedia opened 1 day ago
@dsmedia just to echo https://github.com/vega/vega-datasets/pull/631#issuecomment-2504151082
Is there a big benefit to including the yaml in addition to json? json is much more common (and the only one of the two natively supported in python/js), and the readability difference is small enough that I would say let's only have json.
@domoritz having `yaml` doesn't benefit me personally, just thought I'd provide the options @dsmedia mentioned in #629 (comment): "Just thinking out loud, but instead of directly maintaining the sources.md file, we could keep the dataset metadata in a json or yaml file, and generate the sources.md file from this machine-readable format."
I'm happy with just `json`. If we wanted a non-`json` format, I'd suggest `.toml`, since it is natively supported in `python`.
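To make the quoted idea concrete, here is a minimal sketch of generating a sources.md-style document from machine-readable metadata. The field names (`name`, `path`, `source`), the placeholder values, and the `render_sources_md` helper are all assumptions for illustration, not the repository's actual schema or script.

```python
import json

# Hypothetical metadata: field names and values are placeholders,
# not the real vega-datasets schema or real source attributions.
METADATA_JSON = """
[
  {"name": "countries", "path": "countries.json", "source": "placeholder source B"},
  {"name": "budget", "path": "budget.json", "source": "placeholder source A"}
]
"""


def render_sources_md(entries: list[dict]) -> str:
    """Render one markdown section per dataset entry, sorted by name."""
    lines = ["# Sources", ""]
    for entry in sorted(entries, key=lambda e: e["name"]):
        lines.append(f"## `{entry['path']}`")
        lines.append("")
        # Entries without a source are rendered explicitly rather than dropped.
        lines.append(f"Source: {entry.get('source', 'unknown')}")
        lines.append("")
    return "\n".join(lines)


sources_md = render_sources_md(json.loads(METADATA_JSON))
print(sources_md)
```

Sorting by name keeps the generated file stable between runs, which should make diffs easier to review.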
For the extrinsic fields you mentioned in (https://github.com/vega/vega-datasets/pull/631#issuecomment-2503760452), I imagine the toml-array-of-tables syntax would be handy.
I'm not sure how familiar you are with `TypedDict`(s), but you can enforce any required-and-notrequired constraints you like on the hierarchy I started in build_datapackage.py.
Sounds good to me. I don't mind either format and having automated checks sounds great.
Might something like this work for a TOML format, containing resource-level (i.e. dataset-level) description, source and license information? This is just a proof-of-concept that includes three of the datasets: `budget.json`, `countries.json`, and `gapminder.json`. (I've also pulled into this file the package-level license information now hard-coded into the generation script file, to separate configuration from code.) I assume that the generation script will be able to match these to the resources with their resource name (i.e. the filename without the extension). At a later stage, resource-level column descriptions (for tabular data) could be incorporated into the TOML file (to supplement the column names and types identified by the script).
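The matching step assumed above (resource name = filename without the extension) could look roughly like this. The `match_metadata` helper, the `data/` paths, and the metadata values are hypothetical; the real generation script may derive resource names differently.

```python
from pathlib import Path


def match_metadata(
    paths: list[str], metadata: dict[str, dict]
) -> tuple[dict[str, dict], list[str]]:
    """Pair each data file with the metadata entry keyed by its stem.

    Returns the matched entries and the paths with no metadata entry.
    """
    matched: dict[str, dict] = {}
    unmatched: list[str] = []
    for path in paths:
        name = Path(path).stem  # "data/budget.json" -> "budget"
        if name in metadata:
            matched[path] = metadata[name]
        else:
            unmatched.append(path)
    return matched, unmatched


# Hypothetical metadata keyed by resource name; values are placeholders.
metadata = {
    "budget": {"description": "placeholder"},
    "gapminder": {"description": "placeholder"},
}
matched, unmatched = match_metadata(
    ["data/budget.json", "data/countries.json", "data/gapminder.json"], metadata
)
print(sorted(matched), unmatched)
```

One caveat with keying on the stem: two files that differ only by extension would map to the same name, so the script may need a rule for that case.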
Here are the relevant excerpts from the data package definition and data resource definition
Before migrating the extrinsic metadata of each dataset from markdown to a machine-readable format, should we agree on a yaml template that would work well with the new Frictionless tooling? Should we make any fields required (like sourcing) to ensure future datasets are properly documented before release? Not all datasets have sources now, but we can get those added. What should the yaml file be named?