Open dougmet opened 5 years ago
@jwestw This is related to what we discussed today. As mentioned above this is relevant to the output of CSVW. In addition, this would meet a need in Open SDG, where we don't have any way to control the ordering of the columns.
Countries that are inputting their data from SDMX already have a data schema, in their DSD (data structure definition). So I think we can focus on the use-case of countries that are using CSV files, like the UK.
I'll throw out some ideas for approaches below. Personally I kind of lean towards "jsonschema per indicator" along with "auto-generated".
The first approach is a single jsonschema covering all indicators. There would be one (very long) jsonschema file in the country's data repository, like "data-schema.json". It would be a full collection of all the columns and values used across all indicators. For example, part of it might look like this:
{
  "Age": {
    "title": "Age",
    "description": "Description of the age column.",
    "type": "string",
    "enum": [
      "Under 15",
      "16 to 24"
    ]
  },
  "Sex": {
    "title": "Sex",
    "description": "Description of the sex column.",
    "type": "string",
    "enum": [
      "Not specified",
      "Female",
      "Male"
    ]
  },
  etc...
}
Pros: centrally located and comprehensive (this is analogous to an SDMX DSD).
Cons: the same file would need to be updated every time a data manager wants to add a new disaggregation column or value.
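To make the idea concrete, here is a minimal sketch of how a schema like the one above could be used to flag values that are not listed in a column's enum. The function name, the sample CSV, and the rule that empty cells are skipped are all assumptions for illustration, not existing Open SDG behaviour.

```python
import csv
import io

# Hypothetical excerpt of a global "data-schema.json", already parsed into a dict.
SCHEMA = {
    "Age": {"type": "string", "enum": ["Under 15", "16 to 24"]},
    "Sex": {"type": "string", "enum": ["Not specified", "Female", "Male"]},
}


def find_invalid_values(csv_text, schema):
    """Return (row_number, column, value) for each value missing from the schema."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        for column, value in row.items():
            spec = schema.get(column)
            # Assumption: columns without a schema entry and empty cells are not checked.
            if spec is None or not value:
                continue
            if value not in spec["enum"]:
                problems.append((row_number, column, value))
    return problems


SAMPLE_CSV = (
    "Year,Age,Sex,Value\n"
    "2019,Under 15,Female,10\n"
    "2019,25 to 34,Male,12\n"
    "2019,,,22\n"
)

print(find_invalid_values(SAMPLE_CSV, SCHEMA))
```

With this sample, only the "25 to 34" age value is reported, which is the kind of case where the data manager would need to update the shared file.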
The second approach is "jsonschema per indicator": there would be a separate jsonschema file for each indicator. It would look the same as the above, but would only contain the columns/values that are used in that indicator.
Pros: each indicator can be configured separately.
Cons: there may be some duplication.
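Since column ordering is part of the motivation here, one possible way a per-indicator schema could drive it is to treat the key order of the schema as the column order. This is only a sketch under that assumption; the function name and the idea of pinning "Year" first and "Value" last are hypothetical.

```python
def order_columns(columns, schema, leading=("Year",), trailing=("Value",)):
    """Order columns as: leading, schema key order, unknown (alphabetical), trailing."""
    fixed = set(leading) | set(trailing)
    # Python dicts (and json.load) preserve key order, so the schema order is kept.
    in_schema = [c for c in schema if c in columns and c not in fixed]
    # Columns with no schema entry fall back to alphabetical order.
    unknown = sorted(c for c in columns if c not in schema and c not in fixed)
    first = [c for c in leading if c in columns]
    last = [c for c in trailing if c in columns]
    return first + in_schema + unknown + last


# Hypothetical per-indicator schema: only the key order matters for this sketch.
indicator_schema = {"Age": {}, "Sex": {}}
print(order_columns(["Value", "Sex", "Region", "Age", "Year"], indicator_schema))
```

Here the data manager would control the order simply by rearranging the entries in the indicator's schema file.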
The third approach is "auto-generated", and it could be combined with either of the other two. In this approach, if a column did not have any jsonschema representation, then that jsonschema would be auto-generated, assuming "type": "string" and an enum of all the unique values in the column. (@jwestw I suspect this is partly what that ONS pipeline is doing when it converts to CSVW. So it's possible we could re-use that code or use it as a dependency.) Presumably during auto-generation the order of the columns and values would default to alphabetical.
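A rough sketch of what that auto-generation might look like, using only the standard library. The function name, the `existing` parameter for columns that already have a schema, and the choice to skip "Year" and "Value" are assumptions made for this example; this is not the ONS pipeline's code.

```python
import csv
import io


def generate_schema(csv_text, existing=None, skip=("Year", "Value")):
    """Add a schema entry for every column that does not already have one."""
    existing = existing or {}
    values = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        for column, value in row.items():
            # Keep hand-written entries untouched; skip non-disaggregation columns.
            if column is None or column in existing or column in skip:
                continue
            unique = values.setdefault(column, set())
            if value:  # ignore empty cells
                unique.add(value)
    schema = dict(existing)
    # Columns and enum values both default to alphabetical order.
    for column in sorted(values):
        schema[column] = {
            "title": column,
            "type": "string",
            "enum": sorted(values[column]),
        }
    return schema


SAMPLE = "Year,Sex,Age,Value\n2019,Male,Under 15,1\n2019,Female,,2\n"
print(generate_schema(SAMPLE))
```

Passing a hand-written entry through `existing` leaves it as it is, so a country could customize some columns and let the rest be generated.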
Using this same code we could also provide a way for countries to "initialize" a jsonschema file, for the purposes of customizing it. For example, say a country wants to customize their data schema for indicator 1.1.1 - they could run a Python script like python scripts/init-data-schema.py 1.1.1 or something to that effect, which would result in an auto-generated data-schemas/1-1-1.json file.
Pros: spares the countries from needing to maintain jsonschema.
Cons: none.
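For what it's worth, the "initialize" script mentioned above might look something like the following. Everything here is a sketch: the script does not exist yet, the data/indicator_1-1-1.csv input path is an assumed naming convention, and skipping "Year" and "Value" is a guess at what countries would want.

```python
import csv
import json
import sys
from pathlib import Path


def init_schema(indicator_id, data_dir="data", schema_dir="data-schemas"):
    """Write an auto-generated schema file for one indicator and return its path."""
    slug = indicator_id.replace(".", "-")  # "1.1.1" -> "1-1-1"
    csv_file = Path(data_dir) / f"indicator_{slug}.csv"  # assumed file naming
    target = Path(schema_dir) / f"{slug}.json"
    if target.exists():
        # Never overwrite a schema that may already have been customized.
        raise FileExistsError(f"{target} already exists")

    values = {}
    with csv_file.open(newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            for column, value in row.items():
                if column is None or column in ("Year", "Value"):
                    continue
                unique = values.setdefault(column, set())
                if value:
                    unique.add(value)

    # Alphabetical columns and values, each assumed to be "type": "string".
    schema = {
        column: {"title": column, "type": "string", "enum": sorted(values[column])}
        for column in sorted(values)
    }
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(schema, indent=2, ensure_ascii=False) + "\n", encoding="utf-8")
    return target


# Only acts when called as: python scripts/init-data-schema.py 1.1.1
if __name__ == "__main__" and len(sys.argv) == 2:
    print(init_schema(sys.argv[1]))
```

Refusing to overwrite an existing file is a deliberate choice in this sketch, so that re-running the script cannot wipe out a country's customizations.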
This is to support reading (#9) and writing to (#15) SDMX and CSVW (#19).
The indicator class will store the raw data and metadata, and we also want the schema for the data. This includes things such as: