open-sdg / sdg-build

Python package to convert SDG-related data and metadata between formats
MIT License

Support data schema (column metadata) #20

Open dougmet opened 5 years ago

dougmet commented 5 years ago

This is to support reading (#9) and writing to (#15) SDMX and CSVW (#19)

The indicator class will store the raw data, and metadata, and we also want the schema for the data. This includes things such as:

brockfanning commented 3 years ago

@jwestw This is related to what we discussed today. As mentioned above, this is relevant to CSVW output. It would also meet a need in Open SDG, where we currently have no way to control the ordering of the columns.

Countries that are inputting their data from SDMX already have a data schema, in their DSD (data structure definition). So I think we can focus on the use-case of countries that are using CSV files, like the UK.

I'll throw out some ideas for approaches below. Personally I kind of lean towards "jsonschema per indicator" along with "auto-generated".

One central jsonschema file

With this approach, there would be a single (very long) jsonschema file in the country's data repository, like "data-schema.json". It would be a full collection of all the columns and values used across all indicators. For example, part of it might look like this:

{
    "Age": {
        "title": "Age",
        "description": "Description of the age column.",
        "type": "string",
        "enum": [
            "Under 15",
            "16 to 24"
        ]
    },
    "Sex": {
        "title": "Sex",
        "description": "Description of the sex column.",
        "type": "string",
        "enum": [
            "Not specified",
            "Female",
            "Male"
        ]
    },
    etc...
}

Pros: centrally located and comprehensive (this is analogous to an SDMX DSD)

Cons: the same file would need to be updated every time a data manager wants to add a new disaggregation column or value
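To make the idea concrete, here is a minimal sketch of how a central schema like the one above might be used to check an indicator's CSV. It uses only the standard library and the simple column-to-definition layout from the example, not a full JSON Schema validator. The function name `find_schema_errors`, the decision to skip the `Year` and `Value` columns, and the treatment of empty cells as "not disaggregated" are all assumptions for illustration.

```python
import csv
import io

# Abbreviated copy of the example schema above (normally loaded from
# a file such as data-schema.json with json.load).
DATA_SCHEMA = {
    "Age": {"type": "string", "enum": ["Under 15", "16 to 24"]},
    "Sex": {"type": "string", "enum": ["Not specified", "Female", "Male"]},
}


def find_schema_errors(rows, schema, skip=("Year", "Value")):
    """Return a list of messages for columns/values missing from the schema.

    Assumption: columns named in `skip` are not disaggregations and
    empty cells are always allowed.
    """
    errors = []
    # Line 1 of the CSV is the header, so data rows start at line 2.
    for line_number, row in enumerate(rows, start=2):
        for column, value in row.items():
            if column in skip or not value:
                continue
            if column not in schema:
                errors.append(f"line {line_number}: unknown column {column!r}")
            elif "enum" in schema[column] and value not in schema[column]["enum"]:
                errors.append(
                    f"line {line_number}: value {value!r} not allowed in column {column!r}"
                )
    return errors


sample = io.StringIO(
    "Year,Sex,Age,Value\n"
    "2015,Male,Under 15,10\n"
    "2015,Female,25 to 34,12\n"
)
errors = find_schema_errors(csv.DictReader(sample), DATA_SCHEMA)
for message in errors:
    print(message)
```

With this sample the second data row is reported, because "25 to 34" is not yet listed in the central file. That is the maintenance cost described in the "Cons" above.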

Jsonschema per indicator

With this approach there would be a separate jsonschema file for each indicator. It would look the same as the above, but would only contain the columns/values that are used in that indicator.

Pros: each indicator can be configured separately

Cons: may be some duplication
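One thing a per-indicator schema could provide is the column ordering mentioned earlier. The sketch below assumes that the order of the keys in the schema file is the intended column order, and that `Year` comes first and `Value` last; both are assumptions, as is the helper name `order_columns`. Note that JSON objects are formally unordered, so this relies on the parser preserving key order, which Python's `json` module does.

```python
# Hypothetical per-indicator schema that puts Sex before Age.
INDICATOR_SCHEMA = {
    "Sex": {"type": "string", "enum": ["Female", "Male"]},
    "Age": {"type": "string", "enum": ["Under 15", "16 to 24"]},
}


def order_columns(fieldnames, schema, first=("Year",), last=("Value",)):
    """Order CSV columns: fixed leading columns, schema order, leftovers, fixed trailing."""
    fixed = set(first) | set(last)
    # Columns described in the schema, in the order the schema lists them.
    in_schema = [c for c in schema if c in fieldnames and c not in fixed]
    # Columns the schema does not mention fall back to alphabetical order.
    leftover = sorted(c for c in fieldnames if c not in schema and c not in fixed)
    return (
        [c for c in first if c in fieldnames]
        + in_schema
        + leftover
        + [c for c in last if c in fieldnames]
    )


ordered = order_columns(["Year", "Age", "Region", "Sex", "Value"], INDICATOR_SCHEMA)
print(ordered)
```

The same idea could be applied to the order of the values within each `enum`.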

Auto-generated

This approach could be combined with one of the other two. In this approach, if a column did not have any jsonschema representation, then that jsonschema would be auto-generated, assuming "type": "string" and an enum of all the unique values in the column. (@jwestw I suspect this is partly what that ONS pipeline is doing when it converts to CSVW, so we may be able to re-use that code or use it as a dependency.) Presumably during auto-generation the order of the columns and values would default to alphabetical.
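A rough sketch of that auto-generation, using only the standard library: any column without an existing entry gets "type": "string" and an alphabetical enum of its unique values. The function name `fill_missing_schema`, skipping the `Year` and `Value` columns, and ignoring empty cells are assumptions for illustration, not existing sdg-build behaviour.

```python
import csv
import io
import json


def fill_missing_schema(rows, existing=None, skip=("Year", "Value")):
    """Return `existing` plus auto-generated entries for columns it lacks."""
    schema = dict(existing or {})
    seen = {}
    for row in rows:
        for column, value in row.items():
            # Leave non-disaggregation columns and hand-written entries alone.
            if column in skip or column in schema:
                continue
            values = seen.setdefault(column, set())
            if value:  # assumption: empty cells are not enum values
                values.add(value)
    # Generated columns and their values default to alphabetical order.
    for column in sorted(seen):
        schema[column] = {
            "title": column,
            "type": "string",
            "enum": sorted(seen[column]),
        }
    return schema


sample = io.StringIO(
    "Year,Sex,Age,Region,Value\n"
    "2015,Male,Under 15,North,10\n"
    "2015,Female,16 to 24,South,12\n"
    "2016,Female,,North,11\n"
)
hand_written = {
    "Sex": {"title": "Sex", "type": "string", "enum": ["Not specified", "Female", "Male"]},
}
schema = fill_missing_schema(csv.DictReader(sample), existing=hand_written)
print(json.dumps(schema, indent=4))
```

Plain alphabetical sorting puts "16 to 24" before "Under 15" here, which is the kind of default a country might want to override by customizing the file.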

Using this same code we could also provide a way for countries to "initialize" a jsonschema file, for the purposes of customizing it. For example, say a country wants to customize their data schema for indicator 1.1.1 - they could run a Python script like python scripts/init-data-schema.py 1.1.1 or something to that effect, which would result in an auto-generated data-schemas/1-1-1.json file.

Pros: spares the countries from needing to maintain jsonschema

Cons: none
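The "initialize" script could be as small as the sketch below. Everything about the layout is an assumption: that indicator data lives in `data/indicator_1-1-1.csv`, that schemas go in `data-schemas/`, and that `Year` and `Value` are not disaggregation columns. The function name `init_data_schema` is likewise hypothetical.

```python
import csv
import json
import sys
from pathlib import Path


def init_data_schema(indicator_id, data_dir="data", schema_dir="data-schemas"):
    """Write an auto-generated schema file for one indicator and return its path."""
    slug = indicator_id.replace(".", "-")  # "1.1.1" -> "1-1-1"
    data_path = Path(data_dir) / f"indicator_{slug}.csv"  # assumed naming
    schema_path = Path(schema_dir) / f"{slug}.json"
    if schema_path.exists():
        # Never overwrite a schema that someone may have customized.
        raise FileExistsError(f"{schema_path} already exists")

    with open(data_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        columns = [c for c in reader.fieldnames if c not in ("Year", "Value")]
        values = {c: set() for c in columns}
        for row in reader:
            for c in columns:
                if row[c]:  # skip empty cells
                    values[c].add(row[c])

    schema = {
        c: {"title": c, "type": "string", "enum": sorted(values[c])}
        for c in sorted(columns)
    }
    schema_path.parent.mkdir(parents=True, exist_ok=True)
    schema_path.write_text(json.dumps(schema, indent=4) + "\n", encoding="utf-8")
    return schema_path


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python scripts/init-data-schema.py 1.1.1
    print(init_data_schema(sys.argv[1]))
```

Refusing to overwrite an existing file is a design choice in this sketch, so that re-running the script cannot discard a country's customizations.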