open-sdg / sdg-build

Python package to convert SDG-related data and metadata between formats
MIT License

Data input considerations #200

Open LucyGwilliamAdmin opened 3 years ago

LucyGwilliamAdmin commented 3 years ago

As we start making different inputs available, we need to consider how the open-sdg-data-starter looks. I thought it would be good to start a conversation here.

@brockfanning I know you mentioned above moving to an SDMX data starter.

Here are some of my thoughts.

What are others' thoughts?

LucyGwilliamAdmin commented 3 years ago

And further on from this, re. metadata: something which I find easier, especially when using SDMX, is removing the reporting_status field from the md files so that it defaults to what is configured (complete when data is available, notstarted when data isn't available).

This would also be useful if we no longer have placeholder files for each indicator.

EDIT: I have a script which could be run on the md files if we wish to implement this
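For illustration only (this is not the script mentioned above), here is a minimal sketch of that kind of cleanup. It assumes each md file starts with YAML front matter delimited by `---` lines and that reporting_status is a single top-level line; the function name `strip_reporting_status` is hypothetical and not part of sdg-build.

```python
import re

# Matches one top-level "reporting_status: ..." line, including its newline.
# Assumption: the field is written on a single line in the front matter.
REPORTING_STATUS_LINE = re.compile(r"^reporting_status[ \t]*:.*\n?", re.MULTILINE)


def strip_reporting_status(text: str) -> str:
    """Return the md file text with reporting_status removed from its front matter."""
    if not text.startswith("---"):
        return text  # no front matter, leave the file untouched
    # Front matter is whatever sits between the first two '---' delimiters.
    parts = text.split("---", 2)
    if len(parts) < 3:
        return text  # unterminated front matter, leave the file untouched
    _, front_matter, body = parts
    front_matter = REPORTING_STATUS_LINE.sub("", front_matter)
    return "---" + front_matter + "---" + body
```

To apply it to a folder, one could loop over the md files with `pathlib.Path(folder).glob("*.md")`, reading and rewriting each file; the folder name depends on how the data repository is laid out.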

brockfanning commented 3 years ago

@LucyGwilliamAdmin I agree with removing all those placeholder files, and just having examples for 1.1.1. Also I agree with removing reporting_status from the md files.

Regarding defaulting to SDMX - my only concern is that working with SDMX is harder than CSV - or at least, it requires less-well-known tools. With CSV you can use Excel or even hand-edit it, but with SDMX you would likely need an SDMX tool.

But your point is well-taken that it is harder to switch from CSV to SDMX than the other way round. I think the typical culprits here are any CSV files that have non-SDMX-compliant column names and/or values. So I'm wondering if we could focus on getting all the "plumbing" in place so that the CSV example file could start off with SDMX-compliant columns/values. For example, if the CSV file could start off like:

| TIME_DETAIL | SERIES | UNIT_MEASURE | OBS_VALUE |
| --- | --- | --- | --- |
| 2008 | SI_POV_DAY1 | PT | 76 |

A couple of PRs that could help get us towards that are #182 and https://github.com/open-sdg/open-sdg/pull/1027. These would at least allow for SERIES and UNIT_MEASURE, so we could have something like this:

| Year | SERIES | UNIT_MEASURE | Value |
| --- | --- | --- | --- |
| 2008 | SI_POV_DAY1 | PT | 76 |

(The next step would be to allow TIME_DETAIL instead of Year, and OBS_VALUE instead of Value. These are a bit more tricky but definitely possible.)
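To make that next step concrete, here is a rough sketch of the header mapping involved, using only the Python standard library. It is not how sdg-build handles this; the `SDMX_COLUMNS` mapping and the `rename_to_sdmx` helper are hypothetical names, and the only mappings assumed are the two mentioned above (Year to TIME_DETAIL, Value to OBS_VALUE).

```python
import csv
import io

# Assumed mapping from the current starter headers to SDMX concept names.
SDMX_COLUMNS = {"Year": "TIME_DETAIL", "Value": "OBS_VALUE"}


def rename_to_sdmx(csv_text: str) -> str:
    """Rename the header row of a CSV string; data rows pass through unchanged."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return csv_text  # empty input, nothing to rename
    # Only headers listed in SDMX_COLUMNS change; others (e.g. SERIES) are kept.
    rows[0] = [SDMX_COLUMNS.get(name, name) for name in rows[0]]
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()
```

Called on the second example above (as comma-separated text), this would produce the header row of the first example while leaving the 2008 row as it is.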

Eventually, along with #202, we could have the data-starter ship with CSV files that are already being converted into SDMX, without requiring a DSD.