Data input considerations

LucyGwilliamAdmin commented 3 years ago

As we start making different inputs available, we need to consider how the open-sdg-data-starter looks. I thought it would be good to start a conversation here.

@brockfanning I know you mentioned above moving to an sdmx data starter.

Here are some of my thoughts.

1. The data folder shouldn't have a placeholder file for each indicator, no matter what data type is used. This makes it really difficult to change to a different data input, as it is very tedious to bulk delete files from GitHub using the web interface.
2. Instead, maybe the data folder could have a placeholder for each file type (.csv, .xml, .json) for one indicator (e.g. 1.1.1, as this is the one that is always turned on). This makes it easier to change data input type because:

   a. There is no need to delete 200+ files that are no longer needed. Instead you would just need to delete the two from the data types not being used, and even then they don't NEED to be deleted; they can just sit there and not be an issue because there are so few of them.

   b. When the inputs option is changed in the data config file before any new data type files have been uploaded, the build won't fail, as it will be able to find at least one of the files that is needed (as long as a DSD is uploaded or pointed to when changing data input, if the next point is not implemented alongside this).
3. The default is changed from CSV to SDMX. This will make changing type easier, as I think it's much easier to change from using SDMX to CSV than it is from using CSV to SDMX.

What are others' thoughts?

LucyGwilliamAdmin commented 3 years ago

And further on from this re. metadata, something which I find is easier, especially when using SDMX, is removing the reporting_status field from the md files so that it defaults to what is configured (complete when data is available, notstarted when data isn't available).

This would also be useful if not having placeholder files for each indicator

EDIT: I have a script which could be run on the md files if we wish to implement this
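For illustration only, here is a minimal sketch of what such a clean-up could look like. It is not the script mentioned above; the function names and the assumption that each md file starts with YAML front matter delimited by `---` lines are hypothetical.

```python
import re
from pathlib import Path

# Matches a top-level "reporting_status: ..." line (hypothetical helper).
_STATUS_LINE = re.compile(r"^reporting_status\s*:.*(?:\r?\n|$)", re.MULTILINE)


def strip_reporting_status(text: str) -> str:
    """Remove reporting_status from the YAML front matter, leaving the body alone."""
    if not text.startswith("---"):
        return text  # no front matter, nothing to do
    # Assumption: the front matter ends at the next line starting with '---'.
    end = text.find("\n---", 3)
    if end == -1:
        return text
    front, rest = text[:end + 1], text[end + 1:]
    return _STATUS_LINE.sub("", front) + rest


def strip_in_folder(folder: str) -> None:
    """Rewrite only those .md files in `folder` that actually change."""
    for path in Path(folder).glob("*.md"):
        original = path.read_text(encoding="utf-8")
        updated = strip_reporting_status(original)
        if updated != original:
            path.write_text(updated, encoding="utf-8")
```

Running `strip_in_folder` on the metadata folder would then leave the status to whatever default is configured.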

brockfanning commented 3 years ago

@LucyGwilliamAdmin I agree with removing all those placeholder files, and just having examples for 1.1.1. Also I agree with removing reporting_status from the md files.

Regarding defaulting to SDMX - my only concern is that working with SDMX is harder than CSV - or at least, it requires less-well-known tools. With CSV you can use Excel or even hand-edit it, but with SDMX you would likely need an SDMX tool.

But your point is well-taken that it is harder to switch from CSV to SDMX than the other way round. I think the typical culprits here are any CSV files that have non-SDMX-compliant column names and/or values. So I'm wondering if we could focus on getting all the "plumbing" in place so that the CSV example file could start off with SDMX-compliant columns/values. For example, if the CSV file could start off like:

TIME_DETAIL	SERIES	UNIT_MEASURE	OBS_VALUE
2008	SI_POV_DAY1	PT	76

A couple of PRs that could help get us towards that are #182 and https://github.com/open-sdg/open-sdg/pull/1027. These would at least allow for SERIES and UNIT_MEASURE, so we could have something like this:

Year	SERIES	UNIT_MEASURE	Value
2008	SI_POV_DAY1	PT	76

(The next step would be to allow TIME_DETAIL instead of Year, and OBS_VALUE instead of Value. These are a bit more tricky but definitely possible.)
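As a rough sketch of that next step, the header mapping could be expressed as a simple rename. This is only an illustration of the idea, not how sdg-build or the PRs above implement it; `COLUMN_MAP` and `to_sdmx_columns` are hypothetical names.

```python
import csv
import io

# Assumed mapping from the current starter column names to SDMX concept IDs.
COLUMN_MAP = {"Year": "TIME_DETAIL", "Value": "OBS_VALUE"}


def to_sdmx_columns(csv_text: str, delimiter: str = ",") -> str:
    """Rename header columns per COLUMN_MAP; data rows pass through unchanged."""
    rows = list(csv.reader(io.StringIO(csv_text), delimiter=delimiter))
    if not rows:
        return csv_text
    # Only the header row is touched; unknown columns keep their names.
    rows[0] = [COLUMN_MAP.get(name, name) for name in rows[0]]
    out = io.StringIO()
    csv.writer(out, delimiter=delimiter, lineterminator="\n").writerows(rows)
    return out.getvalue()


example = "Year,SERIES,UNIT_MEASURE,Value\n2008,SI_POV_DAY1,PT,76\n"
print(to_sdmx_columns(example))
```

Columns that are already SDMX-compliant, such as SERIES and UNIT_MEASURE, are left as they are.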

Eventually, along with #202, we could have the data-starter ship with CSV files that are already being converted into SDMX, without requiring a DSD.

open-sdg / sdg-build

Data input considerations #200