open-sdg / sdg-build

Python package to convert SDG-related data and metadata between formats
MIT License

Dataflows and metadataflows #214

Open brockfanning opened 3 years ago

brockfanning commented 3 years ago

In SDMX there is the concept of a "dataflow" or "metadataflow", which (as I understand it) is a way to filter the output according to some constraints. We may be able to implement something like that here.

One use-case that definitely exists is in our SDMX output. Many countries may be interested in using the SDMX output to submit their data to the UNSD's database. However, this is not possible if the data uses any non-global codes/dimensions. So it would be useful to have a "dataflow" which filters the output to include only global codes/dimensions.

Ideally this filtering would be applied to the data in its internal DataFrame form, so that the feature could be used regardless of whether the output is going to be SDMX, GeoJSON, etc.
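To make the idea concrete, here is a minimal sketch of filtering once, before any output format is chosen. It uses plain lists of dicts as a stand-in for the internal DataFrame, and every name in it (`GLOBAL_CODES`, `apply_dataflow`, `to_csv`, `to_json`) is hypothetical rather than part of sdg-build:

```python
import csv
import io
import json

# Hypothetical set of globally recognised codes per dimension (illustrative only).
GLOBAL_CODES = {
    "REF_AREA": {"GB"},
    "SEX": {"_T", "F", "M"},
}


def apply_dataflow(rows, allowed_codes):
    """Keep only rows whose constrained columns all use allowed codes.

    A row that lacks one of the constrained columns is also dropped.
    """
    return [
        row for row in rows
        if all(row.get(col) in codes for col, codes in allowed_codes.items())
    ]


def to_csv(rows):
    """Serialise rows as CSV text; an empty list gives an empty string."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_json(rows):
    """Serialise rows as a JSON array."""
    return json.dumps(rows)


rows = [
    {"REF_AREA": "GB", "SEX": "F", "Value": 1.5},
    {"REF_AREA": "GB-ENG", "SEX": "F", "Value": 2.5},  # sub-national, non-global code
]

# The filter runs once; every output writer then sees the same filtered rows.
filtered = apply_dataflow(rows, GLOBAL_CODES)
```

Because the filter sits in front of the writers, `to_csv(filtered)` and `to_json(filtered)` both omit the sub-national row without either writer knowing about the constraint.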

brockfanning commented 3 years ago

For data, maybe the mechanism for this could be a "skip_invalid_data" setting. This would depend on something like #20 (being worked on in #190). What I'm thinking is that, when outputting the data, if this setting is true, any row with a disaggregation/unit/series value that is not part of the data schema will be skipped.

For example, to take the case of the SDMX for global usage: The data schema would be imported from the global SDMX DSD. Then any data row that uses custom disaggregations (like sub-national REF_AREAs, etc.) will be omitted in the output.
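A rough sketch of how such a setting might behave, assuming the schema can be reduced to a mapping from column name to permitted values. The function name `output_rows`, the schema shape, and the sample codes are all assumptions for illustration, not existing sdg-build code:

```python
def output_rows(rows, schema, skip_invalid_data=False):
    """Return the rows to output.

    `schema` maps a column name to the set of values the data schema permits.
    Columns that are not in the schema are not checked. With
    skip_invalid_data=False, all rows are returned unchanged.
    """
    if not skip_invalid_data:
        return list(rows)
    kept = []
    for row in rows:
        valid = all(
            row[col] in allowed
            for col, allowed in schema.items()
            if col in row
        )
        if valid:
            kept.append(row)
    return kept


# Hypothetical schema, as if it had been imported from a global DSD.
global_schema = {
    "REF_AREA": {"UG"},
    "UNIT_MEASURE": {"PT"},
}

data = [
    {"REF_AREA": "UG", "UNIT_MEASURE": "PT", "Value": 10},
    {"REF_AREA": "UG-CENTRAL", "UNIT_MEASURE": "PT", "Value": 12},  # custom sub-national area
]

everything = output_rows(data, global_schema)
global_only = output_rows(data, global_schema, skip_invalid_data=True)
```

With the setting off, both rows are output; with it on, only the row using the national `REF_AREA` code survives.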

LucyGwilliamAdmin commented 3 years ago

@brockfanning is this done/partly done? sounds familiar

brockfanning commented 3 years ago

@LucyGwilliamAdmin Partly, I'd say.

What I describe in the example above we definitely already have - with the "constrain_data" and "constrain_metadata" parameters.

We also have the "global_content_constraints" parameter, which similarly drops rows of data that don't comply with the global content constraints (for example, that certain series may only be reported for females).

A couple of things, I think, still need to be done, regarding that "global_content_constraints" parameter:

  1. Abdulla has pointed out (rightly) that it should not silently drop the rows. Instead it should actually fail and abort the build. Countries should actually fix these issues rather than just skipping the data.
  2. Right now this behavior is informed by a hardcoded CSV file. Eventually the SDMX working group plans to put these constraints into the "dataflow". Maybe, whenever this happens, we can revisit and change our code to use that dataflow instead of the CSV file.
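As a sketch of the first point, this is roughly what "fail instead of silently dropping" could look like. The CSV layout, the `SERIES` column name, and all function and class names here are hypothetical; the real constraints file and code in sdg-build may be structured quite differently:

```python
import csv
import io

# Hypothetical constraints file: each record says that a series only allows
# the listed value in the given column. The real CSV layout may differ.
CONSTRAINTS_CSV = """series,column,allowed_value
SERIES_A,SEX,F
"""


class ContentConstraintError(Exception):
    """Raised when data rows violate a content constraint."""


def load_constraints(text):
    """Parse the CSV into {(series, column): set of allowed values}."""
    constraints = {}
    for record in csv.DictReader(io.StringIO(text)):
        key = (record["series"], record["column"])
        constraints.setdefault(key, set()).add(record["allowed_value"])
    return constraints


def find_violations(rows, constraints):
    """Return (row index, column, value) for every row breaking a constraint."""
    violations = []
    for index, row in enumerate(rows):
        for (series, column), allowed in constraints.items():
            if row.get("SERIES") == series and row.get(column) not in allowed:
                violations.append((index, column, row.get(column)))
    return violations


def check_rows(rows, constraints):
    """Abort with an error listing all violations instead of dropping rows."""
    violations = find_violations(rows, constraints)
    if violations:
        details = "; ".join(f"row {i}: {col}={val!r}" for i, col, val in violations)
        raise ContentConstraintError(f"Content constraint violations: {details}")
    return rows
```

Collecting every violation before raising means a country would see the full list of rows to fix in one build, rather than one failure at a time. If the constraints later move into the dataflow, only `load_constraints` would need to change in this sketch.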

Thoughts?

LucyGwilliamAdmin commented 3 years ago

@brockfanning thanks, that makes sense

  1. Yeah, I think it does make sense that rows aren't silently dropped, but I'm also wondering what should happen if a country wants to disseminate additional information that doesn't comply with the DSD? I know that's what we'll want to do in the UK.
  2. I haven't learnt too much about SDMX dataflows, but I think that would make sense, as it means we wouldn't have to maintain a CSV.