open-sdg / sdg-build

Python package to convert SDG-related data and metadata between formats
MIT License
5 stars 23 forks source link

Expanded input alterations #241

Closed brockfanning closed 2 years ago

brockfanning commented 3 years ago

This includes the following changes:

  1. The indicator ID can be altered with a function, just like data and metadata
  2. The indicator name can be altered with a function, just like data and metadata
  3. All alter functions will be passed an optional "context" dict that contains the indicator_id, indicator_name, data, meta

Existing data/meta alter functions should still work. Examples and more details are in the updated README in this PR.

Testing note

In particular I would be interested in testing the indicator ID alteration, to make sure it doesn't cause any issues. The other stuff is important to test also though.

LucyGwilliamAdmin commented 3 years ago

@brockfanning I'm trying to alter metadata based on data but I'm actually not too sure how this new functionality works?

brockfanning commented 3 years ago

@LucyGwilliamAdmin I added some examples in the README: https://github.com/open-sdg/sdg-build/blob/33c54439e85bbfe0ed26e7d7725d1bb1b57d0b60/README.md#alterations-of-data-metadata-indicator-id-and-indicator-name

LucyGwilliamAdmin commented 3 years ago

@brockfanning thanks - does that code go in the check and build data python files?

LucyGwilliamAdmin commented 3 years ago

Does this sound about right for setting reporting status to complete if there's data?

def alter_meta(meta, context):
    if context[data].empty == False:
        meta['reporting_status']='complete'
    return meta

my_input.add_meta_alteration(alter_meta)

Also what is my_input?

brockfanning commented 3 years ago

Ah sorry, my_input is in reference to using the inputs directly, rather than through the data_config.yml approach. You can ignore that. All that is changing is that you can now use a second parameter in your existing alter functions: context.

Your snippet looks pretty good, except there should be quotes around data, like: context['data']

LucyGwilliamAdmin commented 3 years ago

@brockfanning thank you - what is data if it is empty? I'm getting this error when I try to check if it's an empty df: https://github.com/open-sdg-nepal/data/runs/2555712994?check_suite_focus=true#step:5:14

EDIT: Think I've got it if not context['data'] is None:

brockfanning commented 3 years ago

Cool if that works - I've always put the "not" in a different place: if context['data'] is not None. You can also check for both: if context['data'] is not None and context['data'].empty == False

LucyGwilliamAdmin commented 3 years ago

@brockfanning here are my changes but it doesn't seemed to have work: https://github.com/open-sdg-nepal/data/pull/33/files

brockfanning commented 3 years ago

@LucyGwilliamAdmin I think it is a problem with the fact that there is an indicator config input that has reporting_status set to notstarted. A quick fix might be to move the indicator config input to be first, before the SDMX input. Longer term we could think more about how these interact, because I agree that it's unexpected. (You would think that even when the indicator config is being altered, the data would still be there.)

LucyGwilliamAdmin commented 3 years ago

Ah yes, moving it fixed the issue - thank you :)

brockfanning commented 3 years ago

Thinking a bit more on it, it will be tough/impossible to make each input have access to all the other inputs' stuff. So I think the best solution is to make clear in our documentation that the alter functions operate on each input separately.

LucyGwilliamAdmin commented 3 years ago

@brockfanning is context viewable anywhere?

LucyGwilliamAdmin commented 3 years ago

@brockfanning how do I call my_indicator_id_alteration function in build_data.py?

Currently I have open_sdg_build(config='config_data.yml', alter_meta=alter_meta, add_indicator_id_alteration=my_indicator_id_alteration) but that doesn't seem to work

brockfanning commented 3 years ago

@LucyGwilliamAdmin Try alter_indicator_id=my_indicator_id_alteration. Similarly, the parameter for altering names is alter_indicator_name.

To see what's in context you might put print(context) in the function and then look at the output during the build.

LucyGwilliamAdmin commented 3 years ago

@brockfanning I'm not too sure what's happening or what's expected to happen

I'm trying to alter indicator_id to '1-11-1' when indicator_name is 'Proportion of population below the international poverty line, by sex, age, employment status and geographical location (urban/rural)'

Then I'm trying to alter indicator_name with indicator_id '1-11-1' to 'Indicator name alteration test'

See here: https://github.com/LucyGwilliamAdmin/sdg-data-sandpit/blob/develop/scripts/build_data.py#L36

The checks seem to be passing but I'm not finding '1-11-1' or 'Indicator name alteration test' anywhere

LucyGwilliamAdmin commented 3 years ago

@brockfanning actually, are context['indicator_name'] and meta['indicator_name'] not the same?

brockfanning commented 3 years ago

@LucyGwilliamAdmin The indicator_name setting in the metadata/config is indeed where Open SDG gets the indicator name from. Give this a try - instead of this:

if context['indicator_name'] == 'Proportion of population below the international poverty line, by sex, age, employment status and geographical location (urban/rural)':

try this:

if context['indicator_name'] == 'global_indicators.1-1-1-title':

LucyGwilliamAdmin commented 3 years ago

@brockfanning making that change has changed 1-1-1 to 1-11-1 but the name doesn't seemed to have changed in the built data:

https://lucygwilliamadmin.github.io/sdg-data-sandpit/en/meta/all.json

It looks like it's changed when I print context though: https://github.com/LucyGwilliamAdmin/sdg-data-sandpit/runs/2656864657?check_suite_focus=true#step:5:3724 image

LucyGwilliamAdmin commented 3 years ago

Also should the data also have moved to 1.11.1 - I can see the data alteration happens after the id and name change but don't know how else to move it? Plus the metadata has moved

https://lucygwilliamadmin.github.io/sdg-indicators/1-11-1/

brockfanning commented 3 years ago

@LucyGwilliamAdmin I think this is highlighting the difficulty caused by the fact that each input is "altered" separately.

For example, your id alteration function depends on the indicator name. This works fine during the metadata input, because the indicator name is right there in the metadata. But during the data input, sdg-build has no idea what the indicator name is. All it has is a CSV file. So during the alteration of the data input, context['indicator_name'] will be None.

Put another way - sdg-build operates by taking all of the inputs and then combining them according to their indicator id. But these alteration functions happen before the inputs have been combined. So, though you might expect that the alteration function is happening one time, on the complete indicator - it is actually happening multiple times - once per input, and each time on an incomplete indicator.

An obvious question might be: can't we change it so that the alteration only happens once, after all the inputs have been merged into complete indicators? That would definitely make it behave more intuitively. I think this might be feasible, but could be an impactful change. I'll give it a shot so you can try it out.

brockfanning commented 3 years ago

@LucyGwilliamAdmin Actually it would be a bit easier to try that out after #242 is merged. Otherwise we'll end up with some duplicate code, and #242 introduces that "helper" setup which is a good way to share code. Do you think we could focus on that PR first and then come back to this? I'll set this to draft.

brockfanning commented 2 years ago

@LucyGwilliamAdmin Since this one turned out to be more complicated and less useful than expected, I was thinking we could split it up into smaller parts. I've just opened #285 which contains a portion of this. I'll close this in the meantime, and maybe we can add other portions of this in the future if needed.