semio / ddf_utils

Utilities for working with DDF datasets
https://open-numbers.github.io/
MIT License
3 stars 1 forks source link

make some procedures easier to apply for all or subset of indicators in an ingredient #66

Closed semio closed 5 years ago

semio commented 7 years ago

currently some of the procedures can't easily apply for all indicators in an ingredient, which means we have to specify which indicators to run. These procedures are groupby, run_op, window, filter_row and copy .

It would be good if we can add options to apply on all indicators or easily select a subset of all indicators. To add this feature, I suggests we use pattern matching and templates in these procedures:

  1. groupby and window:
procedure: groupby
ingredients:
    - ingredient_to_run
options:
    groupby:
        - country
        - time
    aggregate: sum    # if it's a string, the method will apply to all indicator
    aggregate:     # or if it's an object
        "population_.*" : sum # for all indicators matching "population_.*", will aggregate with sum
result: ingredient_result
procedure: window
ingredients:
    - ingredient_to_run
options:
    window:
        column: time
        size: 10 
        aggregate: sum    # if it's a string, the method will apply to all indicator
        aggregate:     # or if it's an object
            "population_.*" : sum # for all indicators matching "population_.*", will calculate the rolling sum
result: ingredient_result
  1. filter_row and copy
procedure: filter_row
ingredients:
    - ingredient_to_run
options:
    dictionary:
        "{}_filtered":   # the key is a string template, {} will be replaced by the indicator name
              from:  "populatioin_.*"  # will do this filter for all indicators matching "population_.*"
              varient: "estimates" 
result: ingredient_result 
procedure: copy
ingredients:
    - ingredient_to_run
options:
    dictionary:
       # the key is a string template, {} will be replaced by the indicator name, value is a regex
        "{}_copied":   "populatioin_.*"  # will make copies for all indicators matching "population_.*"
result: ingredient_result 
  1. run_op

run_op is a little bit tricky, because calculations can involve many indicators. For example, we calculate new indicator sex_ratio_0_4 with 2 existing indicators:

procedure: run_op
ingredients:
    - ingredient_to_run
options:
    op:
        sex_ratio_0_4: population_0_4_male / population_0_4_female   
result: ingredient_result

so if we want to also make sex_ratio_5_9 and so on.. it's not easy to tell the program how find the correct indicators... but for now we might add a feature to run simple calculation for all indicators:

procedure: run_op
ingredients:
    - ingredient_to_run
options:
    op:
        "*": {} * 1000  # multiple all indicators by 1000   
result: ingredient_result
jheeffer commented 7 years ago

Interesting ideas! Well put together Semio :)

For groupby/window:

I would leave out the option for a string, this can already be done by using * in the object. Also I'm not sure what the use of the . was in "population_.*" : sum

procedure: groupby
ingredients:
    - ingredient_to_run
options:
    groupby:
        - country
        - time
    aggregate:
        "*" : sum # for all indicators aggregate with sum
result: ingredient_result

Would it make sense to switch the aggregation function and concepts and/or create dictionaries for each aggregation function. This seems more consistent with the other procedures (dictionary contains old columns as values, new values as keys) and allows use to have arrays as values to apply it to multiple columns that are not easily identifiable with wildcards. We can reuse the string template you mentioned for filter_row here. Possibly we could make the references to the concept at hand more explicit by using {concept} instead of {}? Would that causes troubles when dealing with the concepts table?

procedure: groupby
ingredients:
    - ingredient_to_run
options:
    groupby:
        - country
        - time
    aggregate:
        sum: 
           "{concept}_sum": "population" or "*" or ["population", "income"] or ["population_*", "income"] # for all indicators aggregate with sum
        avg: 
           "average_income": "income"
result: ingredient_result

For filter_row:

Maybe we could make it even more efficient by using your newly introduced string template with a concept which we're filtering with?

procedure: filter_row
ingredients:
    - ingredient_to_run
options:
    dictionary:
        "{concept}_{variant}":   # the key is a string template, {} will be replaced by the indicator name
              from:  "population"  # will do this filter for all indicators matching "population_.*"
              variant: "estimates" or "*" or ["estimates", "medium"] 
result: ingredient_result 

For copy:

In the docs it is old_name: new_name mapping, here you seem to have turned them around to new_name: old_name? https://ddf-utils.readthedocs.io/en/latest/recipe.html#copy This seems more consistent with the other procedure configs though. On another note... where do we use copy? Is it a useful procedure? Don't all other procedures already give the option to not overwrite but write to another concept in the same ingredient? Wouldn't that make copy superfluous?

for run_op

Yes, agree. You describe a hard case where you'd have to have some kind of pattern matching or set-matching or whatnot. Let's keep it easy for now.

semio commented 7 years ago

Thanks for the good suggestions! Let me answer some questions above:

I think your suggestions on groupby/filter_row are good to go. I will implement them soon :)

jheeffer commented 7 years ago

Right, would you want to support the full power (and complexity) of regex? Might be overkill? Are there performance restrictions for regex vs simple wildcards? How do we distinct regex from regular strings?

semio commented 7 years ago

I just tested both regex module and wildcard module in python standard library to filter a list of 100000 strings, wildcard module took 65.8 ms and regex module took 117 ms to finish (if we compile the regex first, it's 55.3 ms, even faster than wildcard) So both are very fast but wildcard module is a little quicker.

regular expression is used in datapackage-pipeline, but I think we aim to provide tools for non-technical people, wildcard should be a better choice. Is it OK?

we don't need to distinct regex(string with wildcard) from regular strings, we can just treat them as regex(string with wildcard). So 'population' will match only 'population', 'population*' will match all strings begin with 'population'.

jheeffer commented 7 years ago

right, we're lucky we use concept ids only, but otherwise expressing the string population_* would need an escape character, now it's not valid anyway so we can expect when * is used, it's certainly a wildcard. What wildcards does the wildcard module support? *, + and ??

is it this one https://docs.python.org/3.6/library/fnmatch.html?

jheeffer commented 7 years ago

One more thing: in filter_row we have a dictionary which is more than a dictionary. It'd be good to be consistent about what a dictionary is and how it is defined in recipes. Wonder if there's more of these inconsistencies (just like the new_name: old_name thing)

harpalshergill commented 7 years ago

Can we also have template functionality at the procedure level as well? It would be nice to have feature when dealing with same dimensions over various indicators. For Example:

for indicator in indicators:
              - procedure: filter_row
                ingredients:
                    - unpop-datapoints-{indicator}-by-country
                options:
                    dictionary:
                        for variantval in variants:
                          {indicator}_{variantval}:
                            from: {indicator}
                            variant: {variantval}
                result: unpop-datapoints-{indicator}-by-country
jheeffer commented 7 years ago

@harpalshergill First of all, we don't have for loops in recipes, recipes are meant to be declarative, for loops and other control structures are more procedural/imperative of nature. (in short: Declarative: say what you want as a result but not how. Chef knows how. imperative: say exactly how you want something to be done).

Second, not sure what you mean by "procedure level", we've only been discussing procedures above?

It seems your proposal for the dictionary above with for loops is similar to having the following without for-loops

from: *
variant: *

Another thing you seem to propose is that the string template is found in the input and output ingredient too. I guess your use case for it would be the unpop--wpp_population dataset where you want to apply the procedure to all indicators. But I think you can do the same by loading all indicators (which share the same key) in one ingredient and then applying filter_row on that ingredient.

Maybe, to understand this better, try to disconnect ingredients from the csv files and folder structure in the dataset. Recipes and ingredients are an abstraction layer on top of it. When you load an ingredient with

  - id: unpop_datapoints
    dataset: ddf--unpop--wpp_population
    key: geo,year,variant
    value: population,fertility,death_rate

it contains all indicators by geo,year,variant. Chef will find out in what files/folders to find them, you don't have to worry about that. Then, you can apply filter_row on it:

procedure: filter_row
ingredients:
    - unpop_datapoints
options:
    dictionary:
        "{concept}_{variant}":   # the key is a string template, {concept} will be replaced by the indicator name
              from:  ["population", "fertility"]  # will do this filter for all indicators
              variant: ["estimates", "medium"] 
result: ingredient_result 

now ingredient_result will contain all indicators, with variants "estimates" and "medium" flattened (e.g. fertility became fertility_medium and fertility_estimates).

Question @semio, will the other variants and indicators be discarded or remain?

E.g. will the result be A. original indicators and not mentioned indicators are kept

geo year variant population population_estimate population_medium fertility fertility_estimate fertility_medium death_rate
swe 2015 medium 385 385 3 3 25
swe 2015 estimate 583 583 4 4 30
swe 2015 high 484 5 35
... ... ... ... ...

or B. original indicators are discarded, not-mentioned indicators are kept

geo year variant population_estimate population_medium fertility_estimate fertility_medium death_rate
swe 2015 medium 385 3 25
swe 2015 estimate 583 4 30
swe 2015 high 35
... ... ... ... ...

or C. not mentioned variants and indicators are discarded

geo year population_estimate population_medium fertility_estimate fertility_medium
swe 2015 583 385 4 3
... ... ... ...

or something else?

I guess C. is the result we'd want most of the time, with other variations máybe attainable through boolean flags (like you have one now). Though only when we have use cases for them.

harpalshergill commented 7 years ago
- id: unpop_datapoints
    dataset: ddf--unpop--wpp_population
    key: geo,year,variant
    value: population,fertility,death_rate

procedure: filter_row
ingredients:
    - unpop_datapoints
options:
    dictionary:
        "{concept}_{variant}":   # the key is a string template, {concept} will be replaced by the indicator name
              from:  ["population", "fertility"]  # will do this filter for all indicators
              variant: ["estimates", "medium"] 
result: ingredient_result 

@jheeffer if the case you mention above works then I am more than happy with the solution for population datasets. It does the same thing which I am asking. Also I am assuming above case would hold for "*" to select all available values for both "from" and "variant" After talking with @semio today, I was under the impression that "result:" would produce only one indicator file that's why I end up discussing template at the procedure level.

harpalshergill commented 7 years ago

also would it be possible if you have more keys say: keys: geo, year, variant, gender, age then can you still following in option->dictionary?: from: variant: age: gender:

semio commented 7 years ago

@jheeffer

in reply to https://github.com/semio/ddf_utils/issues/66#issuecomment-309734580:

yes, it's fnmatch module. From the doc, it doesn't treat the _ char as special char...

in reply to https://github.com/semio/ddf_utils/issues/66#issuecomment-309737844:

I think filter_row is only one where dictionary is more than a dictionary. I will double check and let you know if there are more.

in reply to https://github.com/semio/ddf_utils/issues/66#issuecomment-309759898:

Yes, C is good, if we want to include more indicators, we can always do another filter_row and merge the results together.

I think @harpalshergill 's suggestion for template in input and result are still useful. So now we have a way to run procedure on multiple indicators in one ingredient, but how about we want to apply same procedure to multiple ingredients?

For example, we have:

And we want to translate the country to Gapminder geo domain. For now we have to write translate_column for each of them. Could something like this a good idea:

procedure: translate_column
ingredients:
    - ingredient_by_country_year(.*)  # regex with groups, the contents inside the () will be used to format the result ids
options:
    ....
result: ingredient_by_geo_year{} 

But this need regular expr and seems too complex. Another way is to set it manually.

procedure: translate_column
ingredients:
    - ingredient_by_country_year
    - ingredient_by_country_year_gender
    - ingredient_by_country_year_age
    - ingredient_by_country_year_age_gender 
options:
    ....
result: 
    - ingredient_by_geo_year
    - ingredient_by_geo_year_gender
    - ingredient_by_geo_year_age
    - ingredient_by_geo_year_age_gender 

in reply to https://github.com/semio/ddf_utils/issues/66#issuecomment-309792251:

@harpalshergill yes we can do this, just calculate all combinations for the options and generate the indicator name according to the variable for each entity_set to filter.

jheeffer commented 7 years ago

@semio, about the multiple ingredients with wildcards: that is exactly the reason for changing the scope of procedures to the dataset. It also seems a more effective way of doing it than going with self-given entity id's and trying to match regex/wildcards to them? I have the feeling that would get quite cluttered and you could easily lose overview of what ingredients actual are affected.

jheeffer commented 7 years ago

@semio for the dictionary, it's might be easy both for modularity and consistency to have a dictionary class which is used everywhere we have a dictionary in the recipe. That way it's not possible to use a different dictionary and you'll have a nice seperate module for that functionality.

semio commented 7 years ago

@jheeffer Ok, I see we will benefit from changing to dataset scope. so.. let's add the feature for multiple indicators in one ingredient first and think about the dataset scope later

for the dictionary, I agree it's good to have consistency, so yes it should be good to make all dictionary same format in the recipe. So we might call the dictionary in filter_row something else.

for separate module for dictionary, I think you mean to validate user's input, right? We already working on a recipe schema to validate recipes, but schema only validates the format, I agree that it's good have some validation on the data passed to chef too.

semio commented 7 years ago

@jheeffer @harpalshergill I am now working the filter_row and here is a question for the new behaviour for this procedure:

  1. When the key of dictionary is not a template (doesn't contain { and }), the filtered result will be combine into one indicator:
procedure: filter_row
ingredients:
    - pop_by_age
options:
    dictionary:
        "pop_age_0_4": 
              from:  "population"
              age:  ['0', '1', '2', '3', '4']
result: ingredient_result 

then the result ingredient will have one indicator population.

  1. When the the key of dictionary is a template, the result will be many indicators:
procedure: filter_row
ingredients:
    - pop_by_age
options:
    dictionary:
        "pop_age_{age}": 
              from:  "population"
              age:  ['0', '1', '2', '3', '4']
result: ingredient_result 

the result ingredient will have 5 indicators, pop_age_0 to pop_age_4.

  1. Wildcard matching for the from value is only accepted when the key is a template:
procedure: filter_row
ingredients:
    - pop_by_age
options:
    dictionary:
        "{concept}_age_{age}": 
              from:  "population*"
              age:  ['0', '1', '2', '3', '4']
result: ingredient_result 

in this case all indicators match population* will be filtered and split up to a 5 indicators for each of them.

Does this look good to you?

jheeffer commented 7 years ago

@semio in the first case, how will the values be aggregated? It seems somewhere we imply they must be summed. Implications are not good, these things need to be explicit. What if it needs to be (weighted) averaged (e.g. not population but income). So it seems to me 1 shouldn't possible. Possible solutions:

  1. If multiple values are given for a dimension, there MUST be a template in the indicator for that dimension. Aggregation should be done through the procedures made for it, either first filter_row, then run_op or first groupby then filter_row.

  2. We add specification of an aggregation method in filter_row. So when aggregation is specified for a dimension no template allowed and vice versa.

jheeffer commented 7 years ago

@semio about dictionary: I meant module as in modular software. Cut your code up into smaller pieces, classes, modules, whatever you call them. So when recipe interpretor sees a dictionary it will always create a dictionary object from the Dictionary class, which contains all functionality a chef-dictionary needs ( build from file, build from other ingredient, translate, handle ambiguity etc). The procedures can use the dictionary object for whatever task they have. Does that make sense?

jheeffer commented 7 years ago

This made me think, what we basically do here is groupby one or more dimensions and then each group gets it's own indicator... For example:

geo year variant population fertility death_rate
swe 2015 medium 385 3 25
swe 2015 estimate 583 4 30
swe 2015 high 484 5 35
groupby: variant

Group 1

geo year variant population fertility death_rate
swe 2015 medium 385 3 25

Group 2

geo year variant population fertility death_rate
swe 2015 estimate 583 4 30

Group 3

geo year variant population fertility death_rate
swe 2015 high 484 5 35

I'm trying to read through the pandas docs if their groupby can do what we do with filter_row but if you have any ideas that'd be good too...

semio commented 7 years ago

actually I didn't imply aggregate in the first form, the rows filtered will be as is and we should do aggregate later in groupby, so in my example above, the result will still have an age column, where the values are limited in 0 to 4. To get population_0_4 indicator without the age column, we still need to run a groupby:

- procedure: filter_row
  ingredients:
      - pop_by_age_country_year
  options:
      dictionary:
          "pop_by_age_0_4": 
                from:  "population"
                age:  ['0', '1', '2', '3', '4']
  result: pop_by_age_0_4   # pop_by_age_0_4, primaryKeys are age, country, year

- procedure: groupby
  ingredients:
      - pop_by_age_0_4
  options:
      groupby:
          - country
          - year
      aggregate:
          sum:
                pop_by_age_0_4: population_0_4
  result: ingredient_result  # population_0_4, primaryKeys are country, year

but I don't like to add groupby option in filter function... it overlaps with existing groupby procedure and, we are getting harder to name the procedure :)

semio commented 7 years ago

hmm, I think, actually the key: "*" filter is not very useful, because

- procedure: filter_row
  ingredients:
      - pop_by_age_country_year
  options:
      dictionary:
          "population_{age}":
                from:  "population"
                age:  "*"
  result: pop_by_age_0_4   # pop_by_age_0_4, primaryKeys are age, country, year

the dictionary here does no filtering at all but just groupby and rename indicators. But I am not sure what would be the best way now... will think about it a bit and give you update

jheeffer commented 7 years ago

Yeah! exactly, but that's why I also mean when I say filter_row is not the right term, it's more like flattening dimensions. It's just that when you don't include every entity in the dimension, you're also filtering some out. We need the flattening now, since Vizabi doesn't support multi-dimensional data yet.

jheeffer commented 7 years ago

And yes, sorry, I was wrong about the aggregation, messed it up in my head : )

semio commented 7 years ago

right.. how about we just create a flatten procedure? Like this:

procedure: flatten
ingredients:
    - ingredient_by_age_gender_country_year
options:
    dictionary:
        "*": "{concept}_{age}_{gender}"
    flatten_dimensions:
        - age
        - gender
result: ingredient_by_country_year

Or, we can combine this into groupby, because there is no aggregate/transform/filter to do, we should add a new option:

procedure: groupby
ingredients:
    - ingredient_by_age_gender_country_year
options:
    groupby:
        - age
        - gender
    flatten:
        "*": {concept}_{age}_{gender}
result: ingredient_by_country_year

Although in this flatten dictionary in groupby, the format is not same as aggregate you suggested, which has an other level:

aggregate:
    sum:
        "*": {concept}_new_name

I like to have separate flatten procedure more, because it looks more descriptive to me.. What do you think? Do you have other ideas?

semio commented 7 years ago

so filter_row can only do filtration now, we don't need to remove columns in this procedure any more. the keep_all_column option can be removed, and key: "*" will be not supported.

procedure: filter_row
ingredients:
    - ingredient_by_age_gender_country_year
options:
    dictionary:
        "{concept}_0_3_male":  # only `concept` is accept as template variable
            from: population*
            age: [0, 1, 2, 3]
            gender: male
        "income_0_4_female":
            from: income
            age: [0, 1, 2, 3, 4]
            gender: female
result: output_ingredient
jheeffer commented 7 years ago

If it truly can only filter:

semio commented 7 years ago

hmm, right, it's still not only filtering. I thought about removing the new concepts from filter_row too. If filter_row should not create new concepts, how about groupby?

Also, it seems to add many workloads to user to achieve something we can easily done currently... the current filter_row supports running multiple filters to one indicator, like:

- procedure: filter_row
  ingredients:
      - pop_by_age_country_year
  options:
      dictionary:
          "population_age_0_4":
                from:  "population"
                age:  [0, 1, 2, 3, 4]
          "population_age_5_9":
                from: "population"
                age: [5, 6, 7, 8, 9]
           # and filter for 10-14, 15-29, 20-39, 40-59, 60-79...
           "population_age_80_plus":
                from: "population"
                age: [80plus]
  result: population_age_groups

if we don't let user create new concepts, then we need to run many more procedures:

- procedure: filter_row
  ingredients:
      - pop_by_age_country_year
  options:
      dictionary:
          population:
               age: [0, 1, 2, 3, 4]
  result: population_age_0_4
- procedure: translate_header
  ingredients:
      - population_age_0_4
  options:
      dictionary:
          population: population_age_0_4
  result: population_age_0_4_translated
# and 10 more procedures to get population_age_10_14 to population_60_79
- procedure: filter_row
  ingredients:
      - pop_by_age_country_year
  options:
      dictionary:
          population:
               age: [80plus]
  result: population_age_80plus
- procedure: translate_header
  ingredients:
      - population_age_80plus
  options:
      dictionary:
          population: population_age_80plus
  result: population_age_0_4_translated

Hmm, to make this easier, we might add a procedure split_datapoints to split up the data by given groups and create new concepts.

And I think this is a situation where we want to run procedure multiple times with similar options, but changing the scope to dataset doesn't help, and it's where I wish I would borrow the map and partial from functional programming to recipes. But I am not sure how to do this now..


yes, filter for groupby can do filtering, I wasn't aware of that :) But currently aggregate/transform/filter of groupby share the same format, we should change the format for filter to be like the dictionary of filter_row.

- procedure: groupby
  ingredients:
      - pop_by_age_country_year
  options:
      groupby:
          - age
      filter:
          population:
              age: [0, 1, 2, 3, 4]
  result: population_age_0_4

but the problem I describe above still exists here.


Summary of my ideas:

jheeffer commented 7 years ago

But to create new concepts is what flatten_dimension would do right? What does the current filter_row (or group_by-> filter) do that flatten_dimension like you proposed wouldn't be able to?

semio commented 7 years ago

the difference is that filter_row doesn't remove dimensions, and only create one new concept at a time, while flatten_dimension remove dimensions and create multiple new concepts at a time.

so if the input is like this:

country year age population variant
swe 2010 0 5000 estimates
swe 2010 1 6000 estimates
swe 2010 2 7000 estimates
swe 2010 3 8000 estimates
swe 2010 4 10000 estimates
swe 2010 0 5001 medium
swe 2010 1 6001 medium
swe 2010 2 7001 medium
swe 2010 3 8001 medium
swe 2010 4 10001 medium

filter_row for age = [0, 1, 2], new concept name = "{concept}_0_2" returns

country year age population_0_2 variant
swe 2010 0 5000 estimates
swe 2010 1 6000 estimates
swe 2010 2 7000 estimates
swe 2010 0 5001 medium
swe 2010 1 6001 medium
swe 2010 2 7001 medium

and flatten for dimension = age, new concept name = "{concept}_{age}" returns:

country year variant population_0 population_1 population_2 population_3 population_4
swe 2010 estimates 5000 6000 7000 8000 10000
swe 2010 medium 5001 6001 7001 8001 10001

but flatten for age is not useful to us. What we want to do is to run flatten with dimension = variant, which returns

country year age population_estimates population_medium
swe 2010 0 5000 5001
swe 2010 1 6000 6001
swe 2010 2 7000 7001
swe 2010 3 8000 8001
swe 2010 4 10000 10001
jheeffer commented 7 years ago

But wouldn't flatten for dimension = age, new concept name = population_0_4 give

country year    population_0_4
swe 2010    5000
swe 2010    6000
swe 2010    7000
swe 2010    8000

not sure what this the result of filter_row would be useful for? (below) If really needed, we can add flags to trigger the behaviour to flatten, which we have now in filter_row I believe? I think in essence, we'd rename filter_row to flatten and have a new procedure filter_row or groupby->filter which does only filtering of rows.

country year    age population_0_4
swe 2010    0   5000
swe 2010    1   6000
swe 2010    2   7000
swe 2010    3   8000
semio commented 7 years ago

OK.. I see your point, but there are duplicated keys in the first result

country year    population_0_4
swe 2010    5000
swe 2010    6000
swe 2010    7000
swe 2010    8000

not sure if it would be good?

the filtered result is useful when we want to run calculation on a part of the data. so in the above case, the result is used in a groupby procedure to calculate population_aged_0_3:

country year    population_0_3
swe 2010    26000                # sum([5000, 6000, 7000, 8000])
jheeffer commented 7 years ago

Oh shit, sorry, I didn't think it through, we shouldn't have duplicated keys indeed. Every procedure's result should be a valid table/ingredient

semio commented 7 years ago

Okay, @jheeffer @harpalshergill I made following changes:

jheeffer commented 7 years ago

Sorry yes you're right. Got confused!On Jun 22, 2017 12:13, Semio Zheng notifications@github.com wrote:actually I didn't imply aggregate in the first form, the rows filtered will be as is and we should do aggregate later in groupby, so in my example above, the result will still have an age column, where the values are limited in 0 to 4. To get population_0_4 indicator without the age column, we still need to run a groupby:

—You are receiving this because you were mentioned.Reply to this email directly, view it on GitHub, or mute the thread.