Closed semio closed 5 years ago
Interesting ideas! Well put together Semio :)
I would leave out the option for a string, this can already be done by using *
in the object. Also I'm not sure what the use of the .
was in "population_.*" : sum
procedure: groupby
ingredients:
- ingredient_to_run
options:
groupby:
- country
- time
aggregate:
"*" : sum # for all indicators aggregate with sum
result: ingredient_result
Would it make sense to switch the aggregation function and concepts and/or create dictionaries for each aggregation function. This seems more consistent with the other procedures (dictionary contains old columns as values, new values as keys) and allows use to have arrays as values to apply it to multiple columns that are not easily identifiable with wildcards.
We can reuse the string template you mentioned for filter_row
here. Possibly we could make the references to the concept at hand more explicit by using {concept}
instead of {}
? Would that causes troubles when dealing with the concepts table?
procedure: groupby
ingredients:
- ingredient_to_run
options:
groupby:
- country
- time
aggregate:
sum:
"{concept}_sum": "population" or "*" or ["population", "income"] or ["population_*", "income"] # for all indicators aggregate with sum
avg:
"average_income": "income"
result: ingredient_result
Maybe we could make it even more efficient by using your newly introduced string template with a concept which we're filtering with?
procedure: filter_row
ingredients:
- ingredient_to_run
options:
dictionary:
"{concept}_{variant}": # the key is a string template, {} will be replaced by the indicator name
from: "population" # will do this filter for all indicators matching "population_.*"
variant: "estimates" or "*" or ["estimates", "medium"]
result: ingredient_result
In the docs it is old_name: new_name
mapping, here you seem to have turned them around to new_name: old_name
?
https://ddf-utils.readthedocs.io/en/latest/recipe.html#copy
This seems more consistent with the other procedure configs though.
On another note... where do we use copy
? Is it a useful procedure? Don't all other procedures already give the option to not overwrite but write to another concept in the same ingredient? Wouldn't that make copy
superfluous?
Yes, agree. You describe a hard case where you'd have to have some kind of pattern matching or set-matching or whatnot. Let's keep it easy for now.
Thanks for the good suggestions! Let me answer some questions above:
.*
is regular expression and *
is wildcard character , they are both for pattern matching, you can think of regular expression an advance version of wildcard character. But they are not the same thing so we can not support both in recipe. But I think either one will work for recipe.copy
is superfluous, we can always do what copy
do with other procedures. And we are not using it in any recipes. let's remove itI think your suggestions on groupby/filter_row are good to go. I will implement them soon :)
Right, would you want to support the full power (and complexity) of regex? Might be overkill? Are there performance restrictions for regex vs simple wildcards? How do we distinct regex from regular strings?
I just tested both regex module and wildcard module in python standard library to filter a list of 100000 strings, wildcard module took 65.8 ms and regex module took 117 ms to finish (if we compile the regex first, it's 55.3 ms, even faster than wildcard) So both are very fast but wildcard module is a little quicker.
regular expression is used in datapackage-pipeline, but I think we aim to provide tools for non-technical people, wildcard should be a better choice. Is it OK?
we don't need to distinct regex(string with wildcard) from regular strings, we can just treat them as regex(string with wildcard). So 'population' will match only 'population', 'population*' will match all strings begin with 'population'.
right, we're lucky we use concept ids only, but otherwise expressing the string population_*
would need an escape character, now it's not valid anyway so we can expect when *
is used, it's certainly a wildcard. What wildcards does the wildcard module support? *
, +
and ?
?
is it this one https://docs.python.org/3.6/library/fnmatch.html?
One more thing: in filter_row
we have a dictionary which is more than a dictionary. It'd be good to be consistent about what a dictionary is and how it is defined in recipes. Wonder if there's more of these inconsistencies (just like the new_name: old_name
thing)
Can we also have template functionality at the procedure level as well? It would be nice to have feature when dealing with same dimensions over various indicators. For Example:
for indicator in indicators:
- procedure: filter_row
ingredients:
- unpop-datapoints-{indicator}-by-country
options:
dictionary:
for variantval in variants:
{indicator}_{variantval}:
from: {indicator}
variant: {variantval}
result: unpop-datapoints-{indicator}-by-country
@harpalshergill First of all, we don't have for loops in recipes, recipes are meant to be declarative, for loops and other control structures are more procedural/imperative of nature. (in short: Declarative: say what you want as a result but not how. Chef knows how. imperative: say exactly how you want something to be done).
Second, not sure what you mean by "procedure level", we've only been discussing procedures above?
It seems your proposal for the dictionary above with for loops is similar to having the following without for-loops
from: *
variant: *
Another thing you seem to propose is that the string template is found in the input and output ingredient too.
I guess your use case for it would be the unpop--wpp_population
dataset where you want to apply the procedure to all indicators. But I think you can do the same by loading all indicators (which share the same key) in one ingredient and then applying filter_row
on that ingredient.
Maybe, to understand this better, try to disconnect ingredients from the csv files and folder structure in the dataset. Recipes and ingredients are an abstraction layer on top of it. When you load an ingredient with
- id: unpop_datapoints
dataset: ddf--unpop--wpp_population
key: geo,year,variant
value: population,fertility,death_rate
it contains all indicators by geo,year,variant. Chef will find out in what files/folders to find them, you don't have to worry about that. Then, you can apply filter_row on it:
procedure: filter_row
ingredients:
- unpop_datapoints
options:
dictionary:
"{concept}_{variant}": # the key is a string template, {concept} will be replaced by the indicator name
from: ["population", "fertility"] # will do this filter for all indicators
variant: ["estimates", "medium"]
result: ingredient_result
now ingredient_result will contain all indicators, with variants "estimates" and "medium" flattened (e.g. fertility
became fertility_medium
and fertility_estimates
).
Question @semio, will the other variants and indicators be discarded or remain?
E.g. will the result be A. original indicators and not mentioned indicators are kept
geo | year | variant | population | population_estimate | population_medium | fertility | fertility_estimate | fertility_medium | death_rate |
---|---|---|---|---|---|---|---|---|---|
swe | 2015 | medium | 385 | 385 | 3 | 3 | 25 | ||
swe | 2015 | estimate | 583 | 583 | 4 | 4 | 30 | ||
swe | 2015 | high | 484 | 5 | 35 | ||||
... | ... | ... | ... | ... |
or B. original indicators are discarded, not-mentioned indicators are kept
geo | year | variant | population_estimate | population_medium | fertility_estimate | fertility_medium | death_rate |
---|---|---|---|---|---|---|---|
swe | 2015 | medium | 385 | 3 | 25 | ||
swe | 2015 | estimate | 583 | 4 | 30 | ||
swe | 2015 | high | 35 | ||||
... | ... | ... | ... | ... |
or C. not mentioned variants and indicators are discarded
geo | year | population_estimate | population_medium | fertility_estimate | fertility_medium |
---|---|---|---|---|---|
swe | 2015 | 583 | 385 | 4 | 3 |
... | ... | ... | ... |
or something else?
I guess C. is the result we'd want most of the time, with other variations máybe attainable through boolean flags (like you have one now). Though only when we have use cases for them.
- id: unpop_datapoints
dataset: ddf--unpop--wpp_population
key: geo,year,variant
value: population,fertility,death_rate
procedure: filter_row
ingredients:
- unpop_datapoints
options:
dictionary:
"{concept}_{variant}": # the key is a string template, {concept} will be replaced by the indicator name
from: ["population", "fertility"] # will do this filter for all indicators
variant: ["estimates", "medium"]
result: ingredient_result
@jheeffer if the case you mention above works then I am more than happy with the solution for population datasets. It does the same thing which I am asking. Also I am assuming above case would hold for "*" to select all available values for both "from" and "variant" After talking with @semio today, I was under the impression that "result:" would produce only one indicator file that's why I end up discussing template at the procedure level.
also would it be possible if you have more keys say: keys: geo, year, variant, gender, age then can you still following in option->dictionary?: from: variant: age: gender:
@jheeffer
in reply to https://github.com/semio/ddf_utils/issues/66#issuecomment-309734580:
yes, it's fnmatch module. From the doc, it doesn't treat the _
char as special char...
in reply to https://github.com/semio/ddf_utils/issues/66#issuecomment-309737844:
I think filter_row
is only one where dictionary is more than a dictionary. I will double check and let you know if there are more.
in reply to https://github.com/semio/ddf_utils/issues/66#issuecomment-309759898:
Yes, C is good, if we want to include more indicators, we can always do another filter_row
and merge
the results together.
I think @harpalshergill 's suggestion for template in input and result are still useful. So now we have a way to run procedure on multiple indicators in one ingredient, but how about we want to apply same procedure to multiple ingredients?
For example, we have:
And we want to translate the country
to Gapminder geo domain. For now we have to write translate_column
for each of them. Could something like this a good idea:
procedure: translate_column
ingredients:
- ingredient_by_country_year(.*) # regex with groups, the contents inside the () will be used to format the result ids
options:
....
result: ingredient_by_geo_year{}
But this need regular expr and seems too complex. Another way is to set it manually.
procedure: translate_column
ingredients:
- ingredient_by_country_year
- ingredient_by_country_year_gender
- ingredient_by_country_year_age
- ingredient_by_country_year_age_gender
options:
....
result:
- ingredient_by_geo_year
- ingredient_by_geo_year_gender
- ingredient_by_geo_year_age
- ingredient_by_geo_year_age_gender
in reply to https://github.com/semio/ddf_utils/issues/66#issuecomment-309792251:
@harpalshergill yes we can do this, just calculate all combinations for the options and generate the indicator name according to the variable for each entity_set to filter.
@semio, about the multiple ingredients with wildcards: that is exactly the reason for changing the scope of procedures to the dataset. It also seems a more effective way of doing it than going with self-given entity id's and trying to match regex/wildcards to them? I have the feeling that would get quite cluttered and you could easily lose overview of what ingredients actual are affected.
@semio for the dictionary, it's might be easy both for modularity and consistency to have a dictionary class which is used everywhere we have a dictionary in the recipe. That way it's not possible to use a different dictionary and you'll have a nice seperate module for that functionality.
@jheeffer Ok, I see we will benefit from changing to dataset scope. so.. let's add the feature for multiple indicators in one ingredient first and think about the dataset scope later
for the dictionary, I agree it's good to have consistency, so yes it should be good to make all dictionary same format in the recipe. So we might call the dictionary
in filter_row
something else.
for separate module for dictionary, I think you mean to validate user's input, right? We already working on a recipe schema to validate recipes, but schema only validates the format, I agree that it's good have some validation on the data passed to chef too.
@jheeffer @harpalshergill I am now working the filter_row
and here is a question for the new behaviour for this procedure:
{
and }
), the filtered result will be combine into one indicator:procedure: filter_row
ingredients:
- pop_by_age
options:
dictionary:
"pop_age_0_4":
from: "population"
age: ['0', '1', '2', '3', '4']
result: ingredient_result
then the result ingredient will have one indicator population
.
procedure: filter_row
ingredients:
- pop_by_age
options:
dictionary:
"pop_age_{age}":
from: "population"
age: ['0', '1', '2', '3', '4']
result: ingredient_result
the result ingredient will have 5 indicators, pop_age_0
to pop_age_4
.
from
value is only accepted when the key is a template:procedure: filter_row
ingredients:
- pop_by_age
options:
dictionary:
"{concept}_age_{age}":
from: "population*"
age: ['0', '1', '2', '3', '4']
result: ingredient_result
in this case all indicators match population*
will be filtered and split up to a 5 indicators for each of them.
Does this look good to you?
@semio in the first case, how will the values be aggregated? It seems somewhere we imply they must be summed. Implications are not good, these things need to be explicit. What if it needs to be (weighted) averaged (e.g. not population but income). So it seems to me 1 shouldn't possible. Possible solutions:
If multiple values are given for a dimension, there MUST be a template in the indicator for that dimension. Aggregation should be done through the procedures made for it, either first filter_row
, then run_op
or first groupby
then filter_row
.
We add specification of an aggregation method in filter_row
. So when aggregation is specified for a dimension no template allowed and vice versa.
@semio about dictionary: I meant module as in modular software. Cut your code up into smaller pieces, classes, modules, whatever you call them. So when recipe interpretor sees a dictionary
it will always create a dictionary
object from the Dictionary
class, which contains all functionality a chef-dictionary needs ( build from file, build from other ingredient, translate, handle ambiguity etc). The procedures can use the dictionary object for whatever task they have.
Does that make sense?
This made me think, what we basically do here is groupby one or more dimensions and then each group gets it's own indicator... For example:
geo | year | variant | population | fertility | death_rate |
---|---|---|---|---|---|
swe | 2015 | medium | 385 | 3 | 25 |
swe | 2015 | estimate | 583 | 4 | 30 |
swe | 2015 | high | 484 | 5 | 35 |
groupby: variant
Group 1
geo | year | variant | population | fertility | death_rate |
---|---|---|---|---|---|
swe | 2015 | medium | 385 | 3 | 25 |
Group 2
geo | year | variant | population | fertility | death_rate |
---|---|---|---|---|---|
swe | 2015 | estimate | 583 | 4 | 30 |
Group 3
geo | year | variant | population | fertility | death_rate |
---|---|---|---|---|---|
swe | 2015 | high | 484 | 5 | 35 |
I'm trying to read through the pandas docs if their groupby can do what we do with filter_row
but if you have any ideas that'd be good too...
actually I didn't imply aggregate in the first form, the rows filtered will be as is and we should do aggregate later in groupby
, so in my example above, the result will still have an age
column, where the values are limited in 0 to 4. To get population_0_4
indicator without the age column, we still need to run a groupby
:
- procedure: filter_row
ingredients:
- pop_by_age_country_year
options:
dictionary:
"pop_by_age_0_4":
from: "population"
age: ['0', '1', '2', '3', '4']
result: pop_by_age_0_4 # pop_by_age_0_4, primaryKeys are age, country, year
- procedure: groupby
ingredients:
- pop_by_age_0_4
options:
groupby:
- country
- year
aggregate:
sum:
pop_by_age_0_4: population_0_4
result: ingredient_result # population_0_4, primaryKeys are country, year
but I don't like to add groupby option in filter function... it overlaps with existing groupby procedure and, we are getting harder to name the procedure :)
hmm, I think, actually the key: "*"
filter is not very useful, because
- procedure: filter_row
ingredients:
- pop_by_age_country_year
options:
dictionary:
"population_{age}":
from: "population"
age: "*"
result: pop_by_age_0_4 # pop_by_age_0_4, primaryKeys are age, country, year
the dictionary here does no filtering at all but just groupby and rename indicators. But I am not sure what would be the best way now... will think about it a bit and give you update
Yeah! exactly, but that's why I also mean when I say filter_row
is not the right term, it's more like flattening dimensions. It's just that when you don't include every entity in the dimension, you're also filtering some out.
We need the flattening now, since Vizabi doesn't support multi-dimensional data yet.
And yes, sorry, I was wrong about the aggregation, messed it up in my head : )
right.. how about we just create a flatten
procedure? Like this:
procedure: flatten
ingredients:
- ingredient_by_age_gender_country_year
options:
dictionary:
"*": "{concept}_{age}_{gender}"
flatten_dimensions:
- age
- gender
result: ingredient_by_country_year
Or, we can combine this into groupby
, because there is no aggregate/transform/filter to do, we should add a new option:
procedure: groupby
ingredients:
- ingredient_by_age_gender_country_year
options:
groupby:
- age
- gender
flatten:
"*": {concept}_{age}_{gender}
result: ingredient_by_country_year
Although in this flatten dictionary in groupby
, the format is not same as aggregate
you suggested, which has an other level:
aggregate:
sum:
"*": {concept}_new_name
I like to have separate flatten
procedure more, because it looks more descriptive to me.. What do you think? Do you have other ideas?
so filter_row
can only do filtration now, we don't need to remove columns in this procedure any more. the keep_all_column
option can be removed, and key: "*"
will be not supported.
procedure: filter_row
ingredients:
- ingredient_by_age_gender_country_year
options:
dictionary:
"{concept}_0_3_male": # only `concept` is accept as template variable
from: population*
age: [0, 1, 2, 3]
gender: male
"income_0_4_female":
from: income
age: [0, 1, 2, 3, 4]
gender: female
result: output_ingredient
If it truly can only filter:
hmm, right, it's still not only filtering. I thought about removing the new concepts from filter_row too. If filter_row
should not create new concepts, how about groupby
?
Also, it seems to add many workloads to user to achieve something we can easily done currently... the current filter_row
supports running multiple filters to one indicator, like:
- procedure: filter_row
ingredients:
- pop_by_age_country_year
options:
dictionary:
"population_age_0_4":
from: "population"
age: [0, 1, 2, 3, 4]
"population_age_5_9":
from: "population"
age: [5, 6, 7, 8, 9]
# and filter for 10-14, 15-29, 20-39, 40-59, 60-79...
"population_age_80_plus":
from: "population"
age: [80plus]
result: population_age_groups
if we don't let user create new concepts, then we need to run many more procedures:
- procedure: filter_row
ingredients:
- pop_by_age_country_year
options:
dictionary:
population:
age: [0, 1, 2, 3, 4]
result: population_age_0_4
- procedure: translate_header
ingredients:
- population_age_0_4
options:
dictionary:
population: population_age_0_4
result: population_age_0_4_translated
# and 10 more procedures to get population_age_10_14 to population_60_79
- procedure: filter_row
ingredients:
- pop_by_age_country_year
options:
dictionary:
population:
age: [80plus]
result: population_age_80plus
- procedure: translate_header
ingredients:
- population_age_80plus
options:
dictionary:
population: population_age_80plus
result: population_age_0_4_translated
Hmm, to make this easier, we might add a procedure split_datapoints
to split up the data by given groups and create new concepts.
And I think this is a situation where we want to run procedure multiple times with similar options, but changing the scope to dataset doesn't help, and it's where I wish I would borrow the map and partial from functional programming to recipes. But I am not sure how to do this now..
yes, filter for groupby can do filtering, I wasn't aware of that :) But currently aggregate
/transform
/filter
of groupby
share the same format, we should change the format for filter
to be like the dictionary
of filter_row
.
- procedure: groupby
ingredients:
- pop_by_age_country_year
options:
groupby:
- age
filter:
population:
age: [0, 1, 2, 3, 4]
result: population_age_0_4
but the problem I describe above still exists here.
Summary of my ideas:
filter_row
should not create new concepts, shouldgroupby
create new concepts?filter_row
should not create new concepts, it's not easy to write recipe when we want to run multiple filters to one indicator, maybe we should find a way to solve that.But to create new concepts is what flatten_dimension
would do right? What does the current filter_row
(or group_by-> filter) do that flatten_dimension
like you proposed wouldn't be able to?
the difference is that filter_row
doesn't remove dimensions, and only create one new concept at a time, while flatten_dimension
remove dimensions and create multiple new concepts at a time.
so if the input is like this:
country | year | age | population | variant |
---|---|---|---|---|
swe | 2010 | 0 | 5000 | estimates |
swe | 2010 | 1 | 6000 | estimates |
swe | 2010 | 2 | 7000 | estimates |
swe | 2010 | 3 | 8000 | estimates |
swe | 2010 | 4 | 10000 | estimates |
swe | 2010 | 0 | 5001 | medium |
swe | 2010 | 1 | 6001 | medium |
swe | 2010 | 2 | 7001 | medium |
swe | 2010 | 3 | 8001 | medium |
swe | 2010 | 4 | 10001 | medium |
filter_row
for age = [0, 1, 2], new concept name = "{concept}_0_2" returns
country | year | age | population_0_2 | variant |
---|---|---|---|---|
swe | 2010 | 0 | 5000 | estimates |
swe | 2010 | 1 | 6000 | estimates |
swe | 2010 | 2 | 7000 | estimates |
swe | 2010 | 0 | 5001 | medium |
swe | 2010 | 1 | 6001 | medium |
swe | 2010 | 2 | 7001 | medium |
and flatten
for dimension = age, new concept name = "{concept}_{age}" returns:
country | year | variant | population_0 | population_1 | population_2 | population_3 | population_4 |
---|---|---|---|---|---|---|---|
swe | 2010 | estimates | 5000 | 6000 | 7000 | 8000 | 10000 |
swe | 2010 | medium | 5001 | 6001 | 7001 | 8001 | 10001 |
but flatten for age is not useful to us. What we want to do is to run flatten
with dimension = variant, which returns
country | year | age | population_estimates | population_medium |
---|---|---|---|---|
swe | 2010 | 0 | 5000 | 5001 |
swe | 2010 | 1 | 6000 | 6001 |
swe | 2010 | 2 | 7000 | 7001 |
swe | 2010 | 3 | 8000 | 8001 |
swe | 2010 | 4 | 10000 | 10001 |
But wouldn't flatten
for dimension = age, new concept name = population_0_4
give
country year population_0_4
swe 2010 5000
swe 2010 6000
swe 2010 7000
swe 2010 8000
not sure what this the result of filter_row
would be useful for? (below) If really needed, we can add flags to trigger the behaviour to flatten
, which we have now in filter_row
I believe?
I think in essence, we'd rename filter_row
to flatten
and have a new procedure filter_row
or groupby->filter
which does only filtering of rows.
country year age population_0_4
swe 2010 0 5000
swe 2010 1 6000
swe 2010 2 7000
swe 2010 3 8000
OK.. I see your point, but there are duplicated keys in the first result
country year population_0_4
swe 2010 5000
swe 2010 6000
swe 2010 7000
swe 2010 8000
not sure if it would be good?
the filtered result is useful when we want to run calculation on a part of the data. so in the above case, the result is used in a groupby
procedure to calculate population_aged_0_3
:
country year population_0_3
swe 2010 26000 # sum([5000, 6000, 7000, 8000])
Oh shit, sorry, I didn't think it through, we shouldn't have duplicated keys indeed. Every procedure's result should be a valid table/ingredient
Okay, @jheeffer @harpalshergill I made following changes:
filter_row
can't add new concepts now. only filtering rows from existing names (wildcard supported) and return ingredient with same namesgroup_by
now supports wildcard in the concept names. (it also does not support creating new concepts)flatten
procedure, as in the first suggestion of https://github.com/semio/ddf_utils/issues/66#issuecomment-310553639Sorry yes you're right. Got confused!On Jun 22, 2017 12:13, Semio Zheng notifications@github.com wrote:actually I didn't imply aggregate in the first form, the rows filtered will be as is and we should do aggregate later in groupby, so in my example above, the result will still have an age column, where the values are limited in 0 to 4. To get population_0_4 indicator without the age column, we still need to run a groupby:
procedure: filter_row ingredients:
procedure: groupby ingredients:
—You are receiving this because you were mentioned.Reply to this email directly, view it on GitHub, or mute the thread.
currently some of the procedures can't easily apply for all indicators in an ingredient, which means we have to specify which indicators to run. These procedures are
groupby
,run_op
,window
,filter_row
andcopy
.It would be good if we can add options to apply on all indicators or easily select a subset of all indicators. To add this feature, I suggests we use pattern matching and templates in these procedures:
run_op is a little bit tricky, because calculations can involve many indicators. For example, we calculate new indicator sex_ratio_0_4 with 2 existing indicators:
so if we want to also make sex_ratio_5_9 and so on.. it's not easy to tell the program how find the correct indicators... but for now we might add a feature to run simple calculation for all indicators: