semio / ddf_utils

Utilities for working with DDF datasets
https://open-numbers.github.io/
MIT License
2 stars 1 forks source link

recipe procedure: filter_row #2

Closed semio closed 7 years ago

semio commented 8 years ago

Reason

Sometimes we want to use a part of a datapoints in a concept as a new concept. For example, we have total deaths caused by storm, flood, earthquake... in one concept, and we want to make a new concept call total_deaths_storm.

API

the filter_row procedure filter an ingredient based on a set of options and return the filtered result as a new ingredient.

filter_row procedure accepts following options:

{ 
    new_key_in_new_ingredient: 
    { 
        from: old_key_in_old_ingredient,
        filter_col_1: filter_val_1,
        filter_col_2: filter_val_2,
        # ...
    }
}    
semio commented 8 years ago

example:

https://github.com/semio/ddf--gapminder--systema_globalis/blob/feature/autogenerated/etl/recipes/recipe_cred.yaml

jheeffer commented 7 years ago

Aaaaah, I think I had a different idea about filter_row than is described here... I thought of it as a WHERE statement in a SQL query. That filter decides which rows from a dataframe will be in the result. It seems you're doing something else though, can you describe the functionality more, maybe with an example (also for the sake of documentation)? (Including the dropping of columns you described in #14 )

semio commented 7 years ago

Sure, let me explain the details. Take the cred dataset for example:

- procedure: filter_row
   ingredients: [cred-datapoints-aligned]
   options:
   dictionary:
     drought_affected_annual_number:
       from: affected
       disaster: drought
     earthquake_affected_annual_number:
       from: affected
       disaster: earthquake
   result: cred-datapoints-filtered

The ingredient, cred-datapoints-aligned have key: "geo,disaster,year" so the table is like this:

geo year disaster affected
usa 1990 drought 100
usa 1991 drought 200
usa 1990 earthquake 5000
usa 1991 earthquake 2000
chn 1990 drought 50
chn 1991 drought 150
chn 1990 earthquake 3000
chn 1991 earthquake 10000

What we want to get is: (and also below is what will return with the procedure)

geo year drought_affected_annual_number earthquake_affected_annual_number
usa 1990 100 5000
usa 1991 200 2000
chn 1990 50 3000
chn 1991 150 1000

So the procedure will do the following steps:

1) created 2 new columns drought_affected_annual_number and earthquake_affected_annual_number:

The result will be

geo year disaster drought_affected_annual_number earthquake_affected_annual_number
usa 1990 drought 100
usa 1991 drought 200
chn 1990 drought 50
chn 1991 drought 150
usa 1990 earthquake 5000
usa 1991 earthquake 2000
chn 1990 earthquake 3000
chn 1991 earthquake 1000

2) Consider the disaster column for each of drought_affected_annual_number and earthquake_affected_annual_number. Because they have unique values (drought for all drought_affected_annual_number datapoints), the procedure will drop this column. It won't drop them if there are multiple values in it.

A use case where there are multiple values in the column being filtered is the unpop recipe:

male_population_0_14_years:
                      from: population
                      gender: male
                      age: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14']
female_population_0_14_years:
                      from: population
                      gender: female
                      age: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14']

There are a list of values in the age column after filtered. So it won't be dropped. gender will be dropped because it only have one value in filter. The age column will be removed later in the groupby procedure.

In fact step 1 is just like WHERE in SQL, but we don't need to SELECT the dimension columns. We say from: population and gender: male in recipe, it's just like this SQL:

SELECT `geo`, `year`, `age`, `gender`, `population` FROM table
  WHERE `gender`='male' ;

Besides, for now the filter can only test equality to one value or membership to list of values. Other operators like gt, lt , like is not supported but it should be not difficult to add these to the procedure.