recipe procedure: filter_row

semio commented 8 years ago

Reason

Sometimes we want to use a part of a datapoints in a concept as a new concept. For example, we have total deaths caused by storm, flood, earthquake... in one concept, and we want to make a new concept call total_deaths_storm.

API

the filter_row procedure filter an ingredient based on a set of options and return the filtered result as a new ingredient.

filter_row procedure accepts following options:

ingredients: ingredients to filter
dictionary: a list of dictionaries, each dict have following format:

{ 
    new_key_in_new_ingredient: 
    { 
        from: old_key_in_old_ingredient,
        filter_col_1: filter_val_1,
        filter_col_2: filter_val_2,
        # ...
    }
}

semio commented 8 years ago

example:

https://github.com/semio/ddf--gapminder--systema_globalis/blob/feature/autogenerated/etl/recipes/recipe_cred.yaml

jheeffer commented 7 years ago

Aaaaah, I think I had a different idea about filter_row than is described here... I thought of it as a WHERE statement in a SQL query. That filter decides which rows from a dataframe will be in the result. It seems you're doing something else though, can you describe the functionality more, maybe with an example (also for the sake of documentation)? (Including the dropping of columns you described in #14 )

semio commented 7 years ago

Sure, let me explain the details. Take the cred dataset for example:

- procedure: filter_row
   ingredients: [cred-datapoints-aligned]
   options:
   dictionary:
     drought_affected_annual_number:
       from: affected
       disaster: drought
     earthquake_affected_annual_number:
       from: affected
       disaster: earthquake
   result: cred-datapoints-filtered

The ingredient, cred-datapoints-aligned have key: "geo,disaster,year" so the table is like this:

geo	year	disaster	affected
usa	1990	drought	100
usa	1991	drought	200
usa	1990	earthquake	5000
usa	1991	earthquake	2000
chn	1990	drought	50
chn	1991	drought	150
chn	1990	earthquake	3000
chn	1991	earthquake	10000

What we want to get is: (and also below is what will return with the procedure)

geo	year	drought_affected_annual_number	earthquake_affected_annual_number
usa	1990	100	5000
usa	1991	200	2000
chn	1990	50	3000
chn	1991	150	1000

So the procedure will do the following steps:

1) created 2 new columns drought_affected_annual_number and earthquake_affected_annual_number:

drought_affected_annual_number is filtered from affected column in ingredient, where disaster == 'drought'
earthquake_affected_annual_number is filtered from affected column in ingredient, where disaster == 'earthquake'

The result will be

geo	year	disaster	drought_affected_annual_number	earthquake_affected_annual_number
usa	1990	drought	100
usa	1991	drought	200
chn	1990	drought	50
chn	1991	drought	150
usa	1990	earthquake		5000
usa	1991	earthquake		2000
chn	1990	earthquake		3000
chn	1991	earthquake		1000

2) Consider the disaster column for each of drought_affected_annual_number and earthquake_affected_annual_number. Because they have unique values (drought for all drought_affected_annual_number datapoints), the procedure will drop this column. It won't drop them if there are multiple values in it.

A use case where there are multiple values in the column being filtered is the unpop recipe:

male_population_0_14_years:
                      from: population
                      gender: male
                      age: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14']
female_population_0_14_years:
                      from: population
                      gender: female
                      age: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14']

There are a list of values in the age column after filtered. So it won't be dropped. gender will be dropped because it only have one value in filter. The age column will be removed later in the groupby procedure.

In fact step 1 is just like WHERE in SQL, but we don't need to SELECT the dimension columns. We say from: population and gender: male in recipe, it's just like this SQL:

SELECT `geo`, `year`, `age`, `gender`, `population` FROM table
  WHERE `gender`='male' ;

Besides, for now the filter can only test equality to one value or membership to list of values. Other operators like gt, lt , like is not supported but it should be not difficult to add these to the procedure.

semio / ddf_utils

recipe procedure: filter_row #2

Reason

API