Closed semio closed 7 years ago
Aaaaah, I think I had a different idea about filter_row than is described here...
I thought of it as a WHERE
statement in a SQL query. That filter decides which rows from a dataframe will be in the result.
It seems you're doing something else though, can you describe the functionality more, maybe with an example (also for the sake of documentation)? (Including the dropping of columns you described in #14 )
Sure, let me explain the details. Take the cred dataset for example:
- procedure: filter_row
ingredients: [cred-datapoints-aligned]
options:
dictionary:
drought_affected_annual_number:
from: affected
disaster: drought
earthquake_affected_annual_number:
from: affected
disaster: earthquake
result: cred-datapoints-filtered
The ingredient, cred-datapoints-aligned
have key: "geo,disaster,year" so the table is like this:
geo | year | disaster | affected |
---|---|---|---|
usa | 1990 | drought | 100 |
usa | 1991 | drought | 200 |
usa | 1990 | earthquake | 5000 |
usa | 1991 | earthquake | 2000 |
chn | 1990 | drought | 50 |
chn | 1991 | drought | 150 |
chn | 1990 | earthquake | 3000 |
chn | 1991 | earthquake | 10000 |
What we want to get is: (and also below is what will return with the procedure)
geo | year | drought_affected_annual_number | earthquake_affected_annual_number |
---|---|---|---|
usa | 1990 | 100 | 5000 |
usa | 1991 | 200 | 2000 |
chn | 1990 | 50 | 3000 |
chn | 1991 | 150 | 1000 |
So the procedure will do the following steps:
1) created 2 new columns drought_affected_annual_number
and earthquake_affected_annual_number
:
drought_affected_annual_number
is filtered from affected
column in ingredient, where disaster == 'drought'
earthquake_affected_annual_number
is filtered from affected
column in ingredient, where disaster == 'earthquake'
The result will be
geo | year | disaster | drought_affected_annual_number | earthquake_affected_annual_number |
---|---|---|---|---|
usa | 1990 | drought | 100 | |
usa | 1991 | drought | 200 | |
chn | 1990 | drought | 50 | |
chn | 1991 | drought | 150 | |
usa | 1990 | earthquake | 5000 | |
usa | 1991 | earthquake | 2000 | |
chn | 1990 | earthquake | 3000 | |
chn | 1991 | earthquake | 1000 |
2) Consider the disaster
column for each of drought_affected_annual_number
and earthquake_affected_annual_number
. Because they have unique values (drought for all drought_affected_annual_number
datapoints), the procedure will drop this column. It won't drop them if there are multiple values in it.
A use case where there are multiple values in the column being filtered is the unpop recipe:
male_population_0_14_years:
from: population
gender: male
age: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14']
female_population_0_14_years:
from: population
gender: female
age: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14']
There are a list of values in the age
column after filtered. So it won't be dropped. gender
will be dropped because it only have one value in filter. The age
column will be removed later in the groupby
procedure.
In fact step 1 is just like WHERE
in SQL, but we don't need to SELECT
the dimension columns. We say from: population
and gender: male
in recipe, it's just like this SQL:
SELECT `geo`, `year`, `age`, `gender`, `population` FROM table
WHERE `gender`='male' ;
Besides, for now the filter can only test equality to one value or membership to list of values. Other operators like gt
, lt
, like
is not supported but it should be not difficult to add these to the procedure.
Reason
Sometimes we want to use a part of a datapoints in a concept as a new concept. For example, we have total deaths caused by storm, flood, earthquake... in one concept, and we want to make a new concept call total_deaths_storm.
API
the
filter_row
procedure filter an ingredient based on a set of options and return the filtered result as a new ingredient.filter_row procedure accepts following options:
ingredients
: ingredients to filterdictionary
: a list of dictionaries, each dict have following format: