semio opened 4 years ago
With import, you mean when writing a python script right? Not when executing a command?
With reader, are you referring to the spreadsheets of fasttrack? How is that different from current fasttrack code?
Other things sound good : )!
Another feature to think about; not sure if it's for chef or some other solution, or if it's even feasible given the current setup.

Given an indicator, retrace where it came from originally, i.e. draw the path through the dataset tree. With some of our procedures that might be quite difficult on a detailed level. Maybe it's quite doable on a high level (e.g. finding out that mcv_immunized_percent_of_one_year_olds in SG comes from the gapminder world dataset).
> With import, you mean when writing a python script right? Not when executing a command?
Right, I am thinking about writing a python script. I'd like to write scripts like this:
```python
from typing import List

import pandas as pd

from ddf_utils.model.ddf import DDF, Concept, EntityDomain, DataPoints

source_file = '../source/some_file.csv'


def extract_concepts(df) -> List[Concept]:
    # process to extract concepts...
    return concepts  # a list of Concept objects


def extract_entities(df) -> List[EntityDomain]:
    # process to extract entity domains
    return entity_domains  # a list of EntityDomain objects


def extract_datapoints(df) -> List[DataPoints]:
    # process to extract datapoints
    return datapoints  # a list of DataPoints objects


def main():
    df = pd.read_csv(source_file)
    concepts = extract_concepts(df)
    domains = extract_entities(df)
    datapoints = extract_datapoints(df)
    ddf = DDF(concepts=concepts, domains=domains, datapoints=datapoints)
    ddf.to_csv('output/dir')
```
In the extract functions, we would use functions from pandas/ddf_utils to extract data from and transform the dataframe. So it would be more like recipes, where we have processes for datapoints/concepts/entities.
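For instance, one of those extract functions could look roughly like this (building on the imports in the script above; the `Concept` constructor arguments and the source column names are assumptions for illustration, not a final ddf_utils API):

```python
def extract_concepts(df) -> List[Concept]:
    # assumed source layout: one row per concept, with `concept` and `name` columns
    concepts = []
    for _, row in df[['concept', 'name']].drop_duplicates().iterrows():
        # the Concept(...) arguments here are hypothetical
        concepts.append(Concept(id=row['concept'], props={'name': row['name']}))
    return concepts
```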
> With reader, are you referring to the spreadsheets of fasttrack? How is that different from current fasttrack code?
Yes, if we only need to support the current fasttrack format, then it won't be much different. But if we want to have multiple datasets or support different formats, making a library should help.
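Purely as an illustration of what such a library could look like (the class and method names below are made up, not an existing ddf_utils API), each source format would implement a common reader interface:

```python
from abc import ABC, abstractmethod

import pandas as pd


class SourceReader(ABC):
    """Hypothetical base class: each supported source format implements read()."""

    @abstractmethod
    def read(self, path: str) -> pd.DataFrame:
        ...


class FasttrackReader(SourceReader):
    def read(self, path: str) -> pd.DataFrame:
        # assumption: the fasttrack source is available as a csv export of the spreadsheet
        return pd.read_csv(path)


class ExcelReader(SourceReader):
    def read(self, path: str) -> pd.DataFrame:
        return pd.read_excel(path)
```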
> retrace where it came from originally
Right, I think it's not easy for some procedures. For example, with run_op we need to parse the operation strings (e.g. "co2_emissions / population * 1000") to get the 2 base indicators, and then the co2_per_capita indicator will have 2 parent datasets.
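A rough sketch of that parsing step (not existing ddf_utils code; just one way to pull the base indicators out of an operation string with the standard library):

```python
import ast


def base_indicators(op: str) -> set:
    """Return the indicator names referenced in an operation string."""
    tree = ast.parse(op, mode='eval')
    return {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}


print(base_indicators("co2_emissions / population * 1000"))
# {'co2_emissions', 'population'}
```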
I think we will have to run the recipe once to build a graph. Procedures should inspect their inputs and outputs and modify the graph. We can cache this graph somewhere in the etl/ folder to speed up later queries.
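A minimal sketch of such a lineage graph, assuming each procedure records an edge from every input ingredient to every output indicator (the use of networkx and the node naming here are assumptions, not a decided design):

```python
import networkx as nx

lineage = nx.DiGraph()


def record(procedure: str, inputs: list, outputs: list):
    # called by each procedure after it runs, so the graph mirrors the recipe
    for out in outputs:
        for inp in inputs:
            lineage.add_edge(inp, out, procedure=procedure)


record('run_op', ['co2_emissions', 'population'], ['co2_per_capita'])

# retrace where an indicator came from: all of its ancestors in the graph
print(nx.ancestors(lineage, 'co2_per_capita'))
# {'co2_emissions', 'population'}
```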
- `@procedure`: wrapper to make it easier to create custom procedures
- `ddf build`: build dataset (load new source, dependencies and run etl script) in one command
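A minimal sketch of what the `@procedure` wrapper could look like (hypothetical; the registry and signature are assumptions, not the final design):

```python
from functools import wraps

# hypothetical registry that chef could use to look up custom procedures by name
PROCEDURES = {}


def procedure(func):
    """Register a function as a chef procedure, keeping its signature intact."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    PROCEDURES[func.__name__] = wrapper
    return wrapper


@procedure
def my_custom_procedure(ingredient, factor=1000):
    # example custom transform on an ingredient's data
    return ingredient * factor
```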