semio opened 4 years ago
With import, you mean when writing a python script right? Not when executing a command?
With reader, are you referring to the spreadsheets of fasttrack? How is that different from current fasttrack code?
Other things sound good : )!
Another feature to think about; not sure if it's for chef or some other solution, or if it's even feasible given the current setup.

Given an indicator, retrace where it came from originally, i.e. draw the path through the dataset tree. With some of our procedures that might be quite difficult on a detailed level. Maybe it's quite doable on a high level (e.g. finding out that mcv_immunized_percent_of_one_year_olds in SG comes from the gapminder world dataset).
> With import, you mean when writing a python script right? Not when executing a command?
Right, I am thinking about writing a python script. I'd like to write scripts like this:
```python
from typing import List

import pandas as pd

from ddf_utils.model.ddf import DDF, Concept, EntityDomain, DataPoints

source_file = '../source/some_file.csv'


def extract_concepts(df) -> List[Concept]:
    # process to extract concepts...
    return concepts  # a list of Concept objects


def extract_entities(df) -> List[EntityDomain]:
    # process to extract entity domains
    return entity_domains  # a list of EntityDomain objects


def extract_datapoints(df) -> List[DataPoints]:
    # process to extract datapoints
    return datapoints  # a list of DataPoints objects


def main():
    df = pd.read_csv(source_file)
    concepts = extract_concepts(df)
    domains = extract_entities(df)
    datapoints = extract_datapoints(df)
    ddf = DDF(concepts=concepts, domains=domains, datapoints=datapoints)
    ddf.to_csv('output/dir')
```
In the extract functions, we would use functions from pandas/ddf_utils to extract data from and transform the dataframe. So it would be more like recipes, where we have processes for datapoints/concepts/entities.
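For instance, one of those extract functions could look roughly like this (building on the imports in the script above; the `Concept` constructor arguments and the source column names are assumptions for illustration, not a final ddf_utils API):

```python
def extract_concepts(df) -> List[Concept]:
    # assumed source layout: one row per concept, with `concept` and `name` columns
    concepts = []
    for _, row in df[['concept', 'name']].drop_duplicates().iterrows():
        # the Concept(...) arguments here are hypothetical
        concepts.append(Concept(id=row['concept'], props={'name': row['name']}))
    return concepts
```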
> With reader, are you referring to the spreadsheets of fasttrack? How is that different from current fasttrack code?
Yes, if we only need to support the current fasttrack format, then it won't be much different. But if we want to have multiple datasets or support different formats, making a library should help.
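Purely as an illustration of what such a library could look like (the class and method names below are made up, not an existing ddf_utils API), each source format would implement a common reader interface:

```python
from abc import ABC, abstractmethod

import pandas as pd


class SourceReader(ABC):
    """Hypothetical base class: each supported source format implements read()."""

    @abstractmethod
    def read(self, path: str) -> pd.DataFrame:
        ...


class FasttrackReader(SourceReader):
    def read(self, path: str) -> pd.DataFrame:
        # assumption: the fasttrack source is available as a csv export of the spreadsheet
        return pd.read_csv(path)


class ExcelReader(SourceReader):
    def read(self, path: str) -> pd.DataFrame:
        return pd.read_excel(path)
```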
> retrace where it came from originally
Right, I think it's not easy for some procedures. For example, with run_op we need to parse the operation strings (e.g. "co2_emissions / population * 1000") to get the 2 base indicators, and then the co2_per_capita indicator will have 2 parent datasets.
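A rough sketch of that parsing step (not existing ddf_utils code; just one way to pull the base indicators out of an operation string with the standard library):

```python
import ast


def base_indicators(op: str) -> set:
    """Return the indicator names referenced in an operation string."""
    tree = ast.parse(op, mode='eval')
    return {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}


print(base_indicators("co2_emissions / population * 1000"))
# {'co2_emissions', 'population'}
```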
I think we will have to run the recipe once to build a graph. Procedures should inspect their inputs and outputs and modify the graph. We can cache this graph somewhere in the etl/ folder to speed up later queries.
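A minimal sketch of such a lineage graph, assuming each procedure records an edge from every input ingredient to every output indicator (the use of networkx and the node naming here are assumptions, not a decided design):

```python
import networkx as nx

lineage = nx.DiGraph()


def record(procedure: str, inputs: list, outputs: list):
    # called by each procedure after it runs, so the graph mirrors the recipe
    for out in outputs:
        for inp in inputs:
            lineage.add_edge(inp, out, procedure=procedure)


record('run_op', ['co2_emissions', 'population'], ['co2_per_capita'])

# retrace where an indicator came from: all of its ancestors in the graph
print(nx.ancestors(lineage, 'co2_per_capita'))
# {'co2_emissions', 'population'}
```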
- `@procedure`: wrapper to make it easier to create custom procedures
- `ddf build`: build dataset (load new source, dependencies and run etl script) in one command
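A minimal sketch of what the `@procedure` wrapper could look like (hypothetical; the registry and signature are assumptions, not the final design):

```python
from functools import wraps

# hypothetical registry that chef could use to look up custom procedures by name
PROCEDURES = {}


def procedure(func):
    """Register a function as a chef procedure, keeping its signature intact."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    PROCEDURES[func.__name__] = wrapper
    return wrapper


@procedure
def my_custom_procedure(ingredient, factor=1000):
    # example custom transform on an ingredient's data
    return ingredient * factor
```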