semio / ddf_utils

Utilities for working with DDF datasets
https://open-numbers.github.io/
MIT License

new recipe format #59

semio opened 7 years ago

semio commented 7 years ago

goal for new recipe format:

  1. operate on dataset level
  2. use a streaming structure

proposal

on ingredients:

we don't need key/value anymore, so we just drop them and keep id/dataset/row_filter

on cooking:

Taking gapminder population as an example, the pipeline process is shown in this image:

We can see that the gm-pop and un-pop pipelines are independent, and their results are used in the main pipeline. So I suggest we make these pipelines objects in the cooking section. Within each pipeline the data will be streamed from one procedure to the next, and the final result will be available as an ingredient dataset with the same name as the pipeline. If we want to reference the current pipeline itself, we use the this ingredient.

Furthermore, we can point a pipeline object to a recipe file, which means running that recipe file to get the pipeline result.

on serve:

The result of the last procedure of main_pipeline should be the final dataset, and we can use the serve procedure to set options for the output format. But serve should be optional.

Example (Note: the usage of the procedures is not fully discussed yet)

info:
    # some metadata
    dataset: my-demo-dataset
    author: semio <prairy.long@gmail.com>
    license: MIT

config:
    # configurations
    datasets_dir: ./

ingredients:
    - id: gm-pop
      path: ddf--gapminder--population_historic
    - id: gm-geo
      path: ddf--gapminder--geo_entity_domain
    - id: un-pop
      path: ddf--unpop--wpp_population

cooking:
    unpop_pipeline:
        - procedure: run_op
          options:
              ingredient: un-pop
              op:
                  population: population * 1000
        # align all geo to gapminder's geo domain
        - procedure: translate_value
          options:
              ingredient: this
              concept: country_code
              dictionary: ...  # skip dictionary

    gmpop_pipeline:
        # align to gapminder's geo domain
        - procedure: translate_value
          options:
              ingredient: gm-pop
              concept: country
              dictionary: ...  # skip dictionary

    main_pipeline:
        - procedure: add_datapoints
          options:
              ingredient: unpop_pipeline  # ref to unpop_pipeline output
              concepts:
                  - concept: population
                    by: "*"
        - procedure: add_entities
          options:
              ingredient: gm-geo
              domain: geo
        - procedure: join
          options:
              source: 
                  ingredient: unpop_pipeline
                  key: ["country_code", "year"]
              target:
                  ingredient: this   # ref to data in current pipeline
                  key: null
              fields:
                  country:
                      name: country_code
                  year:
                      name: year
                  population:
                      name: population
                      aggregate: sum

        # skip other contents...
jheeffer commented 7 years ago

So, to have an overview.

One of the problems we have to solve is that we are joining two or more scopes of concepts that might be incompatible with one another. Right now the scope of procedures is always the ingredient. Two ingredients in one recipe could use the same concept (with a different definition/values etc) without a problem. Only when merging two ingredients into one could merging the two scopes give problems.

Now, with our new plan, the scope of what a recipe modifies moves to a dataset. And it would be great if we could keep the scope of a recipe to one dataset as well. It keeps recipes simple. But in a recipe we want to join the scopes of different datasets into one. And to be able to join the scopes, transformations need to be done. To do those transformations we need to be able to work in the scopes of the datasets before joining. So... options are

  1. Have multiple scopes in one recipe, for example like you did with the multiple pipelines.
  2. Have recipes to move a dataset to a non-conflicting scope. I.e. to the gapminder ontology (entity domains and concepts). Then join the result of those scope-compatible datasets.
  3. Do the transformations that resolve conflicts in the same procedure that adds data to the new dataset, to prevent conflicts while practically keeping one scope in the recipe.

I would like to move towards something similar to the datapackage pipelines, so we could use their development resources for our benefit. However, of course not at any cost; it should still make sense.

jheeffer commented 7 years ago

@semio how would your proposal enable streaming?

semio commented 7 years ago

how would your proposal enable streaming?

I want each recipe to have only one main pipeline, and within a pipeline the procedures are processed one by one, with datasets as the input/output of each procedure. So datasets will be passed from one procedure to another. Chef will always start from the main pipeline, and when it needs the output from another pipeline, it will open a new process and get the result from that pipeline. Sub-pipelines can also be set up like dependencies in datapackage-pipelines, and stored in separate recipe files.

So the streaming is based on datasets in my proposal, unlike datapackage-pipelines, which can process resources row by row.
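
As a rough sketch of what I mean (not chef's actual implementation; the runner and procedure registry names are made up, sub-pipelines are run in-process here instead of in a separate process, and procedures that take several ingredients, like join, are glossed over):

PROCEDURES = {}  # procedure name -> function(dataframe, **options) -> dataframe

def run_pipeline(recipe, name, ingredients, cache):
    """Run one pipeline: procedures are applied one by one and the
    dataset is passed from one procedure to the next."""
    result = None
    for step in recipe['cooking'][name]:
        options = dict(step['options'])
        ing = options.pop('ingredient')
        if ing == 'this':                    # data in the current pipeline
            df = result
        elif ing in recipe['cooking']:       # output of another (sub) pipeline
            if ing not in cache:
                cache[ing] = run_pipeline(recipe, ing, ingredients, cache)
            df = cache[ing]
        else:                                # a plain ingredient loaded from disk
            df = ingredients[ing]
        result = PROCEDURES[step['procedure']](df, **options)
    return result

# chef always starts from the main pipeline:
# final = run_pipeline(recipe, 'main_pipeline', ingredients, cache={})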

semio commented 7 years ago

In reply to https://github.com/semio/ddf_utils/issues/59#issuecomment-301554523

Ok, I see there are a few ways we can solve the scope problem. I think point 1 is most like the datapackage pipelines. In datapackage-pipelines, all pipelines are described in the pipeline-spec.yaml file, so there might be multiple pipelines in one file. And we can still do point 3, resolving conflicts in each step.

For point 2, I think the problem is that we can't manage the ontology dataset itself with a recipe, which means that when we want to add some new data (concepts/entities) from another dataset to the ontology dataset, we can't use a recipe, because we don't yet have the ontology dataset that contains all the information.

I would like to move towards something similar as the datapackage pipelines, so we could use their development resources for our benefit.

Yes, that would be good, and I think it's possible to do that, because their input/output are also datasets with datapackage.json. Options are:

  1. use their api to build custom functions for ddf datasets. in this case, we should adapt the recipe format to their syntax
  2. to use our own syntax, ~we can change the specs and parser in their source code, and then use the api to build ddf dataset functions~ datapackage-pipelines has a plugin system which allows us to create pipeline templates, which means we can have custom syntax
  3. have our own recipe runner and syntax, and borrow their web server/cache mechanism
semio commented 7 years ago

@jheeffer I think point 3 is the better option for us, because we want to make procedures based on datasets, but datapackage-pipelines is based on rows. Also, datapackage-pipelines doesn't use the pydata packages (e.g. numpy/pandas) at all, so it might add a lot of work for us to fit into their system.

I will try to build a plugin for datapackage-pipelines, add some simple processors to it, and see if it works: https://github.com/semio/datapackage_pipelines_ddf

UPDATE:

I dug into their source code and tried to create processors for ddf, but I don't think it will be good to build our recipes this way. The disadvantages are:

so my opinion is to continue with our chef module and try to learn the good things from them.

jheeffer commented 7 years ago

Alright! Thanks for the research @semio !

So, if I understand correctly, you tried to see how our pipeline projects could merge, which in essence means applying the logic of gapminder procedures in datapackage processors.

The main problems are that:

  1. We use Numpy/Pandas to load a full table/csv file into memory and can then perform operations on it in memory. DP pipelines has streaming iterators over rows, so it can only work on one row at a time.

    1. If we want to keep the benefits of pandas dataframes, we can't use the per-row streaming of datapackage-pipelines; at best we stream per file. Which might be fine, since per file means we can perform operations a lot faster (though we use more memory).
  2. We don't use the iterators provided in datapackage processors because of the above. Nonetheless we have to follow the pipeline structure: i.e. spew needs to get the updated resources as an iterator to be able to continue to the next processor. But our way of handling data kinda blocks the row-streaming (we're not returning streaming iterators, just iterators over the dataframe); see the sketch after this list.

  3. You cannot use interactive debugging. Though I don't have enough python experience to know much about this, that seems annoying : )
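
Roughly, a processor ends up looking like this (a sketch based on datapackage-pipelines' ingest/spew wrapper as I understand it; the population transformation is just a stand-in, not the actual code from the experiment):

import pandas as pd
from datapackage_pipelines.wrapper import ingest, spew

parameters, datapackage, resource_iterator = ingest()

def process_resources(resources):
    for rows in resources:
        # materialise the whole resource in memory; this is where the
        # row-streaming benefit is lost
        df = pd.DataFrame(list(rows))
        df['population'] = df['population'] * 1000   # some dataset-level operation
        # hand the result back to spew as a plain iterator over the dataframe,
        # not a real streaming iterator
        yield (row.to_dict() for _, row in df.iterrows())

spew(datapackage, process_resources(resource_iterator))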

So we choose to use more memory for faster processing, it seems?

I suppose it's possible to implement many of our procedures, applied on a dataset-scope, in a row-based manner. However, it would probably be a lot harder since you can't use all the yummyness in pandas.

I haven't gotten to a conclusion yet or a recipe format.

In the meantime you can maybe work on ddfSchema generation in python?

semio commented 7 years ago

Yes, you are correct about the problems.

Indeed, we can implement our procedures in a row-based way. And I guess we can improve the performance by using multithreading/multiprocessing.
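
For example, something like this (just a sketch; the file name and the per-file transformation are hypothetical):

from concurrent.futures import ProcessPoolExecutor
import pandas as pd

def translate_resource(path):
    # a stand-in per-file procedure: load one datapoint file and transform it
    df = pd.read_csv(path)
    df['country_code'] = df['country_code'].str.lower()
    return df

paths = ['ddf--datapoints--population--by--country_code--year.csv']  # one entry per file

if __name__ == '__main__':
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(translate_resource, paths))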

On the debugging problem, you can look at the implementation of join or concatenate, which I think would be easier to write if we streamed based on tables. They have many lines of code, so when I tried to understand what was going on in the program, I wanted to stop it at some point and see what's in the variables. That's why I want interactive debugging. But now I have to add many logging statements, so it's like not being able to inspect elements in the browser and only being able to print things to the console.

When we face memory limits, we can consider dask, which provides support for lazily loading big dataframes, as well as task scheduling.
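
For example (a sketch; the file pattern is made up):

import dask.dataframe as dd

# lazily load a set of datapoint files that may not fit in memory
ddf = dd.read_csv('ddf--datapoints--population--*.csv')
total = ddf.groupby('year')['population'].sum()   # still lazy
print(total.compute())                            # triggers the actual computation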

okay, I will switch to ddfSchema for now. Let me know when you have new ideas or questions :)

P.S. I also found xarray for multidimensional datasets; it follows the netCDF data structure, which looks similar to DDF (though it doesn't support one indicator with different keys in one dataset). Might be useful for us.
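
For example, a population datapoint table keyed by country and year maps onto an xarray Dataset like this (a sketch with made-up numbers):

import xarray as xr

ds = xr.Dataset(
    {'population': (('country', 'year'), [[1.36e9, 1.37e9], [3.2e8, 3.23e8]])},
    coords={'country': ['chn', 'usa'], 'year': [2015, 2016]},
)
print(ds['population'].sel(country='chn', year=2016).item())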