semio opened this issue 7 years ago
So, to have an overview.
One of the problems we have to solve is that we are joining two or more scopes of concepts that might be incompatible with one another. Right now the scope of procedures is always the ingredient. Two ingredients in one recipe can use the same concept (with different definitions/values, etc.) without a problem. Only when merging two ingredients into one could merging the two scopes cause problems.
Now, with our new plan, the scope of what a recipe modifies moves to a dataset. And it would be great if we could keep the scope of a recipe to one dataset as well; it keeps recipes simple. But in a recipe we want to join the scopes of different datasets into one. And to be able to join the scopes, transformations need to be done. To do those transformations, we need to be able to work in the scopes of the datasets before joining. So... options are
I would like to move towards something similar as the datapackage pipelines, so we could use their development resources for our benefit. However, of course not at any cost, it should still make sense.
@semio how would your proposal enable streaming?
> how would your proposal enable streaming?
I want each recipe to have only one main pipeline, and in one pipeline the procedures are processed one by one, with datasets as the input/output of procedures. So datasets will be passed from one procedure to another. Chef will always start from the main pipeline, and when it needs the output from another pipeline, it will open a new process and get the result from that pipeline. Sub-pipelines can also be set up like dependencies in datapackage-pipelines, and stored in separate recipe files.
So the streaming in my proposal is based on datasets, unlike datapackage-pipelines, which can process resources row by row.
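The dataset-passing idea above can be sketched roughly like this (all procedure and helper names here are illustrative placeholders, not real ddf_utils APIs):

```python
import pandas as pd

# Illustrative sketch only: each procedure takes and returns a whole
# dataset (a dict of DataFrames), in contrast to datapackage-pipelines'
# row-by-row streaming.

def filter_row(dataset, condition):
    """Keep only rows matching a boolean condition, per table."""
    return {name: df.query(condition) for name, df in dataset.items()}

def translate_header(dataset, mapping):
    """Rename columns in every table of the dataset."""
    return {name: df.rename(columns=mapping) for name, df in dataset.items()}

def run_pipeline(dataset, procedures):
    """Pass the full dataset from one procedure to the next."""
    for proc, kwargs in procedures:
        dataset = proc(dataset, **kwargs)
    return dataset

population = {"datapoints": pd.DataFrame(
    {"geo": ["swe", "usa"], "year": [2000, 2000], "pop": [8.9, 282.2]})}

result = run_pipeline(population, [
    (filter_row, {"condition": "pop > 100"}),
    (translate_header, {"mapping": {"pop": "population"}}),
])
```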
In reply to https://github.com/semio/ddf_utils/issues/59#issuecomment-301554523
Ok, I see there are a few ways we can solve the scope problem. I think point 1 is most like the datapackage pipelines. In datapackage-pipelines, all pipelines are described in the pipeline-spec.yaml file, so there might be multiple pipelines in one file. And we can still do point 3, resolving conflicts in each step.
For point 2, I think the problem is that we can't manage the ontology dataset itself with a recipe, which means that when we want to add some new data (concepts/entities) from another dataset to the ontology dataset, we can't use a recipe, because we don't have an ontology dataset that contains all the information.
> I would like to move towards something similar as the datapackage pipelines, so we could use their development resources for our benefit.
Yes, that would be good, and I think it's possible to do that, because their input/output are also datasets with a datapackage.json. Options are:
@jheeffer I think point 3 is the better option for us, because we want to make procedures based on datasets, but datapackage-pipelines is based on rows. Also, datapackage-pipelines doesn't use the pydata packages (e.g. numpy/pandas) at all, so it might add a lot of work for us to fit into their system.
I will try to build a plugin for datapackage-pipelines, add some simple processors to it, see if it works: https://github.com/semio/datapackage_pipelines_ddf
UPDATE:
I dug into their source code and tried to create processors for DDF, but I don't think it would be good to make our recipe this way. The disadvantages are:
so my opinion is to continue with our chef module and try to learn the good things from them.
Alright! Thanks for the research @semio !
So, if I understand correctly you tried to see how our pipeline projects can merge. Which in essence means applying logic in gapminder procedures to datapackage processors.
The main problems are that:
We use numpy/pandas to load a full table/CSV file into memory and can then perform operations on it there. DP pipelines has streaming iterators over rows, so it can only work on one row at a time.
We don't use the iterators provided in datapackage processors because of the above. Nonetheless we have to follow the pipeline structure: i.e. spew needs to get the updated resources in an iterator to be able to continue to the next processor. But our way of handling data kind of blocks the row streaming (we're not returning streaming iterators, just iterators over the dataframe).
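The mismatch can be illustrated with a toy processor. This only mimics the *shape* of a datapackage-pipelines processor (take a row iterator, return a row iterator), not its actual API:

```python
import pandas as pd

# Toy processor: it must hand rows onward as an iterator, but pandas
# needs the whole table in memory first.

def pandas_processor(row_iterator):
    # Drain the stream: the full resource is now in memory, which is
    # what blocks true row streaming in our case.
    df = pd.DataFrame(list(row_iterator))
    # A table-scoped operation that is awkward to do row by row.
    df["share"] = df["pop"] / df["pop"].sum()
    # Hand back a plain iterator over the DataFrame; it only starts
    # producing after the table was fully materialised.
    return (row._asdict() for row in df.itertuples(index=False))

rows = iter([{"geo": "swe", "pop": 10.0}, {"geo": "usa", "pop": 30.0}])
out = list(pandas_processor(rows))
```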
You cannot use interactive debugging. I don't have enough Python experience to know much about this, but that seems annoying :)
So it seems we chose to use more memory in exchange for faster processing?
I suppose it's possible to implement many of our procedures, applied on a dataset-scope, in a row-based manner. However, it would probably be a lot harder since you can't use all the yummyness in pandas.
I haven't reached a conclusion or a recipe format yet.
In the meantime, maybe you can work on ddfSchema generation in Python?
Yes, you are correct about the problems.
Indeed we can implement our procedures in a row-based way. And I guess we can improve the performance by using multithreading/multiprocessing.
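As a rough illustration of the parallelism idea: if procedures were rewritten row by row, rows could be handed to a worker pool. `scale_row` is a made-up stand-in for a real procedure; note that for CPU-bound work a process pool would likely be needed, since Python threads mainly help with I/O-bound tasks:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical row-based procedure (not a real chef procedure).
def scale_row(row):
    return {**row, "pop": row["pop"] * 1000}

rows = [{"geo": "swe", "pop": 9}, {"geo": "usa", "pop": 282}]
with ThreadPoolExecutor(max_workers=2) as pool:
    scaled = list(pool.map(scale_row, rows))  # map preserves row order
```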
On the debugging problem: you can look at the implementation of join or concatenate, which I think would be easier to write if we streamed based on tables. They have many lines of code, so when I tried to understand what was going on in the program, I wanted to stop it at some point and inspect the variables. That's why I want interactive debugging. But now I have to add many logging statements, which is like not being able to inspect elements in the browser and only printing things to the console.
When we face memory limits, we can consider dask, which provides support for lazy loading of big dataframes, as well as task scheduling.
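The core idea dask brings (deferring work until it is needed) can be sketched in a few lines of plain Python. This is only the concept behind `dask.delayed`, not dask's actual API:

```python
# Toy illustration of deferred computation: build a task graph first,
# evaluate only on demand.  NOT dask's API, just the concept.

class Delayed:
    def __init__(self, func, *args):
        self.func, self.args = func, args

    def compute(self):
        # Recursively evaluate dependencies, then this node.
        args = [a.compute() if isinstance(a, Delayed) else a
                for a in self.args]
        return self.func(*args)

def load_chunk(n):   # stand-in for lazily reading one CSV chunk
    return list(range(n))

def total(a, b):     # stand-in for an aggregation over chunks
    return sum(a) + sum(b)

# Nothing is loaded yet; we have only described the computation.
graph = Delayed(total, Delayed(load_chunk, 3), Delayed(load_chunk, 4))
result = graph.compute()  # chunks are "read" and reduced only now
```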
okay, I will switch to ddfSchema for now. Let me know when you have new ideas or questions :)
P.S. I also found xarray for multidimensional datasets, which follows the netCDF data structure and looks similar to DDF (though it doesn't support one indicator with different keys in one dataset). Might be useful for us.
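To see why the data models are similar: a DDF datapoint file is long-format (one row per key combination), while xarray/netCDF store the same indicator as a labelled n-dimensional array. A pandas pivot mimics the 2-D case (illustration only):

```python
import pandas as pd

# Long-format datapoints, as in a DDF csv: one row per (geo, time).
long = pd.DataFrame({
    "geo":  ["swe", "swe", "usa", "usa"],
    "time": [2000, 2001, 2000, 2001],
    "population": [8.9, 8.9, 282.2, 285.0],
})

# geo x time grid, analogous to an xarray DataArray with two dims
grid = long.pivot(index="geo", columns="time", values="population")
```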
goal for new recipe format:
proposal
on `ingredients`: we don't need key/value any more, so we just drop them and keep the id/dataset/row_filter
on `cooking`: take gapminder population as an example; the pipeline process is as in this image
We can see that the gm-pop and un-pop pipelines are independent, and their results are used in the main pipeline. So I suggest that we make these pipelines objects in the `cooking` section. In each pipeline the data will be streamed from one procedure to another, and the final result will be available as an ingredient dataset with the same name as the pipeline. If we want to reference the pipeline result itself, we use the `this` ingredient. Furthermore, we can point a pipeline object to a recipe file, which means running that recipe file to get the pipeline result.
on `serve`: the result of the last procedure of main_pipeline should be the final dataset, and we can use the `serve` procedure to set options for the output format. But `serve` should be optional.

Example (Note: the usage of the procedures is not fully discussed yet)