python-bonobo / bonobo

Extract Transform Load for Python 3.5+
https://www.bonobo-project.org/
Apache License 2.0

[FR] loop (sub-)chains #405

Open xunxky opened 3 years ago

xunxky commented 3 years ago


Thanks for the generator-based approach of bonobo. We have hit several limitations with bonobo, most of which I could work around. The most recent one, however, seems to me like an important feature that I have not yet seen implemented natively in any ETL tool.

We are processing JSON documents read from JSON Lines files. These documents have the following structure:
    
    {
    "name":"some name",
    "items":[
    {"value":"sub-document 1"},
    {"value":"sub-document 2"},
    {"value":"sub-document 3"}
    ]
    }

However, we need to process the sub-documents from the items array/list.
Imagine we have one node which adds the date to the sub-document and another node which adds an id based on the full document's name field and the position in the array/list.

Simply speaking, we could and will do this:

    from datetime import datetime

    for i, item in enumerate(doc["items"]):
        item["date"] = datetime.now()       # add the date to the sub-document
        item["id"] = doc["name"] + str(i)   # id from the document name and the position

But it would be much more valuable if we could separate these responsibilities into different nodes.

The ultimate goal would be to loop through the nodes in a chain based on the number of items in the document.
Of course, I am not talking about bonobo inspecting the data, but about it offering a step-in/step-out, visitor-pattern-like approach to control looping (more generally, controlling the flow of a chain/node from a different node's point of view).

![chain](https://user-images.githubusercontent.com/4629525/121397639-cf763b00-c954-11eb-82fb-8c93351e8ef3.png)

    digraph G {
        subgraph cluster {
            node [style=filled];
            "add date" -> "add id";
            label = "loop until split yields EOD";
            color = blue;
        }
        "document split" -> "add date";
        "add id" -> "document unsplit";
    }
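
To make the request concrete, here is a minimal sketch of the split / loop / unsplit idea in plain Python. It does not use any bonobo API; `loop_subchain`, `add_date` and `add_id`, as well as the `(item, index, doc)` node signature, are hypothetical names chosen for illustration only.

```python
from datetime import datetime


def add_date(item, index, doc):
    """Node 1 (hypothetical): stamp the sub-document with the current time."""
    return {**item, "date": datetime.now()}


def add_id(item, index, doc):
    """Node 2 (hypothetical): derive an id from the parent's name and the position."""
    return {**item, "id": doc["name"] + str(index)}


def loop_subchain(doc, nodes, key="items"):
    """Split doc[key], run every item through `nodes` in order, then unsplit."""
    processed = []
    for index, item in enumerate(doc[key]):   # "document split"
        for node in nodes:                     # the looped sub-chain
            item = node(item, index, doc)
        processed.append(item)
    return {**doc, key: processed}             # "document unsplit"


doc = {
    "name": "some name",
    "items": [{"value": "sub-document 1"}, {"value": "sub-document 2"}],
}
result = loop_subchain(doc, [add_date, add_id])
```

Each node only knows about one responsibility, and the loop itself lives in a single place. In this sketch the nodes return new dictionaries, so the input document is left unchanged.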

xunxky commented 2 years ago

Hi, a little feedback. I have implemented this (at least for our needs, and incompatible with the bonobo "library") in both sync and async variants. It is only a straightforward chain without any branching (but it could actually be nested). The more I look into this, the more I believe that the basic approach of assuming some "graph" is too academic.

Please have a look at GStreamer, where sources and sinks are used to redirect data flow. In some instances (e.g. grouping, counting, ...) the sinks need to know when the last element has been sent, so that their adjacent source can emit the computed result.
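
The end-of-data point can be illustrated with a small sketch in plain Python. It uses neither GStreamer nor bonobo; the `EOD` marker and the `source` / `counting_sink` functions are hypothetical and only show why an aggregating step has to be told that the last element has been sent.

```python
EOD = object()  # hypothetical end-of-data marker, not a bonobo or GStreamer object


def source(values):
    """Emit every value, then signal that nothing more will come."""
    yield from values
    yield EOD


def counting_sink(stream):
    """Consume the stream and emit the count only once EOD has been seen."""
    count = 0
    for element in stream:
        if element is EOD:
            yield count   # the computed result can only be emitted now
            return
        count += 1


result = list(counting_sink(source(["a", "b", "c"])))
```

With pull-based generators the exhaustion of the iterator would already mark the end, so the explicit marker matters mostly when elements are pushed between independently running nodes.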