python-bonobo / bonobo

Extract Transform Load for Python 3.5+
https://www.bonobo-project.org/
Apache License 2.0
1.59k stars 146 forks source link

feature request: aggregations! #304

Closed Sjoerd82 closed 5 years ago

Sjoerd82 commented 5 years ago

I've used Bonobo now for about a week, and I'm fairly new to Python (maybe a year or so of using it in my spare time). That said, I've over 10yrs of ETL experience, mostly using Informatica, so I've modelled my project in a similar way as if using Informatica PowerCenter.

For this I created generic data loaders ("source qualifiers") and table writers that work together with SQLAlchemy in order to keep the code small and reusable over a larger project. Something like this could perhaps be a nice Bonobo extension.

But my Number 1 missing thing was that I can't create an easy aggregation transformation! I need to have a blocking transformation which only continues when all records [or all records of a 'group', if the input is sorted] have been received. That way I can do aggregations (sum, avg etc.) on the data. Or, I can place it at the end of a graph to sign off a completed data load, do some housekeeping, etc.

That said, I think Bonobo has great potential, as most ETL tools are expensive and out of reach for smaller companies.

Sjoerd82 commented 5 years ago

I'm creating a set of core ETL transformations, including the aggregator. I'm still a learning Python programmer, code reviews and suggestions are highly treasured! Thanks in advance!

https://pypi.org/project/bonobo-trans/ https://github.com/Sjoerd82/bonobo_trans https://bonobo-trans.readthedocs.io/en/latest/

By the way, this is my first published open-source software, so I'm open to anything that will improve quality and usefulness!

Sjoerd82 commented 5 years ago

Closing issue as this feature should probably not be part of the 'bare-bones' Bonobo ETL.

hartym commented 5 years ago

I disagree, I think some parts of this may find a way to bonobo (not everything, but still). I have a but of trouble with time management band looking in details seriously take some time, I'd suggest to leave this issue open for now unless I do miss something or you'd really want to keep this in a separated package. Let me know.

Sjoerd82 commented 5 years ago

I feel that there is a lot to say for Bonobo-ETL to specialize in being a kick-ass ETL engine only. Which transformations to include in this is I guess a bit of a gray area. Say, the Filter (already included) is so basic and sits so close to the engine that it makes sense to include, but say, a Type2 SCD transformation (data warehousing) is such a rich/specialized transformation, I'd say it's better to ship it as a set of separate add-on transformations. Especially richer transformations may give the project dependencies which may slow it down / make it less lean.

My "bonobo-trans" package contains the following six transformations, all relatively "core":

The transformations are fairly rich in options, making them flexible to use in a lot of scenario's. They also rely on other libraries, namely Pandas and SQLAlchemy (and in the future perhaps Dask).

My opinion at this point would be that it's best to have a separate engine with a small number of basic transformations and a (or set of) separate libraries of ETL transformations. Others may be inspired to create transformations and libraries of their own or expand (subclassing??) on what's there already.

Just my 2 cents. I'm new to this project, so I'm leaving this totally up to you, either approach has pro's and con's.

Oh, and mind you, these transformations are fairly 'alpha', I'm in the process of migrating a small DWH project to use these transformations, and still finding bugs, just a heads up that they still need some love here and there ;-)