road86 / bahis-data

Repository for cleaning and adjusting BAHIS-related data

Decide on an ETL / data engineering pipeline framework #4

Closed ChasNelson1990 closed 1 year ago

ChasNelson1990 commented 1 year ago

Problem

Desired Solution

There are many tools out there for doing this. We should investigate them and decide which to use going forward, e.g. (off the top of my head):

Considered Alternatives

We could also just write this in Python and Pandas, but one of these frameworks may be more useful for us.

Additional Context

No response

WaliZaman commented 1 year ago

Short-listed:

Prefect:
• Execute code and keep data secure in our existing infrastructure
• Sub workflows possible
• Python framework

image

Luigi:
• Visual overview of the dependency graph of the workflow
• Python package
• Sub workflows possible (there are packages)

image

DBT:
• Can be written in .sql or .py files
• DBT handles the chore of dependency management
• Allows returning to a previous state
• Transforms the data where it lives
• Can run individual sub workflows

@mixmixmix @yokat These are the ETL / data engineering pipeline tools that I have short-listed based on my discussion with @ChasNelson1990. If you have any questions or opinions, could you post them here so I can look into them further?

ChasNelson1990 commented 1 year ago

@WaliZaman maybe you should explain the motivation a little more? Maybe share our whiteboard doodling?

WaliZaman commented 1 year ago

image

The reasons for looking into these tools:
• They will help in the management of the flows (making the computer manage it rather than a human)
• Dependencies are easier to handle
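To make the dependency point concrete, here is a minimal sketch of what "the computer manages the order" could look like without any framework, using only Python's standard-library `graphlib` (Python 3.9+). The step names are hypothetical and not taken from the BAHIS pipeline; this illustrates the idea and is not how any of the short-listed tools work internally.

```python
from graphlib import TopologicalSorter

# Hypothetical step names; each step maps to the set of steps it depends on.
dependencies = {
    "extract": set(),
    "clean": {"extract"},
    "aggregate": {"clean"},
    "report": {"clean"},
    "export": {"aggregate", "report"},
}


def run_order(graph):
    """Return the steps in an order that respects every declared dependency."""
    return list(TopologicalSorter(graph).static_order())


order = run_order(dependencies)
print(order)  # "extract" always comes first and "export" last
```

With this approach a human only declares what each step needs; the ordering is computed. A circular dependency would raise `graphlib.CycleError` rather than silently running steps in the wrong order.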

mixmixmix commented 1 year ago

I think we should also discuss what our actual minimal requirements for this feature are before committing much effort to evaluating the available tools: https://github.com/road86/project-bahis/issues/157

mixmixmix commented 1 year ago

@WaliZaman my opinion would be to use whatever tool is easiest for the in-country team to use and maintain, as it is a way to automate what would otherwise be custom Python scripts.

WaliZaman commented 1 year ago

As none of the three short-listed tools could be run successfully to do the job, I will be shifting to custom Python scripts.
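For readers wondering what a custom-script approach might look like, the following is a rough sketch of a small extract / transform / summarise chain in plain Python. Everything in it is an assumption made for illustration: the column names (`district`, `cases`), the sample rows, and the function names are made up and do not come from the BAHIS data or from the scripts eventually written for this repository.

```python
import csv
import io

# Made-up sample input standing in for a raw export; not real BAHIS data.
RAW = "district,cases\nDhaka, 12\nKhulna,\nDhaka,3\n"


def extract(text):
    """Parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Strip whitespace, drop rows with a missing count, cast counts to int."""
    cleaned = []
    for row in rows:
        value = row["cases"].strip()
        if not value:
            continue  # skip rows where the count is missing
        cleaned.append({"district": row["district"].strip(), "cases": int(value)})
    return cleaned


def summarise(rows):
    """Sum the counts per district."""
    totals = {}
    for row in rows:
        totals[row["district"]] = totals.get(row["district"], 0) + row["cases"]
    return totals


totals = summarise(transform(extract(RAW)))
print(totals)
```

Keeping each stage as a small function with plain inputs and outputs makes the script easy to test and, if a framework is revisited later, easy to wrap as individual tasks.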