ChasNelson1990 closed this issue 1 year ago
Short-listed:

Prefect:
- Execute code and keep data secure in our existing infrastructure
- Sub-workflows possible
- Python framework

Luigi:
- Visual overview of the workflow's dependency graph
- Python package
- Sub-workflows possible (there are add-on packages for this)

DBT:
- Models can be written in .sql or .py files
- DBT handles the chore of dependency management
- Allows returning to a previous state
- Transforms the data where it lives
- Can run individual sub-workflows
@mixmixmix @yokat These are the ETL / data engineering pipeline tools that I have short-listed based on my discussion with @ChasNelson1990. If you have any questions or opinions, could you post them here so I can look into them further?
@WaliZaman maybe you should explain the motivation a little more? Maybe share our whiteboard doodling?
The reasons for looking into these tools:
- they will help in the management of the flows (making the computer manage them rather than a human)
- dependencies are easier to handle
I think we should also discuss what our actual minimal requirements for this feature are before committing much effort to evaluating the available tools: https://github.com/road86/project-bahis/issues/157
@WaliZaman my opinion would be to use whichever tool is easiest for the in-country team to use and maintain, as it is a way to automate what would otherwise be custom Python scripts.
As none of the three shortlisted tools could be made to do the job successfully, I will be switching to custom Python scripts.
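For the custom-script route, the main thing the shortlisted tools would have given us is dependency management. That part is achievable with the standard library alone; below is a minimal sketch using `graphlib.TopologicalSorter` (Python 3.9+). The step names and the dependency graph are invented for illustration, not taken from the actual pipeline:

```python
# Hypothetical sketch: a plain-Python dependency runner, so the
# computer (not a human) decides the order steps run in.
from graphlib import TopologicalSorter

# Each step maps to the set of steps it depends on (names are made up).
dependencies = {
    "extract": set(),
    "clean": {"extract"},
    "transform": {"clean"},
    "load": {"transform"},
}

def run_step(name: str) -> str:
    # Placeholder for the real work each step would do.
    return f"{name} done"

def run_pipeline(deps: dict[str, set[str]]) -> list[str]:
    # static_order() yields steps in a valid dependency order and
    # raises CycleError if the graph has a cycle.
    order = []
    for step in TopologicalSorter(deps).static_order():
        run_step(step)
        order.append(step)
    return order

print(run_pipeline(dependencies))  # → ['extract', 'clean', 'transform', 'load']
```

Adding a new step is then just one new entry in `dependencies`, which keeps the "who runs before whom" bookkeeping out of human hands.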
Problem
Desired Solution
There are many tools out there for doing this. We should investigate them and decide which to use going forward, e.g. (off the top of my head):
Considered Alternatives
We could also just write this in Python and pandas, but one of these frameworks may be more useful for us.
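To make the "just Python and pandas" alternative concrete, here is a tiny sketch of what a transform step could look like without any framework. The dataset and column names are invented purely for illustration:

```python
# Hypothetical sketch of a framework-free transform step using pandas.
import pandas as pd

# Stand-in for data that would really be extracted from the source system.
raw = pd.DataFrame({
    "facility": ["A", "A", "B"],
    "cases": [3, 5, 2],
})

# Transform: aggregate case counts per facility.
summary = raw.groupby("facility", as_index=False)["cases"].sum()
print(summary)
```

The trade-off is that pandas gives us the data manipulation but none of the scheduling, retries, or dependency tracking the frameworks above provide.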
Additional Context
No response