Analytics engineering without dbt? Building the composable Python data stack with Kedro and Ibis
Abstract
For the past decade, SQL has reigned as king of the data transformation world, and tools like dbt have formed a cornerstone of the modern data stack. Until recently, Python-first alternatives couldn't compete with the scale and performance of modern SQL. Now, however, Ibis provides the same benefits of SQL execution behind a flexible Python dataframe API, and we can leverage it to build scalable Python pipelines in Kedro. In this tutorial, we will develop a simple analytics pipeline locally, then deploy it to a cloud data warehouse with just a configuration change.
Description
Python has become the lingua franca of data science, and it's a great language for building AI/ML pipelines. However, in the data engineering world, it leaves much to be desired. A lot of data practitioners end up:
slurping up large amounts of data into memory, instead of pushing execution down to the underlying database/engine
implementing proof-of-concepts on data extracts, and then struggling massively when they need to migrate or rewrite their logic to run against the production databases and scale out
insisting on building data pipelines in Python for consistency (fair enough), when dbt would have been the much better fit for data engineering because they essentially needed a SQL workflow
In this session, we will first understand the motivation for a better solution for building production data pipelines in Python:
The dev-prod dilemma. Existing solutions excel in the PoC/development phase; however, deploying the same code in production doesn't work as well as one would hope.
The SQL solution. In spite of its drawbacks, SQL presents a standardized* programming language that's supported by every database (and many other compute frameworks).
"What if I don't like SQL?" In the end, there will always be people (like myself, and, I imagine, many other attendees at a major Python conference) who would rather use Python than SQL.
Then, we will implement a local solution using DuckDB and two popular open-source Python libraries:
Kedro for building data pipelines following software engineering best practices
Ibis for defining data transformations using a familiar dataframe API, executed with the scale and performance of modern SQL
Last but not least, we will discuss other benefits of this solution, including the reusability and portability of the Ibis-based data pipelines and validations. To that end—with one simple configuration change—we will run the same pipeline at scale in Starburst Galaxy.
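As a sketch of what that configuration change might look like (the entry names, paths, and connection details are illustrative, assuming the `ibis.TableDataset` from kedro-datasets), only the catalog entry changes between environments while the pipeline code stays the same:

```yaml
# conf/base/catalog.yml -- local development runs on DuckDB
orders:
  type: ibis.TableDataset
  table_name: orders
  connection:
    backend: duckdb
    database: data/warehouse.duckdb

# conf/prod/catalog.yml -- the same entry, pointed at Starburst Galaxy
# orders:
#   type: ibis.TableDataset
#   table_name: orders
#   connection:
#     backend: trino
#     host: my-cluster.galaxy.starburst.io
#     ...
```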
Notes
We chose Starburst Galaxy for the tutorial only because it is easy to create a free trial account and get started using it (for the purpose of demonstrating support for multiple backends and remote execution). Another platform that offers a free trial, like Google BigQuery, would be an equally good option.
We have also recently published a blog post articulating how Kedro and Ibis can be used together.
Last but not least, only Deepyaman's name is included in the YouTube title, because "Deepyaman Datta, Juan Luis Cano Rodríguez, and Joel Schwarzmann" would be 63 characters alone.