gforsyth opened 3 months ago
Ibis: Don't Let the Engine Dictate the Interface
Tabular data is ubiquitous, and pandas has been the de facto tool in Python for analyzing it. However, as data size scales, analysis using pandas may become untenable. Luckily, modern analytical databases (like DuckDB) are able to analyze this same tabular data, but perform orders-of-magnitude faster than pandas, all while using less memory. Many of these systems, though, only provide a SQL interface, which is far different from pandas’ dataframe interface and requires a rewrite of your analysis code. This talk will lay out the current database / data landscape as it relates to the PyData stack, and explore how Ibis (an open-source, pure Python, dataframe interface library) can help decouple interfaces from engines, to improve both performance and portability. We'll examine other solutions for interacting with SQL from Python and discuss some of their strengths and weaknesses.
Ibis is an open-source, pure Python library that lets you write Python to build up expressions that can be executed on a wide array of backends / execution engines (SQLite, DuckDB, Postgres, Spark, Clickhouse, Snowflake, BigQuery, and more!).
Modern analytical databases (like DuckDB) are able to analyze tabular data orders-of-magnitude faster than pandas, all while using less memory.
pandas and other Python libraries can interact with databases, but they were not designed to do so efficiently. Pulling ALL of the data to your local machine to then perform a reduction or aggregation is only tractable for very small problems.
Treating a remote database as a data store isn’t wrong, but it provides an incomplete view of everything these systems can offer.
Because they are very, very fast. 50 years of database research hasn't gone to waste - modern execution engines perform all kinds of optimizations to deliver results quickly.
In a cruel twist of fate, though, almost all of them require you to write SQL in order to use them. SQL is not an ideal tool for exploratory data analysis. If you know exactly how to answer the question in front of you, then SQL is probably (possibly) fine. But you don't always know how - that's part of what the exploration is.
SQL is only a language – it’s an interface. The execution engine is a separate thing. Historically the interface and the engine have been very tightly coupled, but they don’t have to be.
Maybe you would like to use the DuckDB execution engine, but you don’t like the interface (SQL)?
Or you would like to use the Spark execution engine, but you don’t like the interface (PySpark API)?
The interface shouldn’t be a hurdle for a user to clear in order to make use of the available tools. In the scientific Python community, SQL, in particular, is a hurdle that many users have turned away from. Ibis provides a consistent, Pythonic, and intuitive interface to interact with execution engines, even when their only “advertised” interface is SQL.
+1-ing this, since I've seen the talk and it's amazing.
I submitted a talk too, I adapted the geospatial one a bit.
Ibis, DuckDB, and GeoParquet: Making Geospatial Analytics Fast, Simple, and Pythonic
Geospatial data is becoming increasingly integral to data workflows, and Python offers a wide array of tools to handle it. A powerful new option has recently emerged: DuckDB, which now supports geospatial analytics via its new spatial extension. DuckDB has taken the data world by storm (~23k stars on GitHub) and is making waves in geospatial data too. Plus, with the increasing development and adoption of GeoParquet, storing and exchanging geospatial data has never been easier. But what if you prefer writing Python code over SQL? That’s where Ibis comes in. Ibis is a Python library that provides a dataframe-like interface, allowing you to write Python code to construct SQL expressions that can be executed on various backends, including DuckDB.
In this talk, I’ll demonstrate how to leverage the power of DuckDB’s spatial capabilities while staying within the Python ecosystem—yes, there will be a live demo! (Pssst... I’ll show you how to work with GeoParquet data from Overture Maps, create nice plots that won’t kill your laptop, and avoid SQL.) This is an introductory talk; everyone is welcome, and no prior experience with spatial databases or geospatial workflows is needed.
Ibis is an open-source Python library that provides a dataframe-like API, enabling you to write Python code to build expressions that can be executed across multiple backends such as DuckDB, PostgreSQL, BigQuery, and more. Some of these backends offer support for geospatial operations that can be executed via Ibis without the need to write any SQL. In this talk, we aim to showcase our default backend: DuckDB.
Over the past year, DuckDB has introduced support for over 100 geospatial operations, many of which are now accessible via Ibis. This allows you to experiment with these operations while remaining in Python land. If you have experience working with spatial databases, you are likely familiar with PostGIS, a library that extends PostgreSQL's capabilities to handle geospatial data. The DuckDB spatial extension provides a healthy subset of PostGIS-like options, but getting started is much simpler. No server-side setup, user configuration, or client configuration. DuckDB seamlessly integrates into existing GIS workflows, regardless of data formats or projections. Recently, DuckDB has also added support for GeoParquet. GeoParquet extends the powerful Apache Parquet columnar data format to the geospatial domain, making it easier to work with geospatial data in a high-performance, columnar format.
With Ibis, performing your first spatial operations becomes even easier and, most importantly, it’s Python! During this talk, we will introduce Ibis and demonstrate its geospatial functionality through an example, using DuckDB as the backend and working with a GeoParquet data source. We will also explore compatibility with other Python libraries such as GeoPandas and lonboard for plotting purposes. By the end of the talk, you’ll know how to get started with Ibis and work with spatial data using DuckDB as the backend engine.
Building machine learning pipelines that scale: a case study using Ibis and IbisML
Tutorial (90 minutes)
Libraries like Ibis have been gaining traction recently, by unifying the way we work with data across multiple data platforms—from dataframe APIs to databases, from dev to prod. What if we could extend the abstraction to machine learning workflows (broadly, sequences of steps that implement `fit` and `transform` methods)? In this tutorial, we will develop an end-to-end machine learning project to predict the live win probability at any given move during a chess game.
As Python has become the lingua franca of data science, pandas and scikit-learn have cemented their roles in the standard machine learning toolkit. However, when data volumes rise, this stack becomes unwieldy (requiring proportionately larger compute, subsampling to reduce data size, or both) or altogether untenable.
Luckily, modern analytical databases (like DuckDB) and dataframe libraries (such as Polars) can crunch this same tabular data, but perform orders-of-magnitude faster than pandas, all while using less memory. Ibis already provides a unified dataframe API that lets users leverage a plethora of popular databases and analytics tools (BigQuery, Snowflake, Spark, DuckDB, etc.) without rewriting their data engineering code. However, at scale, the performance bottleneck is pushed to the ML pipeline.
IbisML extends the intrinsic benefits of using Ibis to the ML workflow. It lets you bring your ML to the database (or other Ibis-supported backend), and supports efficient integration with modeling frameworks like XGBoost, PyTorch, and scikit-learn. On top of that, IbisML steps can be used as estimators within the familiar context of scikit-learn pipelines.
In this tutorial, we'll cover `Step`s and `Recipe`s, and how we can combine them to process features before passing them to our live win probability model. This is a hands-on tutorial, and you will train a simple (not great!) live win probability model on a provided dataset. You'll also see how the result can be run at scale on a distributed backend. Participants should ideally have some experience using Python dataframe libraries; scikit-learn or other modeling framework familiarity is helpful but not required.
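The fit/transform step pattern the tutorial builds on can be sketched with plain scikit-learn; IbisML steps are designed to slot into this same pipeline interface. The toy chess features and labels below are synthetic, for illustration only:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy features: (material difference, move number) -> did white win?
X = [[3, 10], [-2, 25], [5, 40], [-4, 12], [1, 30], [-1, 8]]
y = [1, 0, 1, 0, 1, 0]

# Each intermediate step implements fit/transform;
# the final estimator implements fit/predict.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])
pipe.fit(X, y)

# predict_proba gives a (not great!) live win probability per position.
print(pipe.predict_proba([[2, 20]])[0, 1])
```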
My talk "Ibis, DuckDB, and GeoParquet: Making Geospatial Analytics Fast, Simple, and Pythonic" got accepted. Leaving this here for tracking purposes.
https://pydata.org/nyc2024/call-for-proposals