ibis-project / talks

3 stars 2 forks source link

SciPy 2024 - Ibis: because SQL is everywhere and so is Python #23

Open gforsyth opened 4 months ago

gforsyth commented 4 months ago

Abstract

Tabular data is ubiquitous, and pandas has been the de facto tool in Python for analyzing it. However, as data size scales, analysis using pandas may become untenable. Luckily, modern analytical databases (like DuckDB) are able to analyze this same tabular data, but perform orders-of-magnitude faster than pandas, all while using less memory. Many of these systems only provide a SQL interface though; something far different from pandas’ dataframe interface, requiring a rewrite of your analysis code.

This talk will lay out the current database / data landscape as it relates to the SciPy stack, and explore how Ibis (an open-source, pure Python, dataframe interface library) can help decouple interfaces from engines, to improve both performance and portability. We'll examine other solutions for interacting with SQL from Python and discuss some of their strengths and weaknesses.

Description

Ibis is an open-source, pure Python library that lets you write Python to build up expressions that can be executed on a wide array of backends / execution engines (SQLite, DuckDB, Postgres, Spark, Clickhouse, Snowflake, BigQuery, and more!).

Modern analytical databases (like DuckDB) are able to analyze tabular data orders-of-magnitude faster than pandas, all while using less memory.

pandas and other Python libraries can interact with databases, but they were not designed to do so efficiently. Pulling ALL of the data to your local machine to then perform a reduction or aggregation is only tractable for very small problems.

Treating a remote database as a data store isn’t wrong, but it provides an incomplete view of everything these systems can offer.

Because they are very, very fast. 50 years of database research hasn't gone to waste - modern execution engines perform all kinds of optimizations to deliver results quickly.

In a cruel twist of fate, though, almost all of them require you to write SQL in order to use them. SQL is not an ideal tool for exploratory data analysis. If you know exactly how to answer the question in front of you, then SQL is probably (possibly) fine. But you don't always know how - that's part of what the exploration is.

SQL is only a language – it’s an interface. The execution engine is a separate thing. Historically the interface and the engine have been very tightly coupled, but they don’t have to be.

Maybe you would like to use the DuckDB execution engine, but you don’t like the interface (SQL)?

Or you would like to use the Spark execution engine, but you don’t like the interface (PySpark API)?

The interface shouldn’t be a hurdle for a user to clear in order to make use of the available tools. In the scientific Python community, SQL, in particular, is a hurdle that many users have turned away from. Ibis provides a consistent, Pythonic, and intuitive interface to interact with execution engines, even when their only “advertised” interface is SQL.

Additional Info

Source code: https://github.com/ibis-project/ibis Docs: https://ibis-project.org Ibis blog: https://ibis-project.org/posts

Previous Presentations

Tutorials

Talks

ncclementi commented 4 months ago

@gforsyth make sure you check the guidelines, as it looks like now they want a the description that is more worked out.

Description

Your placement in the program will be based on reviews of your description. This should be a roughly 500-word outline of your presentation. This outline should concisely describe software of interest to the SciPy community, tools or techniques for more effective computing, or how scientific Python was applied to solve a research problem. A traditional background/motivation, methods, results, and conclusion structure is encouraged but not required. Links to project websites, source code repositories, figures, full papers, and evidence of public speaking ability are encouraged.

They are also asking for links, blogpost, source code, etc (see tips section)

https://cfp.scipy.org/2024/cfp