Closed gforsyth closed 5 months ago
Please summarize your teaching or public speaking experience
I have been speaking publicly in the software industry since around 2015, all of it about Python and analytics.
In a past life I taught undergraduate statistics to psychology students, as well as experimental design.
Tell us what experience you have in the subject
In addition to working full time on Ibis, I've given a large number of talks on it, nearly all of them public. Here are a few places you can see what I've done:
How about this - feel free to modify it to fit the structure. It reads a bit weird
Please summarize your teaching or public speaking experience
"Naty Clementi is an experienced educator. She has presented tutorials at meetups and conferences such as SciPy, PyData NYC, and Women Who Code DC. In addition, she has taught multiple (unrecorded) Python courses to graduate and undergraduate students at the George Washington University."
Recent Tutorials:
Recent Talks:
Tell us what experience you have in the subject (not sure how to say I'm quite new to it, feel free to edit this as you see fit.)
Naty has been actively contributing to and working full time on Ibis since October 2023.
Closing as this was delivered. Great job everyone 🎉
Title
Intro to Ibis: blazing fast analytics with DuckDB, Polars, Snowflake, and more, from the comfort of your Python REPL.
Description
Tabular data is ubiquitous, and pandas has been the de facto tool in Python for analyzing it. However, as data size scales, analysis using pandas may become untenable. Luckily, modern analytical databases (like DuckDB) are able to analyze this same tabular data, but perform orders-of-magnitude faster than pandas, all while using less memory. Many of these systems only provide a SQL interface though; something far different from pandas’ dataframe interface, requiring a rewrite of your analysis code.
This is where Ibis comes in. Ibis provides a common dataframe interface to many popular databases and analytics tools (BigQuery, Snowflake, Spark, DuckDB, …). This lets users analyze data using the same consistent API, regardless of which backend they’re using, and without ever having to learn SQL (but you can use SQL if you want to!). No more pains rewriting pandas code to something else when you run into performance issues; write your code once using Ibis and run it on any supported backend.
In this tutorial we’ll cover the basic verbs of Ibis data analysis (`select`, `filter`, `group_by`, `order_by`, `join`, and `aggregate`), and how these operations may be composed to form more complicated queries.

This is a hands-on tutorial, with numerous examples to get your hands dirty. Participants should ideally have some experience using Python and pandas, but no SQL experience is necessary.
Tentative Schedule
0:00 - Intro and Setup “Going beyond pandas”
Get attendees up and running in a GitHub Codespace or on their laptops. A bit of motivation about the kinds of problems where Ibis can help, and a general survey of attendees to find out what their existing pain points and experiences are.
0:15 - Introduction to Ibis basics
A hands-on, follow-along notebook introducing the basic verbs of Ibis data analysis (`select`, `filter`, `group_by`, `order_by`, and `aggregate`), with hands-on exercises throughout.

1:00 - Coffee Break (5 minutes that definitely takes 10 minutes)
1:10 - In-memory tables, joins, and data analysis
Building on the previous notebook, we'll explore how to join in-memory data (from a pandas DataFrame, Python dictionary, or PyArrow Table) with existing tables in a local database and continue analysis on the join result.
We'll touch on the Ibis deferred operator for specifying predicates in chained joins, and demonstrate `read_parquet` and other `read_*` methods for loading local data into existing databases.

Then we'll continue with a series of hands-on exercises, building up an analysis pipeline for some IMDB ratings data, but only operating on a 5% sample of the original dataset.

Afterward, we'll show how the same expression can be computed on the full dataset without any code changes, both for local execution and with bursting to a cloud database (or other hosted database).
2:00 - Coffee Break (5-10 minutes) + Q&A in the room
2:10 - Selectors
Continuing on from joins, we'll introduce `selectors` as a means of quickly renaming and cleaning datasets, a powerful feature ~stolen~ inspired by `dplyr`.
2:25 - UDFs and SQL passthrough
Demonstrate using UDFs to add custom operations.
Explain and demonstrate various "escape hatches" when you really need to use SQL directly.
2:45 - PyPI data exploration and integration with the broader Python ecosystem
Demonstrate projection pushdown and column pruning when operating on remote datasets. Explore questions about PyPI maintainers, search for typo-squatters, and try to find explanations for outliers using data from https://py-code.org/datasets.
Feed Ibis expressions into common plotting tools to look for outliers and demonstrate interoperability.
(Note: depending on conference wifi, even with column pruning and parquet files this may be untenable. We have backup exercises that perform the same analysis but make use of either the ClickHouse Playground or a sponsored Snowflake account, so only basic internet connectivity will be required. Bonus: using Ibis means shifting to these backup options is a one-line operation!)
3:15 - Continued PyPI data analysis examples and open Q&A
3:25 - Wrap-up
Past Experience
We have given versions of this tutorial (although shorter) at EuroSciPy 2023 and PyData NYC 2023: https://github.com/ibis-project/ibis-tutorial
Gil Forsyth
I'm an experienced instructor, having led tutorials at several PyData conferences, PyCon, and SciPy. I also ran internal training on Python and distributed data analysis at Capital One for several years. I am one of the core maintainers of Ibis.
Tutorials
Talks
Jim Crist-Harif
Jim is also a core maintainer of Ibis, as well as one of the original contributors to, and a long-time maintainer of, Dask. He has presented many talks and tutorials; links are available on his website: https://jcristharif.com/talks.html