Closed gforsyth closed 5 months ago
Please summarize your teaching or public speaking experience
I have been speaking publicly in the software industry since around 2015, all of it about Python and analytics.
In a past life I taught undergraduate statistics to psychology students, as well as experimental design.
Tell us what experience you have in the subject
In addition to working full time on Ibis, I've given a large number of talks on it, nearly all of them public. Here are a few places you can see what I've done:
How about this - feel free to modify it to fit the structure. It reads a bit weird
Please summarize your teaching or public speaking experience
"Naty Clementi is an experienced educator. She has presented tutorials at meetups and conferences such as SciPy, PyData NYC, and Women Who Code DC. In addition, she has taught multiple (unrecorded) Python courses to graduate and undergraduate students at the George Washington University."
Recent Tutorials:
Recent Talks:
Tell us what experience you have in the subject (not sure how to say I'm quite new to it, feel free to edit this as you see fit.)
Naty has been actively contributing to and working full time on Ibis since October 2023.
Closing as this was delivered. Great job everyone 🎉
Title
Intro to Ibis: blazing fast analytics with DuckDB, Polars, Snowflake, and more, from the comfort of your Python REPL.
Description
Tabular data is ubiquitous, and pandas has been the de facto tool in Python for analyzing it. However, as data size scales, analysis using pandas may become untenable. Luckily, modern analytical databases (like DuckDB) are able to analyze this same tabular data, but perform orders-of-magnitude faster than pandas, all while using less memory. Many of these systems only provide a SQL interface though; something far different from pandas’ dataframe interface, requiring a rewrite of your analysis code.
This is where Ibis comes in. Ibis provides a common dataframe interface to many popular databases and analytics tools (BigQuery, Snowflake, Spark, DuckDB, …). This lets users analyze data using the same consistent API, regardless of which backend they’re using, and without ever having to learn SQL (but you can use SQL if you want to!). No more pains rewriting pandas code to something else when you run into performance issues; write your code once using Ibis and run it on any supported backend.
In this tutorial we’ll cover the basic verbs of Ibis data analysis (`select`, `filter`, `group_by`, `order_by`, `join`, and `aggregate`), and how these operations may be composed to form more complicated queries.

This is a hands-on tutorial, with numerous examples to get your hands dirty. Participants should ideally have some experience using Python and pandas, but no SQL experience is necessary.
Tentative Schedule
0:00 - Intro and Setup “Going beyond pandas”
Get attendees up and running in a GitHub Codespace or on their laptops. A bit of motivation about the kinds of problems where Ibis can help, and a general survey of attendees to find out what their existing pain points and experiences are.
0:15 - Introduction to Ibis basics
A hands-on, follow-along notebook introducing the basic verbs of Ibis data analysis (`select`, `filter`, `group_by`, `order_by`, and `aggregate`), with hands-on exercises throughout.

1:00 - Coffee Break (5 minutes that definitely takes 10 minutes)
1:10 - In-memory tables, joins, and data analysis
Building on the previous notebook, we'll explore how to join in-memory data (from a pandas DataFrame, Python dictionary, or PyArrow Table) with existing tables in a local database and continue analysis on the join result.
We'll touch on the Ibis deferred operator for specifying predicates in chained joins, and demonstrate `read_parquet` and other `read_*` methods for loading local data into existing databases.

Then we'll continue with a series of hands-on exercises, building up an analysis pipeline for some IMDB ratings data, but only operating on a 5% sample of the original dataset.

Afterward, we'll show how the same expression can be computed on the full dataset without any code changes, both for local execution and with bursting to a cloud database (or other hosted database).
2:00 - Coffee Break (5-10 minutes) + Q&A in the room
2:10 - Selectors
Continuing on from joins, we'll introduce `selectors` as a means of quickly renaming and cleaning datasets, a powerful feature ~stolen~ inspired by `dplyr`.
2:25 - UDFs and SQL passthrough
Demonstrate using UDFs to add custom operations.
Explain and demonstrate various "escape hatches" when you really need to use SQL directly.
2:45 - PyPI data exploration and integration with the broader Python ecosystem
Demonstrate projection pushdown and column pruning when operating on remote datasets. Explore questions about PyPI maintainers, search for typo-squatters, and try to find explanations for outliers using data from https://py-code.org/datasets.
Feed Ibis expressions into common plotting tools to look for outliers and demonstrate interoperability.
(Note: depending on conference wifi, even with column pruning and parquet files this may be untenable. We have backup exercises that perform the same analysis but make use of either the ClickHouse Playground or a sponsored Snowflake account, so only basic internet connectivity will be required. Bonus: using Ibis means shifting to these backup options is a one-line operation!)
3:15 - Continued PyPI data analysis examples and open Q&A
3:25 - Wrap-up
Past Experience
We have given versions of this tutorial (although shorter) at EuroSciPy 2023 and PyData NYC 2023: https://github.com/ibis-project/ibis-tutorial
Gil Forsyth
I'm an experienced instructor, having led tutorials at several PyData conferences, PyCon, and SciPy. I also ran internal training on Python and distributed data analysis at Capital One for several years. I am one of the core maintainers of Ibis.
Tutorials
Talks
Jim Crist-Harif
Jim is also a core maintainer of Ibis, as well as one of the original contributors to, and a long-time maintainer of, Dask. He has presented many talks and tutorials; links are available on his website: https://jcristharif.com/talks.html