Matt Harrison - An Introduction to Pandas 2, Polars, and DuckDB

Finished timestamps for this video: https://www.youtube.com/watch?v=vy8VrhaYR2M
Title: Matt Harrison - An Introduction to Pandas 2, Polars, and DuckDB | PyData Global 2023
- Timestamps:
  - 00:00 - General introduction
  - 03:33 - About Matt
  - 06:50 - Pandas 2 introduction
  - 10:08 - Presentation of Pandas 2 main feature no 1, using pyarrow for dtype backend instead of numpy
  - 12:28 - Presentation of Pandas 2 main feature no 2, copy on write
  - 13:07 - Start of Pandas 2 with pyarrow example in Jupyter Notebook
  - 15:16 - Dealing with columns for which pyarrow did not detect dtype by default
  - 18:28 - Presenting the actions on the dataset implemented with numpy
  - 19:11 - Inefficiencies of .apply function in pandas
  - 20:40 - Presenting the actions on the dataset implemented with a vectorized function
  - 21:38 - Processing time benchmark between the .apply and the vectorized solutions
  - 24:09 - Audience question: Are there any backwards compatibility issues between Pandas 2 and Pandas 1?
  - 26:55 - Audience question: Are there any reasons not to use pyarrow?
  - 27:40 - Audience question: How can I easily migrate to Polars or handle the missing index?
  - 29:06 - Polars introduction
  - 36:04 - Start of Polars example in Jupyter Notebook
  - 36:11 - Audience question: Can Polars run in a distributed way?
  - 36:34 - Polars example with the eager implementation
  - 38:30 - Polars eager example - convert column dtypes to dates where auto-detection didn't work
  - 40:18 - Polars eager example - implementation of the Pandas numpy .apply in Polars
  - 42:40 - Polars eager example - processing time benchmark
  - 43:14 - Considerations of Pandas vs Polars speed
  - 45:16 - Polars example with the lazy implementation
  - 47:20 - Answer to the question: Can Polars run in a distributed way?
  - 48:50 - Audience question: Is there an advantage to using Polars over pyspark?
  - 52:33 - Audience question: Is there an advantage to using Polars over Daft?
  - 53:50 - Introduction to DuckDB in the context of dataframes and tabular data
  - 55:48 - DuckDB background and main features
  - 58:08 - Start of DuckDB example in Jupyter Notebook using SQL
  - 58:56 - DuckDB how to load data
  - 1:01:30 - Audience question: What is a median-sized dataset?
  - 1:02:20 - DuckDB complicated query example
  - 1:03:07 - DuckDB Arrow integration
  - 1:04:48 - Audience question: Where can I get a copy of temp bill file?
  - 1:05:26 - Main conclusions and aspects related to switching from Pandas to Polars
  - 1:09:21 - Audience consideration: The Pandas pyarrow integration is incomplete (ref dt accessor)
  - 1:11:10 - Audience question: How do you deal with reading variables as strings in DuckDB?
  - 1:12:16 - Audience question: What tool do you recommend to start learning as a beginner?
  - 1:12:32 - Presentation of Tabular Tools (API & Scale) chart
  - 1:16:12 - Answer to the question: What tool do you recommend to start learning as a beginner?
  - 1:16:43 - Audience question: Will 'Effective Pandas 2' book have the same datasets as 'Effective Pandas' original edition?
  - 1:18:06 - Audience question about mass renaming variables
  - 1:19:54 - Which tool to use of the ones presented?
  - 1:21:48 - Matt contact details and areas of expertise
- Resources:
- GitHub repo of the notebook used in the presentation: https://github.com/mattharrison/talks/tree/2023-12-pydata
- Books:
  - Effective Pandas - Metasnake https://store.metasnake.com/effective-pandas1-book
  - Effective Pandas - Amazon https://www.amazon.com/Effective-Pandas-Patterns-Manipulation-Treading/dp/B09MYXXSFM
  - Effective Pandas 2 - Metasnake https://store.metasnake.com/effective-pandas-book
  - Effective Pandas 2 - Amazon https://www.amazon.com/gp/product/B0CSRGH8R3?ref_=dbs_m_mng_rwt_calw_tpbk_3&storeType=ebooks