Scaling Data Science with Dask

Abstract (2-3 lines)

Python data science tools like pandas, NumPy, and scikit-learn are excellent. However, they mostly use a single core out of the many available in modern processors, and they are limited by your computer's RAM. In this tutorial, you'll learn to scale your data science workflow to larger datasets and models using Dask, by leveraging the full potential of your laptop, all while staying in the PyData ecosystem. You will learn the fundamentals of parallel and distributed computing, when (and when not) to consider scaling, and work through some hands-on examples.

Brief Description and Contents to be covered

Dask is an open source library for parallel and distributed computing in Python. This tutorial is meant to be an introduction to this super broad and powerful library. We will:

Build vocabulary: What is parallel and distributed computing? What are clusters? What do we mean by "scaling to the cloud"?
Introduce Dask: What is Dask? How does it work? Where is it used?
Learn the Dask DataFrame API, which mimics the pandas API -- how are the two APIs similar, and where do they differ?
Talk about Dask's Distributed Scheduler and explore Dask's (very cool) diagnostic Dashboards
Briefly cover the low-level Dask Delayed API, which can parallelize any general Python code
Conclude with some best practices and discuss resources for learning more

Pre-requisites for the talk

Programming fundamentals in Python (e.g., variables, data structures, for loops, etc.)
Some familiarity with NumPy, pandas, and scikit-learn
JupyterLab / Jupyter Notebooks
Basic familiarity with the shell/terminal

Time required for the talk

1 hr

Link to slides

https://github.com/pavithraes/dask-mini-tutorial/blob/main/slides.pdf

Will you be doing a hands-on demo as well?

Yes

Link to ipython notebook (if any)

https://github.com/pavithraes/dask-mini-tutorial

About yourself

My name is Pavithra Eswaramoorthy. I currently work as a Community Engagement Manager at Coiled, where I help support Dask users and contributors. I also contribute to the Bokeh project, and I previously helped administer the Wikimedia Foundation's open source outreach programs. In my spare time, I enjoy a good book and hot coffee. :)

Are you comfortable with the talk being recorded and uploaded to PyData Delhi's YouTube channel?

Yes
