Introduction to NumPy, tabular data and visualization
This is the second day's content. It introduces NumPy, the core numerical scientific computing library on which most of the Python ML ecosystem is built, and Pandas, which builds on NumPy for tabular data analysis. It also covers data visualization using Matplotlib.
Learning objective
Students will finish this module with an understanding of NumPy and Pandas, the premier numerical scientific computing and tabular data analysis libraries in Python. Specifically, students will understand how to manipulate (which will require a rudimentary understanding/review of algebra) and visualize data in array form.
Content to cover
A note on the difference between the standard library and other libraries
We'll be using import numpy as np here. Where is numpy?
A note on documentation of popular packages (link to NumPy, Pandas and Matplotlib docs).
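As a rough illustration of the standard-library question above, the sketch below imports one standard-library module (json, chosen arbitrarily) and NumPy, then prints where each lives on disk. The exact paths depend on how Python and NumPy were installed, so none are shown here.

```python
import json          # standard library: ships with Python itself

import numpy as np   # third-party: installed separately, e.g. with pip or conda

# A standard-library module lives inside the Python installation.
print(json.__file__)

# A third-party package usually lives in a "site-packages" (or similar) directory.
print(np.__file__)

# The installed version is useful when reading the matching documentation.
print(np.__version__)
```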
Introduction to NumPy
What is NumPy, and why do we care? Go over importance of NumPy in scientific computing.
What are NumPy's key features?
Creating arrays via np.array() (from other Python objects like lists).
Creating arrays from NumPy functions like np.arange, np.linspace and np.random.
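The array-creation topics above might be demonstrated along these lines. This is a minimal sketch: the variable names are made up for illustration, and np.random.default_rng is used as one of several ways to draw random numbers from np.random.

```python
import numpy as np

# From existing Python objects.
from_list = np.array([1, 2, 3, 4])          # 1-D array from a list
from_nested = np.array([[1, 2], [3, 4]])    # nested lists give a 2-D array

# From NumPy functions.
stepped = np.arange(0, 10, 2)               # start, stop (exclusive), step
spaced = np.linspace(0.0, 1.0, 5)           # 5 evenly spaced points, endpoints included

# Random numbers; a fixed seed makes the example repeatable.
rng = np.random.default_rng(seed=0)
noise = rng.normal(size=(2, 3))             # 2x3 array of standard normal samples

print(from_nested.shape, from_nested.dtype)
print(stepped)
print(spaced)
```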
Basic operations in NumPy
Element-wise operations.
Reshaping, stacking and splitting arrays.
Indexing and slicing one- and multi-dimensional arrays.
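One possible walkthrough of the basic operations listed above, using a small array so the results are easy to check by hand. The variable names are illustrative only.

```python
import numpy as np

a = np.arange(6)                      # [0 1 2 3 4 5]

# Element-wise arithmetic applies to every entry without a loop.
b = a * 2 + 1                         # [1 3 5 7 9 11]

# Reshaping, stacking and splitting.
grid = a.reshape(2, 3)                # [[0 1 2], [3 4 5]]
stacked = np.vstack([grid, grid])     # rows on top of each other: shape (4, 3)
side_by_side = np.hstack([grid, grid])  # columns next to each other: shape (2, 6)
parts = np.split(a, 3)                # three equal pieces: [0 1], [2 3], [4 5]

# Indexing and slicing.
first_row = grid[0]                   # [0 1 2]
last_column = grid[:, -1]             # [2 5]
corner = grid[1, 2]                   # 5
evens = a[a % 2 == 0]                 # boolean mask: [0 2 4]
```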
Advanced operations in NumPy
Broadcasting.
Statistical operations: mean, median, standard deviation, cumulative sum, etc.
Matrix multiplication, np.dot.
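The advanced operations above could be illustrated with a sketch like the following. The numbers are made up, and the @ operator is shown alongside np.dot because the two agree for the 1-D and 2-D arrays used here.

```python
import numpy as np

data = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])

# Broadcasting: the (3,) vector of column means is stretched across both rows.
col_means = data.mean(axis=0)         # [2.5 3.5 4.5]
centered = data - col_means           # [[-1.5 -1.5 -1.5], [1.5 1.5 1.5]]

# Statistical operations over the whole array.
overall_mean = data.mean()            # 3.5
median = np.median(data)              # 3.5
std = data.std()                      # population standard deviation
running_total = np.cumsum(data)       # flattens first: [1 3 6 10 15 21]

# Matrix multiplication.
weights = np.array([1.0, 0.0, -1.0])
projected = np.dot(data, weights)     # [-2. -2.]
same_result = data @ weights          # the @ operator gives the same answer here
gram = data @ data.T                  # (2, 3) times (3, 2) gives [[14 32], [32 77]]
```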
Introduction to Pandas
What is Pandas, and why is it important? Why do we need another numerical computing library built on top of NumPy?
What are Pandas's key features?
Introduction to the pd.DataFrame and pd.Series objects.
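A first look at the two core Pandas objects might resemble the sketch below. The track titles and values are invented for illustration. The point is that a DataFrame holds labelled columns of different types, while each column is a Series backed by a NumPy array.

```python
import numpy as np
import pandas as pd

# A Series is a 1-D labelled array.
tempos = pd.Series([120.0, 95.5, 140.2], index=["a", "b", "c"], name="tempo")

# A DataFrame is a table whose columns can have different types (made-up data).
tracks = pd.DataFrame({
    "title": ["Song A", "Song B", "Song C"],
    "tempo": [120.0, 95.5, 140.2],
    "explicit": [False, True, False],
})

print(tracks.dtypes)                  # one dtype per column, unlike a plain NumPy array

tempo_column = tracks["tempo"]        # selecting one column gives a Series
tempo_values = tempo_column.to_numpy()  # the underlying data as a NumPy array
```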
Creating Pandas objects
Creating a DataFrame from dictionaries, lists and NumPy arrays.
Reading and writing Pandas objects to and from disk.
Basic operations: selecting, filtering, manipulating data. Adding and deleting columns.
Indexing and slicing with .loc and .iloc. What's the difference?
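The following sketch touches each item in this part of the outline, under a few assumptions: the data is made up, CSV is used as the on-disk format (Pandas supports others), and the file is written to a temporary directory so nothing is left behind.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Three ways to build the same kind of table.
from_dict = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})
from_rows = pd.DataFrame([[1, 10], [2, 20], [3, 30]], columns=["x", "y"])
from_array = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["x", "y"])

# Round trip through a CSV file on disk.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.csv")
    from_dict.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Give the rows string labels so .loc and .iloc behave visibly differently.
df = from_dict.copy()
df.index = ["a", "b", "c"]

df["total"] = df["x"] + df["y"]       # add a column
big = df[df["y"] > 10]                # filter rows with a boolean mask
df = df.drop(columns=["total"])       # delete a column

# .loc selects by label and includes the end label.
by_label = df.loc["a":"b", "y"]       # rows "a" and "b"
# .iloc selects by integer position and excludes the end position.
by_position = df.iloc[0:1, 1]         # row at position 0 only
```

The slice in .loc returns two rows while the similar-looking slice in .iloc returns one, which is usually the first difference students run into.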
Data manipulation
Identifying and dealing with missing data ("not-a-number"s).
Applying functions to columns. Generating new columns from functions.
Combining tables.
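The data manipulation topics above can be sketched as follows. The student names, scores and the pass mark of 75 are invented for illustration, and filling with the column mean is only one of several reasonable ways to handle missing values.

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({
    "student": ["ann", "bob", "cy"],
    "score": [90.0, np.nan, 70.0],    # bob's score is missing
})

# Identify missing values.
missing_mask = scores["score"].isna()                 # False, True, False

# Two common ways to deal with them.
dropped = scores.dropna(subset=["score"])             # remove incomplete rows
filled = scores.fillna({"score": scores["score"].mean()})  # mean ignores NaN: 80.0

# Apply a function to a column and store the result as a new column.
filled["passed"] = filled["score"].apply(lambda s: s >= 75)

# Combine tables: merge on a shared key, or stack rows with concat.
groups = pd.DataFrame({"student": ["ann", "bob", "dee"], "group": ["A", "B", "A"]})
merged = filled.merge(groups, on="student", how="left")   # cy has no group, so NaN
stacked = pd.concat([scores, scores], ignore_index=True)  # 6 rows
```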
Introduction to data analysis
Correlation between numeric columns.
Plotting histograms of columns (combining knowledge from day 1 with this module).
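A small sketch of both analysis steps, using synthetic data so the correlation is known in advance: y is built from x plus noise, while z is independent. The Agg backend line is an assumption made so the example runs without a display; it can be removed in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({
    "x": x,
    "y": 2 * x + rng.normal(scale=0.5, size=200),  # strongly related to x
    "z": rng.normal(size=200),                     # unrelated to x
})

# Pairwise Pearson correlation between the numeric columns.
corr = df.corr()
print(corr.round(2))

# Histogram of one column, drawn through Pandas' Matplotlib integration.
ax = df["x"].plot.hist(bins=20)
ax.set_xlabel("x")
```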
Introduction to Matplotlib
Why is visualizing data important?
What are Matplotlib's key features?
Basic plotting: line, scatter and bar plots.
Basic customizing of plots, including titles, labels and legends.
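The basic plot types and customizations above might be shown with something like this sketch. The data and label text are placeholders, and the Agg backend line is only there so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 50)

# Line and scatter plots on the same axes.
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.scatter(x[::5], np.cos(x[::5]), color="tab:orange", label="cos(x) samples")

# Title, axis labels and legend.
ax.set_title("Sine and cosine")
ax.set_xlabel("x (radians)")
ax.set_ylabel("value")
ax.legend()

# Bar plot on a separate figure (made-up category counts).
fig_bar, ax_bar = plt.subplots()
ax_bar.bar(["a", "b", "c"], [3, 5, 2])
ax_bar.set_ylabel("count")
```

In a script, calling plt.show() at the end opens the figures when an interactive backend is active.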
Advanced plotting
Multiple plots in one figure (subplots).
Colorbars on scatter plots.
Heatmaps.
Histograms/density plots (students might have accidentally done this in the Pandas section!).
Boxplots.
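The advanced plot types above can be combined into a single figure, as in the sketch below. All data is random and purely illustrative, the colormap names are arbitrary choices, and imshow is used as one simple way to draw a heatmap.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = rng.normal(size=100)
values = x + y                        # used to color the scatter points
grid = rng.random((5, 5))             # small matrix for the heatmap

# Four plots in one figure.
fig, axes = plt.subplots(2, 2, figsize=(8, 6))

# Scatter plot with a colorbar.
points = axes[0, 0].scatter(x, y, c=values, cmap="viridis")
fig.colorbar(points, ax=axes[0, 0], label="x + y")
axes[0, 0].set_title("Scatter with colorbar")

# Heatmap.
image = axes[0, 1].imshow(grid, cmap="magma")
fig.colorbar(image, ax=axes[0, 1])
axes[0, 1].set_title("Heatmap")

# Histogram normalized to a density.
axes[1, 0].hist(x, bins=20, density=True)
axes[1, 0].set_title("Histogram")

# Boxplot comparing two samples.
axes[1, 1].boxplot([x, y])
axes[1, 1].set_xticklabels(["x", "y"])
axes[1, 1].set_title("Boxplot")

fig.tight_layout()
```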
Capstone
Students will pretend to be data scientists at a company, tasked with presenting an analysis of a dataset to management. Students should go to Hugging Face, Kaggle, or another open-access online dataset platform, then download and analyze a dataset of their choice. Emphasis should be placed on visualizing the data (remember, management doesn't have time to read a bunch of text or tabular data; they want to see informative figures!). For example, the Spotify Tracks Dataset on Hugging Face is a good place to start.