Contents
0:00 Acknowledgements and Sponsors
0:37 Housekeeping Items for In-Person Attendees
1:41 PyData Code of Conduct
2:41 Icebreaker for In-Person Attendees
4:20 Announcement: This Month in Data Science: DataFramed Podcast
5:02 Announcement: Open Jobs with Event Co-Sponsor
5:21 Announcement: New Meetup, PyData Toronto
5:34 Announcement: Next Event
5:55 Speaker Introduction by Host
6:17 Talk Introduction: PCA, t-SNE, UMAP
6:41 Speaker Introduction by Speaker
7:08 What is Dimension Reduction?
7:47 MNIST-Based Example
8:27 Fashion-MNIST Example
8:57 How do you do Dimension Reduction?
9:12 Technique 1: Matrix Factorization
9:39 Technique 2: Neighbor Graphs
10:02 Principal Component Analysis (PCA): Introduction
11:10 PCA: Algorithmic Underpinnings
12:27 PCA: Toy Input Data, Transformation, and Scatterplot
14:33 PCA: MNIST Digits Data Scatterplot
15:37 PCA: Fashion-MNIST Data Scatterplot
15:52 t-Distributed Stochastic Neighbor Embedding (t-SNE): Introduction
16:57 t-SNE: Algorithmic Underpinnings
23:10 t-SNE: MNIST Digits Data Scatterplot
23:42 t-SNE: Fashion-MNIST Data Scatterplot
24:01 Uniform Manifold Approximation and Projection (UMAP): Introduction
25:11 UMAP: Algorithmic Underpinnings (Topological Data Analysis and Simplicial Complexes)
27:35 UMAP: Toy Input Data, Transformation, and Scatterplot
28:47 UMAP Caveat: UMAP Needs Uniform Distribution of Data
29:46 UMAP: Define a Riemannian Metric on the Manifold to Conform to Uniform Distribution
30:06 UMAP: Brief Primer on Manifold Theory
31:50 UMAP: Fuzzy Cover Concept
33:31 UMAP Assumption: Manifold is Locally Connected
34:36 UMAP Distribution of Distances for 20 Nearest Neighbors
35:45 UMAP Local Metrics are Incompatible
37:52 UMAP: Toy Input Data, Transformation, and Graph
40:10 UMAP: MNIST Digits Data Scatterplot
40:41 UMAP: Fashion-MNIST Data Scatterplot
41:03 UMAP: Implementation and Constraints
41:58 Use of Numba Library
43:27 UMAP is Faster than t-SNE on 4 Datasets
44:24 Additional UMAP Use Cases
45:20 UMAP Can Use Labels for Supervised Dimension Reduction
46:28 UMAP Can Leverage Metric Learning
47:30 UMAP Scales Well to Many Different Labels, Distances, and Data Types
48:16 UMAP Can Work with Pandas DataFrames
48:38 Wrap-Up
48:44 Conclusion 1: PCA is Interpretable Dimension Reduction
49:07 Conclusion 2: t-SNE Works Great
49:11 Conclusion 3: UMAP Improved on t-SNE by Being Theoretically Grounded
49:30 UMAP GitHub Resource, Conda and Pip Packages
49:44 Q&A 1 — How do you assume uniform distribution and construct the manifold from that distribution?
50:32 Q&A 2 — In your MNIST example, does the positional information matter in dimension reduction like it does in conv-nets?
51:48 Q&A 3 — Can you start from latent space and find corresponding data?
52:14 Q&A 4 — What is an example of how to measure distance within UMAP?
53:11 Q&A 5 — How should we interpret position along the axes of the UMAP scatterplots?
54:06 Q&A 6 — How do you figure out what features were important for the cluster classification within UMAP?
55:07 Q&A 7 — What kind of distance metric is suitable for a categorical label?
55:42 Q&A 8 — Could you provide more context on the appropriate feature space for HDBSCAN and UMAP?
57:10 Q&A 9 — Have you tried UMAP on autocorrelated data?
57:42 Thank you!
Video URL: https://www.youtube.com/watch?v=YPJQydzTLwQ