Contents
0:00 Acknowledgements and Sponsors
0:37 Housekeeping Items for In-Person Attendees
1:41 PyData Code of Conduct
2:41 Icebreaker for In-Person Attendees
4:20 Announcement: This Month in Data Science: DataFramed Podcast
5:02 Announcement: Open Jobs with Event Co-Sponsor
5:21 Announcement: New Meetup, PyData Toronto
5:34 Announcement: Next Event
5:55 Speaker Introduction by Host
6:17 Talk Introduction: PCA, t-SNE, UMAP
6:41 Speaker Introduction by Speaker
7:08 What is Dimension Reduction?
7:47 MNIST-Based Example
8:27 Fashion-MNIST Example
8:57 How do you do Dimension Reduction?
9:12 Technique 1: Matrix Factorization
9:39 Technique 2: Neighbor Graphs
10:02 Principal Component Analysis (PCA): Introduction
11:10 PCA: Algorithmic Underpinnings
12:27 PCA: Toy Input Data, Transformation, and Scatterplot
14:33 PCA: MNIST Digits Data Scatterplot
15:37 PCA: Fashion-MNIST Data Scatterplot
15:52 t-Distributed Stochastic Neighbor Embedding (t-SNE): Introduction
16:57 t-SNE: Algorithmic Underpinnings
23:10 t-SNE: MNIST Digits Data Scatterplot
23:42 t-SNE: Fashion-MNIST Data Scatterplot
24:01 Uniform Manifold Approximation and Projection (UMAP): Introduction
25:11 UMAP: Algorithmic Underpinnings (Topological Data Analysis and Simplicial Complexes)
27:35 UMAP: Toy Input Data, Transformation, and Scatterplot
28:47 UMAP Caveat: UMAP Needs Uniform Distribution of Data
29:46 UMAP: Define a Riemannian Metric on the Manifold to Conform to Uniform Distribution
30:06 UMAP: Brief Primer on Manifold Theory
31:50 UMAP: Fuzzy Cover Concept
33:31 UMAP Assumption: Manifold is Locally Connected
34:36 UMAP Distribution of Distances for 20 Nearest Neighbors
35:45 UMAP Local Metrics are Incompatible
37:52 UMAP: Toy Input Data, Transformation, and Graph
40:10 UMAP: MNIST Digits Data Scatterplot
40:41 UMAP: Fashion-MNIST Data Scatterplot
41:03 UMAP: Implementation and Constraints
41:58 Use of Numba Library
43:27 UMAP is Faster than t-SNE on 4 Datasets
44:24 Additional UMAP Use Cases
45:20 UMAP Can Use Labels for Supervised Dimension Reduction
46:28 UMAP Can Leverage Metric Learning
47:30 UMAP Scales Well to Many Different Labels, Distances, and Data Types
48:16 UMAP Can Work with Pandas DataFrames
48:38 Wrap-Up
48:44 Conclusion 1: PCA is Interpretable Dimension Reduction
49:07 Conclusion 2: t-SNE Works Great
49:11 Conclusion 3: UMAP Improved on t-SNE by Being Theoretically Grounded
49:30 UMAP GitHub Resource, Conda and Pip Packages
49:44 Q&A 1 — How do you assume uniform distribution and construct the manifold from that distribution?
50:32 Q&A 2 — In your MNIST example, does the positional information matter in dimension reduction like it does in conv-nets?
51:48 Q&A 3 — Can you start from latent space and find corresponding data?
52:14 Q&A 4 — What is an example of how to measure distance within UMAP?
53:11 Q&A 5 — How should we interpret position along the axes of the UMAP scatterplots?
54:06 Q&A 6 — How do you figure out what features were important for the cluster classification within UMAP?
55:07 Q&A 7 — What kind of distance metric is suitable for a categorical label?
55:42 Q&A 8 — Could you provide more context on the appropriate feature space for HDBSCAN and UMAP?
57:10 Q&A 9 — Have you tried UMAP on autocorrelated data?
57:42 Thank you!
Video URL: https://www.youtube.com/watch?v=YPJQydzTLwQ