numfocus / YouTubeVideoTimestamps

Adding timestamps to NumFOCUS and PyData YouTube videos!
https://www.youtube.com/c/PyDataTV
MIT License

Cheryl Roberts - Parallelization of code in Python for beginners | PyData Global 2022 #155

Open emmcauley opened 1 year ago

emmcauley commented 1 year ago

Video URL: https://www.youtube.com/watch?v=bzdMHXDusOQ

Contents
0:06 Talk Introduction
1:45 Parallelization use cases
3:15 Use Case 1: No dependencies across data or analysis
4:19 Use Case 2: Model scoring on a per-record basis
4:57 Parallelization Anti-Example: ML model learning and training
6:01 Multithreading is not the same as multiprocessing
8:48 Key differences
9:42 Cores, CPUs, and computer memory
12:06 Use top to monitor processes
13:02 Multiprocessing suits more use cases and is used by joblib
14:28 Example ML workflow
16:18 Example Pre-processing: function vs joblib
18:57 Joblib hyperparameter tuning: job and chunk size
20:39 Writing a wrapper function for joblib
21:18 Calling joblib and the number of physical cores
22:22 joblib.Parallel and joblib.delayed
23:48 Results: timing
25:05 Results: data
25:57 Brief overview of GBM classifier hyperparameter tuning
26:50 Joblib passes large NumPy arrays by reference and avoids data duplication
27:52 Avoid writing to overlapping segments in memory
28:08 Avoid multiprocessing calls to external servers
28:36 Other tips and tricks: first see how runtime scales, avoid crashing jobs by increasing number of tasks, and be aware that complex records can cause CPU spikes
29:39 Resources