Tim Orme - Simplicity For Scale: Analyzing 15 Million DNA Samples With Python

Video URL: https://www.youtube.com/watch?v=4PI5iHfO6PE

Contents

0:00 About Presenter
0:49 About Ancestry.com
1:25 About Ancestry.com Consumer Genomics Products and Results
3:49 Production Data Pipelines and Main Use Cases
5:35 About Ancestry.com’s Main Pipeline
7:06 Admixture Germline Tool in Perl
8:01 History and Challenges of the Pipeline
8:56 Challenge: Sequence Matching and Quadratic Performance
12:25 Solution: Distributed Computing
13:07 Re-Implementation in Hadoop
14:35 Re-Examining Implementation and Pipeline Problems
15:26 Hadoop: Scheduling and Batching
18:15 Ancestry.com Data Growth
20:44 Hadoop: Ecosystem Components and Use
21:42 Developing a New Matching Version
23:17 Switching to Python and the Justifications
24:57 Solving Batching Challenges: An Introduction to Celery
25:33 Celery: Usage Example
28:08 Celery: Application in Python
28:51 Revamped Ecosystem
29:47 Solving Matching Challenges: Accelerating Germline in C++
31:40 Results: CPU Graph and Acceleration Metrics
32:35 Python: Pros of Implementation
35:43 Takeaways
40:08 Acknowledgements
40:33 Q&A: Did you optimize the core algorithm?
42:44 Q&A: Did you also try Dask and/or compare it with Celery?
44:33 Q&A: Did you use a similar distributed pipeline for initial processing?
46:34 Q&A: Are you using graph databases here?
47:27 Q&A: Is any part of Germline open-source?
48:38 Q&A: What happens if there are duplicates in the data?
49:35 Q&A: What if the same person submits two samples over time?

Tim Orme - Simplicity For Scale: Analyzing 15 Million DNA Samples With Python | PyData LA 2019 #150