Contents
0:00 About Presenter
0:49 About Ancestry.com
1:25 About Ancestry.com Consumer Genomics Products and Results
3:49 Production Data Pipelines and Main Use Cases
5:35 About Ancestry.com’s Main Pipeline
7:06 Admixture Germline Tool in Perl
8:01 History and Challenges of the Pipeline
8:56 Challenge — Sequence Matching and Quadratic Performance
12:25 Solution — Distributed Computing
13:07 Re-Implementation in Hadoop
14:35 Re-Examining Implementation and Pipeline Problems
15:26 Hadoop — Scheduling and Batching
18:15 Ancestry.com Data Growth
20:44 Hadoop — Ecosystem Components and Use
21:42 Developing A New Matching Version
23:17 Switching to Python and the Justifications
24:57 Solving Batching Challenges — An Introduction to Celery
25:33 Celery — Usage Example
28:08 Celery — Application in Python
28:51 Revamped Ecosystem
29:47 Solving Matching Challenges — Accelerating Germline in C++
31:40 Results — CPU Graph and Acceleration Metrics
32:35 Python — Pros of Implementation
35:43 Takeaways
40:08 Acknowledgements
40:33 Q&A — Did you optimize the core algorithm?
42:44 Q&A — Did you also try Dask and/or compare it with Celery?
44:33 Q&A — Did you use a similar distributed pipeline for initial processing?
46:34 Q&A — Are you using graph databases here?
47:27 Q&A — Is any part of Germline open-source?
48:38 Q&A — What happens if there are duplicates in the data?
49:35 Q&A — What if the same person submits two samples over time?
Video URL: https://www.youtube.com/watch?v=4PI5iHfO6PE