Eduardo Blancas - Developing Maintainable Data Pipelines With Jupyter and Ploomber | PyData Chicago

1:33 - Introduction

2:01 - Before we start

3:28 - Problem: Jupyter Notebook code reviews are confusing

5:15 - Problem: It is difficult to collaborate over Jupyter Notebooks

6:43 - Solution: Scripts as notebooks (use jupytext to allow source code to be .py files instead of .ipynb)

8:03 - Solution: Modularization (break down a Jupyter Notebook into multiple files with clear boundaries)

11:28 - Solution: Testing (Ploomber allows users to embed data quality tests)

12:55 - Solution: Reproducibility and Collaboration

13:25 - Demo

14:08 - Starting a new project

14:42 - pipeline.yaml explanation

18:05 - Demoing Scripts as notebooks

19:00 - ploomber plot

20:39 - Automatically created cells based on upstream dependencies and pipeline.yaml preferences

22:36 - Showing a different pipeline that has some logic

24:51 - How one would run their pipeline

26:20 - Demoing incremental builds feature (modularization)

28:31 - Demoing testing

29:20 - Cloud

30:01 - Conclusion

30:30 - Questions begin

30:36 - Q1: Can tasks be executed in parallel?

31:14 - Q2: How is this different from Elyra?

33:28 - Q3: Is there a specific part of this [data science] workflow that you think Ploomber is better for?

36:19 - Q4: Is Ploomber a hobby or full time for you?

39:05 - Q5: Can the input be a Jupyter file or does it have to be a .py file?

40:06 - Q6: Do you imagine the input script could be a SQL script or something else in the future?

42:12 - Q7: Is there a way to specify software (MATLAB, etc.) in the pipeline?

43:00 - Questions end
