If you are interested in becoming a co-author (any chunk of interesting content is fine), please ping us on Matrix
The self-service data science reference architecture
Description: This chapter introduces the open source data science platform and its components
Using the Storage Layer
Description: This chapter explains what a modern data lake storage layer looks like. It introduces low-level storage (Ceph for on-premises, S3 for cloud), a data mart for highly interactive queries (PostgreSQL) and data virtualization (Apache Iceberg/Teiid)
Data Ingestion and Transformation
Description: This chapter describes how to ingest data into the data lake. It covers standard tasks like getting data from a relational or non-relational data source into the data lake, including incremental updates and versioning. It also explains how to access the storage layer using standard SQL and how to ingest subsets of data into a faster storage provider (PostgreSQL) that supports interactive queries
Data Exploration
Description: This chapter explains how data exploration can be done efficiently. It covers pandas, the de facto standard for the task, and SQL, which is still used and understood by a majority of data scientists. Koalas provides the pandas API on top of Apache Spark. Besides slicing and dicing, data visualization is an integral part of this task, which is why seaborn is introduced. Finally, BakerX promises to support the complete task of data exploration in a low-code environment as a Jupyter plugin
Dashboarding
Description: This chapter explains how reports and dashboards can be created as a deliverable for end users. In practice, this is where most current data science projects end. It is also a good starting point for discussing further automation using machine learning
Model Development
Description: This chapter explains how machine learning models should be built. It doesn’t introduce the frameworks used, but shows and exemplifies best practices for using those tools in a production context
Model Assessment
Description: Machine learning (and especially deep learning) models need to be proven robust before production deployment. This chapter explains how to accomplish this in an automated way (CI/CD compatibility)
Model Deployment / CI/CD
Description: This chapter explains how to continuously integrate and deploy data products and machine learning models. It shows how reproducibility and transparency can be achieved.