The following is a list of topics that need coverage in our demo notebooks. Feel free to open additional issues to break these up into scheduled/distributable work items.
[ ] Python library and Pipfile prerequisites and practices (touch on OperateFirst later)
[ ] Use of a credentials.env file (Catalog, Schema, UserID, useful Trino engine instance variables, defaults); see the connection sketch after this list
[ ] S3 Data Sources (see the S3 access sketch after this list):
    [ ] Public (unsigned) buckets (GLEIF, Aqueduct)
    [ ] Private buckets using S3 credentials (SPGI Sustainability Reports)
[ ] Federated Data Sources:
    [ ] Public data sources (SEC DERA data, PUDL data)
    [ ] Using licensed data with credentials (specific data providers to be named as examples)
    [ ] Using private data
[ ] Data Ingestion Formats (see the format-handling sketch after this list):
    [ ] CSV, Excel, JSON, XML, XBRL
    [ ] tar, zip, gzip
    [ ] GeoTIFF and Zarr
    [ ] SQLite
    [ ] Parquet
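To support the credentials.env item above, here is a minimal sketch of loading that file and opening a Trino connection from a notebook. It assumes the python-dotenv and trino packages; the variable names (TRINO_HOST, TRINO_USER, TRINO_PASSWD, etc.) and the defaults are illustrative, not the project's actual conventions.

```python
# Minimal sketch: read connection settings from credentials.env and open a Trino session.
# All environment-variable names and defaults below are assumptions for illustration.
import os

from dotenv import load_dotenv   # pip install python-dotenv
import trino                     # pip install trino

load_dotenv("credentials.env")   # copies KEY=VALUE pairs from the file into os.environ

conn = trino.dbapi.connect(
    host=os.environ["TRINO_HOST"],
    port=int(os.environ.get("TRINO_PORT", "443")),
    user=os.environ["TRINO_USER"],
    http_scheme="https",
    auth=trino.auth.BasicAuthentication(os.environ["TRINO_USER"], os.environ["TRINO_PASSWD"]),
    catalog=os.environ.get("TRINO_CATALOG", "hive"),   # default catalog
    schema=os.environ.get("TRINO_SCHEMA", "demo"),     # default schema
)

cur = conn.cursor()
cur.execute("SHOW TABLES")
print(cur.fetchall())
```

Keeping credentials in an env file rather than in the notebook itself matters once notebooks are shared or committed.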
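For the two S3 items, the sketch below shows both access patterns with boto3. Bucket names, object keys, and the S3_* variable names are placeholders.

```python
# Minimal sketch of the two S3 access patterns (bucket names and keys are placeholders).
import os

import boto3                       # pip install boto3
from botocore import UNSIGNED
from botocore.client import Config

# Public (unsigned) bucket: requests are sent without any credentials.
public_s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
public_s3.download_file("some-public-bucket", "gleif/lei-records.csv", "lei-records.csv")

# Private bucket: credentials come from credentials.env (variable names assumed).
private_s3 = boto3.client(
    "s3",
    aws_access_key_id=os.environ["S3_ACCESS_KEY"],
    aws_secret_access_key=os.environ["S3_SECRET_KEY"],
    endpoint_url=os.environ.get("S3_ENDPOINT"),   # for non-AWS S3-compatible stores
)
obj = private_s3.get_object(Bucket="some-private-bucket", Key="reports/example.pdf")
pdf_bytes = obj["Body"].read()
```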
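For the ingestion-format list, this sketch reads a handful of the simpler formats with pandas and lands them as Parquet. File and table names are placeholders, and the more specialized formats (XBRL, GeoTIFF, Zarr, tar archives) need their own libraries, which the individual notebooks can cover.

```python
# Minimal sketch: read several of the listed formats and land them as Parquet.
# File and table names are placeholders; XBRL, GeoTIFF, and Zarr need dedicated
# libraries (e.g. arelle, rasterio, xarray + zarr) and are not shown here.
import sqlite3

import pandas as pd   # pip install pandas pyarrow openpyxl lxml

csv_df = pd.read_csv("input/example.csv.gz")               # gzip decompressed transparently
xls_df = pd.read_excel("input/example.xlsx")               # Excel via openpyxl
json_df = pd.read_json("input/example.json", lines=True)   # newline-delimited JSON
xml_df = pd.read_xml("input/example.xml")                  # simple, flat XML

with sqlite3.connect("input/example.sqlite") as db:        # SQLite via the standard library
    sqlite_df = pd.read_sql("SELECT * FROM some_table", db)

# Parquet as the common landing format for downstream pipeline stages
for name, df in [("csv", csv_df), ("excel", xls_df), ("json", json_df),
                 ("xml", xml_df), ("sqlite", sqlite_df)]:
    df.to_parquet(f"output/{name}.parquet", index=False)
```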
Well-known data sources defined above (RMI, Aqueduct, etc.) can be used to further teach/illustrate:
[ ] Basic distinctions between Ingestion and Processing
[ ] Data as Code (especially documenting data cleanup)
[ ] ETL vs ELT vs EtLT
[ ] Orchestrating multi-stage pipelines
[ ] Data lineage
[ ] Data discovery
[ ] Data distribution
[ ] Event triggers (e.g., a new release of GLEIF or SEC data)
[ ] Data Commons access patterns:
    [ ] Trino CLI (old-school SQL) using a laptop terminal or a terminal local to the notebook
    [ ] SQLAlchemy (PUDL and other legacy "high-level" SQL interfaces); see the SQLAlchemy sketch after this list
    [ ] JDBC
    [ ] Superset (dashboards and BI) using a local laptop, a Docker container, or cloud-based resources
    [ ] Jupyter Notebooks (see below) using a local laptop, a Docker container, or cloud-based resources
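To go with the SQLAlchemy item above, here is a sketch of the same access pattern through SQLAlchemy engines, once against Trino and once against a local PUDL SQLite file. The connection details, file path, and table name are assumptions; the trino:// dialect comes from the trino package installed with its sqlalchemy extra.

```python
# Minimal sketch of the SQLAlchemy access pattern (connection details are assumed).
# The trino:// dialect is installed with `pip install trino[sqlalchemy]`.
import os

import pandas as pd
import trino
from sqlalchemy import create_engine, text

# Trino: catalog and schema go in the URL; credentials come from credentials.env
trino_engine = create_engine(
    f"trino://{os.environ['TRINO_USER']}@{os.environ['TRINO_HOST']}"
    f":{os.environ.get('TRINO_PORT', '443')}/hive/demo",
    connect_args={
        "http_scheme": "https",
        "auth": trino.auth.BasicAuthentication(os.environ["TRINO_USER"], os.environ["TRINO_PASSWD"]),
    },
)
with trino_engine.connect() as conn:
    tables = pd.read_sql(text("SHOW TABLES"), conn)

# PUDL (or any other legacy SQLite database): same API, different URL
pudl_engine = create_engine("sqlite:///pudl.sqlite")   # path is a placeholder
with pudl_engine.connect() as conn:
    # table name is illustrative; check the schema of the PUDL release you have
    plants = pd.read_sql(text("SELECT * FROM plants_entity_eia LIMIT 10"), conn)
```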
The meat of the work is using Jupyter Notebooks in the Data Pipeline architecture:
As pipelines are built, they can move through stages toward production:
Finally, some Orchestration Examples: