The following is a list of topics that need coverage in our demo notebooks. Feel free to open additional issues to break these up into scheduled/distributable work items.
[ ] Python library and Pipfile prerequisites and practices (touch on OperateFirst later)
[ ] Use of a credentials.env file (Catalog, Schema, UserID, useful Trino engine instance variables, defaults); see the connection sketch after this list
[ ] S3 Data Sources (see the S3 access sketch after this list):
    [ ] Public (unsigned) buckets (GLEIF, Aqueduct)
    [ ] Private buckets using S3 credentials (SPGI Sustainability Reports)
[ ] Federated Data Sources:
    [ ] Public data sources (SEC DERA data, PUDL data)
    [ ] Using licensed data with credentials (specific data providers to be named as examples)
    [ ] Using private data
[ ] Data Ingestion Formats (see the format-handling sketch after this list):
    [ ] CSV, Excel, JSON, XML, XBRL
    [ ] tar, zip, gzip
    [ ] GeoTIFF and Zarr
    [ ] SQLite
    [ ] Parquet
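To support the credentials.env item above, here is a minimal sketch of loading that file and opening a Trino connection from a notebook. It assumes the python-dotenv and trino packages; the variable names (TRINO_HOST, TRINO_USER, TRINO_PASSWD, etc.) and the defaults are illustrative, not the project's actual conventions.

```python
# Minimal sketch: read connection settings from credentials.env and open a Trino session.
# All environment-variable names and defaults below are assumptions for illustration.
import os

from dotenv import load_dotenv   # pip install python-dotenv
import trino                     # pip install trino

load_dotenv("credentials.env")   # copies KEY=VALUE pairs from the file into os.environ

conn = trino.dbapi.connect(
    host=os.environ["TRINO_HOST"],
    port=int(os.environ.get("TRINO_PORT", "443")),
    user=os.environ["TRINO_USER"],
    http_scheme="https",
    auth=trino.auth.BasicAuthentication(os.environ["TRINO_USER"], os.environ["TRINO_PASSWD"]),
    catalog=os.environ.get("TRINO_CATALOG", "hive"),   # default catalog
    schema=os.environ.get("TRINO_SCHEMA", "demo"),     # default schema
)

cur = conn.cursor()
cur.execute("SHOW TABLES")
print(cur.fetchall())
```

Keeping credentials in an env file rather than in the notebook itself matters once notebooks are shared or committed.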
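For the two S3 items, the sketch below shows both access patterns with boto3. Bucket names, object keys, and the S3_* variable names are placeholders.

```python
# Minimal sketch of the two S3 access patterns (bucket names and keys are placeholders).
import os

import boto3                       # pip install boto3
from botocore import UNSIGNED
from botocore.client import Config

# Public (unsigned) bucket: requests are sent without any credentials.
public_s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
public_s3.download_file("some-public-bucket", "gleif/lei-records.csv", "lei-records.csv")

# Private bucket: credentials come from credentials.env (variable names assumed).
private_s3 = boto3.client(
    "s3",
    aws_access_key_id=os.environ["S3_ACCESS_KEY"],
    aws_secret_access_key=os.environ["S3_SECRET_KEY"],
    endpoint_url=os.environ.get("S3_ENDPOINT"),   # for non-AWS S3-compatible stores
)
obj = private_s3.get_object(Bucket="some-private-bucket", Key="reports/example.pdf")
pdf_bytes = obj["Body"].read()
```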
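For the ingestion-format list, this sketch reads a handful of the simpler formats with pandas and lands them as Parquet. File and table names are placeholders, and the more specialized formats (XBRL, GeoTIFF, Zarr, tar archives) need their own libraries, which the individual notebooks can cover.

```python
# Minimal sketch: read several of the listed formats and land them as Parquet.
# File and table names are placeholders; XBRL, GeoTIFF, and Zarr need dedicated
# libraries (e.g. arelle, rasterio, xarray + zarr) and are not shown here.
import sqlite3

import pandas as pd   # pip install pandas pyarrow openpyxl lxml

csv_df = pd.read_csv("input/example.csv.gz")               # gzip decompressed transparently
xls_df = pd.read_excel("input/example.xlsx")               # Excel via openpyxl
json_df = pd.read_json("input/example.json", lines=True)   # newline-delimited JSON
xml_df = pd.read_xml("input/example.xml")                  # simple, flat XML

with sqlite3.connect("input/example.sqlite") as db:        # SQLite via the standard library
    sqlite_df = pd.read_sql("SELECT * FROM some_table", db)

# Parquet as the common landing format for downstream pipeline stages
for name, df in [("csv", csv_df), ("excel", xls_df), ("json", json_df),
                 ("xml", xml_df), ("sqlite", sqlite_df)]:
    df.to_parquet(f"output/{name}.parquet", index=False)
```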
Well-known data sources defined above (RMI, Aqueduct, etc.) can be used to further teach/illustrate:
[ ] Basic distinctions between Ingestion and Processing
[ ] Data as Code (especially documenting data cleanup)
[ ] ETL vs ELT vs EtLT
[ ] Orchestrating multi-stage pipelines
[ ] Data lineage
[ ] Data discovery
[ ] Data distribution
[ ] Event triggers (e.g., a new release of GLEIF or SEC data)
[ ] Data Commons access patterns:
    [ ] Trino CLI (old-school SQL) using a laptop terminal or a terminal local to the notebook
    [ ] SQLAlchemy (PUDL and other legacy "high-level" SQL interfaces); see the SQLAlchemy sketch after this list
    [ ] JDBC
    [ ] Superset (dashboards and BI) using a local laptop, a Docker container, or cloud-based resources
    [ ] Jupyter Notebooks (see below) using a local laptop, a Docker container, or cloud-based resources
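To go with the SQLAlchemy item above, here is a sketch of the same access pattern through SQLAlchemy engines, once against Trino and once against a local PUDL SQLite file. The connection details, file path, and table name are assumptions; the trino:// dialect comes from the trino package installed with its sqlalchemy extra.

```python
# Minimal sketch of the SQLAlchemy access pattern (connection details are assumed).
# The trino:// dialect is installed with `pip install trino[sqlalchemy]`.
import os

import pandas as pd
import trino
from sqlalchemy import create_engine, text

# Trino: catalog and schema go in the URL; credentials come from credentials.env
trino_engine = create_engine(
    f"trino://{os.environ['TRINO_USER']}@{os.environ['TRINO_HOST']}"
    f":{os.environ.get('TRINO_PORT', '443')}/hive/demo",
    connect_args={
        "http_scheme": "https",
        "auth": trino.auth.BasicAuthentication(os.environ["TRINO_USER"], os.environ["TRINO_PASSWD"]),
    },
)
with trino_engine.connect() as conn:
    tables = pd.read_sql(text("SHOW TABLES"), conn)

# PUDL (or any other legacy SQLite database): same API, different URL
pudl_engine = create_engine("sqlite:///pudl.sqlite")   # path is a placeholder
with pudl_engine.connect() as conn:
    # table name is illustrative; check the schema of the PUDL release you have
    plants = pd.read_sql(text("SELECT * FROM plants_entity_eia LIMIT 10"), conn)
```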
The meat of the work is using Jupyter Notebooks in the Data Pipeline architecture:
As pipelines are built, they can move through stages toward production:
Finally, some Orchestration Examples: