sam-goodwin / packyak

Self-hosted, version-controlled data engineering platform for AWS

Ingestion #1

Open sam-goodwin opened 8 months ago

sam-goodwin commented 8 months ago

The first step in refining data is ingestion - we gotta get data from somewhere!

Data comes from many places:

Data needs to be updated:

Data needs to be organized:

Data will be consumed:

sam-goodwin commented 8 months ago

Streams loaded into Tables

  1. Use a Pydantic model to define data structures (TODO: should SQLAlchemy be considered here? It may be better for synthesizing SQL; see the sketch after the example below)
  2. Create real-time, partitioned and persistent Streams. Streams have semantics supported by Kafka, Kinesis, Redpanda, etc. (a Kinesis sketch also follows the example)
  3. Create a Database for hosting the tables

from datetime import datetime

from pydantic import BaseModel
from refinery import DataCatalog, RedshiftDB

class ClickEvent(BaseModel):
  click_id: str
  click_time: datetime
  ...

# create a data catalog for managing my company's databases, schemas and tables
company_catalog = DataCatalog(name="company_catalog")

# create a schema for one of the teams, the "retail website"
retail_website = company_catalog.add_schema(name="retail_website")

# create a real-time, partitioned stream of click events
click_stream = company_catalog.add_stream[ClickEvent]("click_stream")

# create a table of click events, partitioned by click time
click_table = company_catalog.add_table[ClickEvent](
  name="clicks",
  partitioned_by="click_time",
)

# sink the stream of clicks into the table
click_stream.sink(click_table)

# create a Redshift Database and include the click_table
click_redshift_db = RedshiftDB(
  name="click_db",
  tables=[
    click_table
  ])
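
On the SQLAlchemy question from item 1: a minimal sketch of what the same model could look like with SQLAlchemy's declarative mapping. The Click class and column choices here are illustrative only, but the CreateTable construct shows the "better for synthesizing SQL" argument, since DDL falls straight out of the model.

from datetime import datetime

from sqlalchemy import DateTime, String
from sqlalchemy.dialects import postgresql
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column
from sqlalchemy.schema import CreateTable

class Base(DeclarativeBase):
  pass

class Click(Base):
  __tablename__ = "clicks"

  click_id: Mapped[str] = mapped_column(String, primary_key=True)
  click_time: Mapped[datetime] = mapped_column(DateTime)

# SQLAlchemy can synthesize the CREATE TABLE statement directly from the model
print(CreateTable(Click.__table__).compile(dialect=postgresql.dialect()))

One possible trade-off: Pydantic is lighter for validating records on the streaming path, while SQLAlchemy gives DDL generation for free, so a bridge between the two might be the eventual answer.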

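For item 2, a rough sketch of what publishing to a Kinesis-backed Stream could look like under the hood. The stream name and the choice of click_id as partition key are assumptions for illustration, not the actual wiring this library would generate.

from datetime import datetime, timezone

import boto3
from pydantic import BaseModel

class ClickEvent(BaseModel):
  click_id: str
  click_time: datetime

kinesis = boto3.client("kinesis")

event = ClickEvent(click_id="abc-123", click_time=datetime.now(timezone.utc))

kinesis.put_record(
  StreamName="click_stream",  # assumed stream name
  Data=event.model_dump_json().encode("utf-8"),  # Pydantic v2 JSON serialization
  PartitionKey=event.click_id,  # records with the same key land on the same shard
)
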
TODO: WIP