sam-goodwin / packyak

Self-hosted, version-controlled data engineering platform for AWS

Ingestion #1

Open sam-goodwin opened 8 months ago

sam-goodwin commented 8 months ago

The first step in refining data is ingestion - we gotta get data from somewhere!

Data comes from many places:

Data needs to be updated:

Data needs to be organized:

Data will be consumed:

sam-goodwin commented 8 months ago

Streams loaded into Tables

  1. Use a Pydantic model to define data structures (TODO: should SQLAlchemy be considered here? It may be better for synthesizing SQL; see the sketch after the example below)
  2. Create real-time, partitioned and persistent Streams. Streams have semantics supported by Kafka, Kinesis, Redpanda, etc. (a Kinesis sketch also follows the example)
  3. Create a Database for hosting the tables

from datetime import datetime

from pydantic import BaseModel
from refinery import DataCatalog, RedshiftDB

class ClickEvent(BaseModel):
  click_id: str
  click_time: datetime
  ...

# create a data catalog for managing my company's databases, schemas and tables
company_catalog = DataCatalog(name="company_catalog")

# create a schema for one of the teams, the "retail website"
retail_website = company_catalog.add_schema(name="retail_website")

# create a real-time, partitioned stream of click events
click_stream = company_catalog.add_stream[ClickEvent]("click_stream")

# create a table of click events, partitioned by click time
click_table = company_catalog.add_table[ClickEvent](
  name="clicks",
  partitioned_by="click_time",
)

# sink the stream of clicks into the table
click_stream.sink(click_table)

# create a Redshift Database and include the click_table
click_redshift_db = RedshiftDB(
  name="click_db",
  tables=[
    click_table
  ])
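
On the SQLAlchemy question from item 1: a minimal sketch of what the same model could look like with SQLAlchemy's declarative mapping. The Click class and column choices here are illustrative only, but the CreateTable construct shows the "better for synthesizing SQL" argument, since DDL falls straight out of the model.

from datetime import datetime

from sqlalchemy import DateTime, String
from sqlalchemy.dialects import postgresql
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column
from sqlalchemy.schema import CreateTable

class Base(DeclarativeBase):
  pass

class Click(Base):
  __tablename__ = "clicks"

  click_id: Mapped[str] = mapped_column(String, primary_key=True)
  click_time: Mapped[datetime] = mapped_column(DateTime)

# SQLAlchemy can synthesize the CREATE TABLE statement directly from the model
print(CreateTable(Click.__table__).compile(dialect=postgresql.dialect()))

One possible trade-off: Pydantic is lighter for validating records on the streaming path, while SQLAlchemy gives DDL generation for free, so a bridge between the two might be the eventual answer.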

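For item 2, a rough sketch of what publishing to a Kinesis-backed Stream could look like under the hood. The stream name and the choice of click_id as partition key are assumptions for illustration, not the actual wiring this library would generate.

from datetime import datetime, timezone

import boto3
from pydantic import BaseModel

class ClickEvent(BaseModel):
  click_id: str
  click_time: datetime

kinesis = boto3.client("kinesis")

event = ClickEvent(click_id="abc-123", click_time=datetime.now(timezone.utc))

kinesis.put_record(
  StreamName="click_stream",  # assumed stream name
  Data=event.model_dump_json().encode("utf-8"),  # Pydantic v2 JSON serialization
  PartitionKey=event.click_id,  # records with the same key land on the same shard
)
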
TODO: WIP