sam-goodwin / eventual

Build scalable and durable micro-services with APIs, Messaging and Workflows
https://docs.eventual.ai
MIT License
174 stars · 4 forks

Data Lake - Redshift Serverless + Glue Catalog + Athena #367

Open sam-goodwin opened 1 year ago

sam-goodwin commented 1 year ago

Sticking with our data-heavy backend use cases, we should provide a convenient way to route data out into a scalable data lake backed by S3, a Glue Catalog, and maybe Redshift Serverless.

S3 + Glue is a cheap and scalable data lake, but it comes with the operational cost of maintaining partitions in S3, compacting small files, etc. A simple Firehose with Parquet transformation would be very useful and cost-effective. It's straightforward to use this data as a starting point/source of truth to then load into any other data warehouse.
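To make the partition-maintenance burden concrete, here is a small sketch of computing the Hive-style partition prefix that Glue/Athena expect for each record's event time. The helper name is hypothetical and not part of eventual's API.

```typescript
// Hypothetical helper: compute the Hive-style S3 partition prefix
// (year=/month=/day=/hour=) for a record's event time, in UTC.
// Firehose can do this for you via dynamic partitioning; this just
// illustrates what has to be kept consistent with the Glue Catalog.
export function hivePartitionPrefix(eventTime: Date): string {
  const pad = (n: number) => String(n).padStart(2, "0");
  return [
    `year=${eventTime.getUTCFullYear()}`,
    `month=${pad(eventTime.getUTCMonth() + 1)}`,
    `day=${pad(eventTime.getUTCDate())}`,
    `hour=${pad(eventTime.getUTCHours())}`,
  ].join("/");
}
```

Every new prefix also needs a matching partition registered in the Glue Catalog (or covered by partition projection) before Athena can query it, which is exactly the upkeep a managed warehouse would absorb.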

Redshift Serverless is especially compelling since the cost of compute is decoupled from storage and the database does a lot for us in terms of maintaining data. We can take advantage of it to manage vacuum/compaction, and it has managed storage backed by S3, so there don't seem to be many downsides. We should do a cost analysis.
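For the cost analysis, the decoupled model reduces to two independent terms: billed compute (RPU-hours) and managed storage (GB-months). The sketch below takes prices as parameters since they vary by region; the interface and function names are illustrative only.

```typescript
// Illustrative only: Redshift Serverless bills compute in RPU-hours and
// Redshift Managed Storage separately. Prices are region-specific, so they
// are inputs here rather than hard-coded values.
interface ServerlessCostInputs {
  rpuHoursPerMonth: number; // billed compute consumed by queries
  storageGb: number;        // data held in managed storage
  pricePerRpuHour: number;  // from the AWS pricing page for your region
  pricePerGbMonth: number;  // likewise region-specific
}

export function monthlyCost(i: ServerlessCostInputs): number {
  return (
    i.rpuHoursPerMonth * i.pricePerRpuHour + i.storageGb * i.pricePerGbMonth
  );
}
```

The useful property for our workloads: an idle database accrues only the storage term, unlike a provisioned cluster where compute is always on.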

What would the experience be? Could we use something like Drizzle ORM to model the relational schemas and then automatically create the tables in both Athena/Redshift? Or should we provide our own data model?

export const clicks = table("clicks", {
  columns: ..
});

// option 1 - put directly
export const onClickEvent = subscription("onClickEvent", {
  events: [ClickEvent],
}, async (event) => {
  await clicks.put(event);
});

// option 2 - a pipe (using Event Bridge Pipe direct integration)
export const archiveClicksPipe = ClickEvent.pipe("archiveClicks", clicks);
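A rough shape for the `table()` primitive used above could look like the sketch below. None of these names exist in eventual today; this is a design exploration, and the in-memory buffer stands in for the Firehose delivery stream.

```typescript
// Sketch of a possible table() API. put() enqueues rather than performing a
// direct DB insert; flush() drains the pending batch (simulating delivery).
type ColumnType = "string" | "number" | "boolean" | "timestamp";

export interface Table<T> {
  name: string;
  put(record: T): Promise<void>;
  flush(): T[];
}

export function table<T extends Record<string, unknown>>(
  name: string,
  props: { columns: Record<keyof T & string, ColumnType> }
): Table<T> {
  const pending: T[] = []; // stand-in for the Firehose delivery stream
  return {
    name,
    async put(record: T) {
      pending.push(record);
    },
    flush() {
      return pending.splice(0, pending.length);
    },
  };
}
```

Modeling columns as a typed record is what would let us generate both the Glue table and the Redshift DDL from one declaration, which is the Drizzle-style experience the question above is reaching for.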

This code doesn't insert directly into the table; it writes to Kinesis Firehose, which delivers the records to the data lake.
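Writing through Firehose means respecting its `PutRecordBatch` limits: at most 500 records and 4 MiB per call. A chunking sketch (the function name is hypothetical, not part of eventual):

```typescript
// Split serialized records into batches that fit Firehose's PutRecordBatch
// limits (500 records, 4 MiB per call). Sizes use UTF-8 byte length.
const MAX_RECORDS_PER_BATCH = 500;
const MAX_BYTES_PER_BATCH = 4 * 1024 * 1024;

export function batchRecords(records: string[]): string[][] {
  const batches: string[][] = [];
  let current: string[] = [];
  let currentBytes = 0;
  for (const record of records) {
    const bytes = Buffer.byteLength(record, "utf8");
    if (
      current.length >= MAX_RECORDS_PER_BATCH ||
      currentBytes + bytes > MAX_BYTES_PER_BATCH
    ) {
      if (current.length > 0) batches.push(current);
      current = [];
      currentBytes = 0;
    }
    current.push(record);
    currentBytes += bytes;
  }
  if (current.length > 0) batches.push(current);
  return batches;
}
```

The runtime behind `put()`/`pipe()` would own this batching (plus retries for partially failed batches), so user code stays a single `await clicks.put(event)`.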

Inspiration/integration opportunity: https://github.com/drizzle-team/drizzle-orm