opensource-observer / oso

Measuring the impact of open source software
https://opensource.observer
Apache License 2.0
71 stars · 16 forks

Docs: Design doc for data warehouse #659

Closed: ryscheng closed this issue 9 months ago

ryscheng commented 11 months ago

What is the improvement or update you wish to see?

Not looking for a proper design doc (bullet points fine), but we should at least put a little more thought into it to include:

Is there any context that might help us understand?

Based on these, an initial proposal:

https://posthog.com/docs/how-posthog-works/clickhouse
https://clickhouse.com/docs/en/migrations/bigquery

Does the docs page already exist? Please link to it.

No response

ravenac95 commented 11 months ago

Some thoughts:

Other things I've found so far:

ravenac95 commented 11 months ago

How can we get blockchain transactions/traces for other blockchains (e.g. cryo vs. Substreams)?

At this time, probably both could work for us. However, in either case we will need infrastructure to run the nodes in question, or we'll need to pay for access to Alchemy or an equivalent provider. It's possible we could live on free tiers, but that seems unlikely if we want to get dumps of multiple chains. Additionally, in order to do historical analysis, we would likely exceed any free tiers available at Alchemy/Infura/etc. The issue is mostly that Google's public dataset does not have traces; if that were available, we'd be able to do everything we are currently interested in.

All this being said, cryo is likely the method we should use at this time. It doesn't need built-in node support the way Substreams does. Substreams (particularly the Firehose component it requires) is likely better in the long term, but it isn't viable for a chain like Optimism until Firehose is implemented in op-geth.
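To make the free-tier concern concrete, here is a rough back-of-the-envelope sketch; every number in it (block count, requests per block, quota) is a hypothetical placeholder, not a quote from Alchemy, Infura, or any other provider.

```python
# Rough estimate of RPC load for a historical trace backfill.
# All numbers below are illustrative assumptions, not provider quotes.

BLOCKS_PER_CHAIN = 18_000_000   # order of magnitude for a mature EVM chain
REQUESTS_PER_BLOCK = 2          # e.g. one trace call + one block fetch
CHAINS = 3                      # number of chains we want to dump

FREE_TIER_REQUESTS_PER_MONTH = 10_000_000  # hypothetical free-tier quota

total_requests = BLOCKS_PER_CHAIN * REQUESTS_PER_BLOCK * CHAINS
months_of_quota = total_requests / FREE_TIER_REQUESTS_PER_MONTH

print(f"total requests: {total_requests:,}")            # total requests: 108,000,000
print(f"free-tier months consumed: {months_of_quota:.1f}")
```

Even under these charitable assumptions, a multi-chain historical backfill burns many months of a free-tier quota up front, which is why paid access or our own nodes seems unavoidable.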

ravenac95 commented 11 months ago

How to reconcile event data from a product analytics tool (e.g. Amplitude) and on-chain user analytics?

This would have to be handled on a tool-by-tool basis. I would, however, propose that if a webhook or streaming feed of events is available from such a tool (and we can properly allowlist callers), that is how we accept inbound data. That would be the most "trustable" approach, assuming we can verify that the client is a specific service.

This idea needs more research and thought.
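As a sketch of what "trustable" inbound events could look like, here is a minimal allowlist-plus-HMAC check. The source names, secrets, and signature scheme are illustrative assumptions, not any particular analytics tool's actual webhook API.

```python
import hmac
import hashlib

# Hypothetical allowlist: only events from known integrations are accepted.
ALLOWED_SOURCES = {"amplitude"}
# One shared secret per integration (placeholder values).
SHARED_SECRETS = {"amplitude": b"s3cret"}

def verify_event(source: str, body: bytes, signature_hex: str) -> bool:
    """Accept an event only if its source is allowlisted AND its HMAC matches."""
    if source not in ALLOWED_SOURCES:
        return False
    expected = hmac.new(SHARED_SECRETS[source], body, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids timing side channels.
    return hmac.compare_digest(expected, signature_hex)

body = b'{"event": "page_view", "user": "0xabc"}'
good_sig = hmac.new(b"s3cret", body, hashlib.sha256).hexdigest()
print(verify_event("amplitude", body, good_sig))  # True
print(verify_event("posthog", body, good_sig))    # False: not allowlisted
```

The allowlist handles "who may send", while the signature handles "was this really sent by them"; both checks together approximate knowing the client is a specific service.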

ravenac95 commented 11 months ago

Comparing different OLAP databases (e.g. Clickhouse) on the tradeoffs

From my still somewhat limited research, some of the options are as follows:

At this time, there are really only three viable options for us to deploy: ClickHouse, Redshift, or BigQuery. Sadly, the other OSS OLAP databases would require managing our own infrastructure, which at this time would be too much additional work. They may be worth revisiting in the future, but for the next one or two orders of magnitude of scaling, I'd imagine we can get by with these three options.

Cost Overview

Cost comparison table

The numbers here are fairly arbitrary but give some standard for comparison. Compute/query times assume 1,000 queries/day over a 30-day month, where each query takes ~3 s and does a worst-case scan of all data in a 1 TB database.

For BigQuery capacity slots, we assume we reserve 6 vCPUs at $0.066/hr on-demand to match ClickHouse's 6 vCPUs.

For Redshift Serverless, we assume 3 RPUs (each RPU is supposedly 2 vCPUs).

| Service | Storage Cost (1 TB/month) | Server Costs (720 hrs) | Query Costs (1,000 queries/day/month) | Total Monthly Cost |
| --- | --- | --- | --- | --- |
| BigQuery (On-Demand) | $20 | $0 | $187,500.00 | $187,520.00 |
| BigQuery (Capacity) | $20 | $0 | $198.00 (minimum billing is 1 min) | $218.00 |
| Redshift (Serverless) | $24 | $0 | $360.00 | $380.00 |
| Redshift (On-Demand) | $0 (limited to 64 TB) | $180 | $0 | $180.00 |
| ClickHouse | $47.10 | $0 | $344.00 | $391.10 |
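For transparency, the arithmetic behind the two BigQuery rows can be reproduced as follows; the ~$6.25/TB on-demand rate is inferred from the table's totals and should be treated as an assumption rather than a quoted price.

```python
# Reproduce the BigQuery query-cost figures from the comparison table.
QUERIES_PER_DAY = 1000
DAYS = 30
TB_SCANNED_PER_QUERY = 1  # worst case: full scan of the 1 TB dataset

# On-demand: ~$6.25 per TB scanned (rate inferred from the table's totals).
on_demand_query_cost = QUERIES_PER_DAY * DAYS * TB_SCANNED_PER_QUERY * 6.25
print(on_demand_query_cost)  # 187500.0

# Capacity: 6 slots at $0.066/slot-hr; each ~3 s query is billed at the
# 1-minute minimum, so 30,000 queries -> 30,000 billed minutes = 500 hours.
billed_hours = QUERIES_PER_DAY * DAYS * (60 / 3600)
capacity_query_cost = billed_hours * 6 * 0.066
print(round(capacity_query_cost, 2))  # 198.0
```

Note how much the 1-minute minimum billing dominates the capacity estimate: the actual compute time (3 s/query) is 20x smaller than what gets billed.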

Scaling

At this time, scaling all of the systems is likely similar. There are various dimensions we can use for any given system but for sheer capacity all of the systems have ways for us to pay more to get more.

Interfaces

All of the systems provide a SQL or SQL-like interface. ClickHouse has the additional benefit of exposing a Postgres-compatible endpoint (potentially helpful in cases where we need to ingest dumps from other Postgres databases, like our own current one). Redshift and BigQuery also have good support for Postgres.

Open Source

Redshift and BigQuery are both closed-source services; we would be locked in if we relied on any service-specific feature of either.

ClickHouse is fully open source.

Performance

Sources

ravenac95 commented 11 months ago

Proposal

This is still a WIP.

Overview

Data warehouse architectural overview

Taking inspiration from both PostHog and Cloudflare, we will feed the pipeline of data bound for ClickHouse in batches. If the data we need is not already available as a BigQuery public dataset, it is generated by collectors and uploaded as CSV/JSON/Parquet files into GCS. Our process doesn't require real-time feeds, so we can load data directly from GCS without needing something like Kafka to manage periodic batching. At least at this time, our workloads are generally predictable, happening at regular cron intervals, and writes for all data should complete within the hour the data is uploaded to GCS.

Once data has landed in GCS, the next step in the pipeline is to load it into both ClickHouse and BigQuery. Data is duplicated into BigQuery to allow public querying of the data we collect; keeping the public dataset separate from our own internal ClickHouse server lets us ensure high-throughput scaling for the future.

Finally, once data has been loaded into ClickHouse, we can run any manner of dbt transformations on it for use by queries from any frontend (API, UI, etc.) clients.
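The batch flow described above can be sketched as follows. Everything here is a stand-in: a dict plays the GCS bucket, and in-memory tables play ClickHouse and BigQuery; none of this is our actual collector or loader code.

```python
import csv
import io

# Stand-in "GCS bucket": object name -> file bytes.
gcs: dict[str, bytes] = {}
# Stand-in destinations: table name -> list of rows.
clickhouse: dict[str, list] = {}
bigquery: dict[str, list] = {}

def collect_events() -> list[dict]:
    """A collector produces a batch of rows (here, fake on-chain events)."""
    return [{"block": 1, "tx": "0xaa"}, {"block": 2, "tx": "0xbb"}]

def upload_batch(bucket: dict, name: str, rows: list[dict]) -> None:
    """Serialize a batch as CSV and 'upload' it to the bucket."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
    bucket[name] = buf.getvalue().encode()

def load_into_warehouses(bucket: dict, name: str, table: str) -> None:
    """Load one uploaded object into BOTH destinations (data is duplicated)."""
    rows = list(csv.DictReader(io.StringIO(bucket[name].decode())))
    clickhouse.setdefault(table, []).extend(rows)
    bigquery.setdefault(table, []).extend(rows)

# One hourly batch: collect -> GCS -> ClickHouse + BigQuery
upload_batch(gcs, "events/2024-01-01T00.csv", collect_events())
load_into_warehouses(gcs, "events/2024-01-01T00.csv", "events")
print(len(clickhouse["events"]), len(bigquery["events"]))  # 2 2
```

The key structural point the sketch illustrates is that GCS is the single handoff: collectors only write objects, and both warehouses load from the same object, so the public BigQuery copy can never drift from what ClickHouse ingested.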

Seeding the open data collective

The data we load into BigQuery is the start of our open data collective. Along with the public dataset itself, a set of documentation will provide detailed descriptions of the columns available in the dataset. Both the documentation and the schemata should be versioned.
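As one possible shape for versioned, documented schemata, here is a hypothetical sketch; the table and column names are illustrative, not the actual public-dataset schema.

```python
# Hypothetical versioned schema: the version travels with the column docs,
# so documentation can be regenerated for any released schema version.
EVENTS_SCHEMA_V1 = {
    "version": "1.0.0",
    "table": "events",
    "columns": {
        "block_number": {"type": "INT64", "doc": "Block containing the event"},
        "tx_hash": {"type": "STRING", "doc": "Transaction hash"},
        "event_time": {"type": "TIMESTAMP", "doc": "Block timestamp (UTC)"},
    },
}

def render_docs(schema: dict) -> str:
    """Render a small markdown table of column docs from the schema."""
    lines = [
        f"## {schema['table']} (v{schema['version']})",
        "| column | type | description |",
        "| --- | --- | --- |",
    ]
    for name, meta in schema["columns"].items():
        lines.append(f"| {name} | {meta['type']} | {meta['doc']} |")
    return "\n".join(lines)

print(render_docs(EVENTS_SCHEMA_V1))
```

Generating the docs from the schema (rather than maintaining them by hand) keeps the two from drifting apart between versions.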

Cost Estimates

TBD

Alternatives

ravenac95 commented 10 months ago

So... I'm not going to edit the previous comment here, though my assertions there have changed fairly significantly.

Some new things discovered/learned/ideas changed:

Apologies; I've still been doing a lot of digging and will articulate more in a larger, separate message. This is just a braindump to help explain some of the decisions behind the architecture diagrams I'll propose.

ravenac95 commented 10 months ago

This just needs a slight readjustment of the architecture, but in general it is up to date. I will update it tomorrow.

ravenac95 commented 9 months ago

Here is the revised general architecture for what we envision:

[Revised architecture diagram]

The main revision is how we're thinking about how all of the data is moved around (cloudquery instead of airbyte) and the removal of clickhouse from the diagram. There might be a case to use clickhouse, but for simplicity's sake we are not going to build against an architecture dependent on it.

ravenac95 commented 9 months ago

Closing this issue for now but will include most of this in the documentation.

davidgasquez commented 9 months ago

Awesome!

The diagram looks on point, and there is always room to add ClickHouse or other destinations!