opensource-observer / oso

Measuring the impact of open source software
https://opensource.observer
Apache License 2.0
71 stars · 16 forks

Docs: Design doc for data warehouse #659

Closed: ryscheng closed this issue 9 months ago

ryscheng commented 11 months ago

What is the improvement or update you wish to see?

Not looking for a proper design doc (bullet points fine), but we should at least put a little more thought into it to include:

Is there any context that might help us understand?

Based on these, an initial proposal:

https://posthog.com/docs/how-posthog-works/clickhouse
https://clickhouse.com/docs/en/migrations/bigquery

Does the docs page already exist? Please link to it.

No response

ravenac95 commented 11 months ago

Some thoughts:

Other things I've found so far:

ravenac95 commented 11 months ago

How can we get blockchain transactions/traces for other blockchains (e.g. cryo vs. Substreams)?

At this time, probably both could work for us. However, in either case we will need infrastructure to run the nodes in question, or we'll need to pay for access to Alchemy or an equivalent provider. It's possible we could live on free tiers, but that seems unlikely if we want to get dumps of multiple chains. Additionally, in order to do historical analysis, we would likely exceed any free tiers available at Alchemy/Infura/etc. The issue is mostly that Google's public dataset does not have traces; if that were available, we'd be able to do everything we are currently interested in.

All this being said, cryo is likely the method we should use at this time. It doesn't need built-in node support the way Substreams does. Substreams (particularly the Firehose component it requires) is likely better in the long term, but it isn't viable for a chain like Optimism until Firehose is implemented in op-geth.
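To make the free-tier concern concrete, here is a rough back-of-the-envelope sketch; every number in it (block count, requests per block, quota) is a hypothetical placeholder, not a quote from Alchemy, Infura, or any other provider.

```python
# Rough estimate of RPC load for a historical trace backfill.
# All numbers below are illustrative assumptions, not provider quotes.

BLOCKS_PER_CHAIN = 18_000_000   # order of magnitude for a mature EVM chain
REQUESTS_PER_BLOCK = 2          # e.g. one trace call + one block fetch
CHAINS = 3                      # number of chains we want to dump

FREE_TIER_REQUESTS_PER_MONTH = 10_000_000  # hypothetical free-tier quota

total_requests = BLOCKS_PER_CHAIN * REQUESTS_PER_BLOCK * CHAINS
months_of_quota = total_requests / FREE_TIER_REQUESTS_PER_MONTH

print(f"total requests: {total_requests:,}")            # total requests: 108,000,000
print(f"free-tier months consumed: {months_of_quota:.1f}")
```

Even under these charitable assumptions, a multi-chain historical backfill burns many months of a free-tier quota up front, which is why paid access or our own nodes seems unavoidable.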

ravenac95 commented 11 months ago

How to reconcile event data from a product analytics tool (e.g. Amplitude) and on-chain user analytics?

This would have to be handled on a tool-by-tool basis. I would, however, propose that if a webhook or streaming feed of events is available from such a tool (and we can properly allowlist callers), that is how we accept inbound data. That would be the most "trustable" approach, assuming we can verify that the client is a specific service.

This idea needs more research and thought.
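As a sketch of what "trustable" inbound events could look like, here is a minimal allowlist-plus-HMAC check. The source names, secrets, and signature scheme are illustrative assumptions, not any particular analytics tool's actual webhook API.

```python
import hmac
import hashlib

# Hypothetical allowlist: only events from known integrations are accepted.
ALLOWED_SOURCES = {"amplitude"}
# One shared secret per integration (placeholder values).
SHARED_SECRETS = {"amplitude": b"s3cret"}

def verify_event(source: str, body: bytes, signature_hex: str) -> bool:
    """Accept an event only if its source is allowlisted AND its HMAC matches."""
    if source not in ALLOWED_SOURCES:
        return False
    expected = hmac.new(SHARED_SECRETS[source], body, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids timing side channels.
    return hmac.compare_digest(expected, signature_hex)

body = b'{"event": "page_view", "user": "0xabc"}'
good_sig = hmac.new(b"s3cret", body, hashlib.sha256).hexdigest()
print(verify_event("amplitude", body, good_sig))  # True
print(verify_event("posthog", body, good_sig))    # False: not allowlisted
```

The allowlist handles "who may send", while the signature handles "was this really sent by them"; both checks together approximate knowing the client is a specific service.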

ravenac95 commented 11 months ago

Comparing different OLAP databases (e.g. Clickhouse) on the tradeoffs

From my still somewhat limited research, some of the options are as follows:

At this time, there are really only three viable options for us to deploy: ClickHouse, Redshift, or BigQuery. Sadly, the other OSS OLAP databases would require managing our own infrastructure, which at this time would be too much additional work. They may be worth revisiting in the future, but for the next one or two orders of magnitude of scaling, I'd imagine we can get by with these three options.

Cost Overview

Cost comparison table

The numbers here are fairly arbitrary but give some standard for comparison. Compute/query times assume 1,000 queries/day over a 30-day month, where each query takes ~3 s and does a worst-case scan of all data in a 1 TB database.

For BigQuery capacity slots, we assume we reserve 6 vCPUs at $0.066/hr on-demand to match ClickHouse's 6 vCPUs.

For Redshift Serverless, we assume 3 RPUs (each RPU is supposedly 2 vCPUs).

| Service | Storage Cost (1 TB/month) | Server Costs (720 hrs) | Query Costs (1,000 queries/day/month) | Total Monthly Cost |
| --- | --- | --- | --- | --- |
| BigQuery (On-Demand) | $20 | $0 | $187,500.00 | $187,520.00 |
| BigQuery (Capacity) | $20 | $0 | $198.00 (minimum billing is 1 min) | $218.00 |
| Redshift (Serverless) | $24 | $0 | $360.00 | $380.00 |
| Redshift (On-Demand) | $0 (limited to 64 TB) | $180 | $0 | $180.00 |
| ClickHouse | $47.10 | $0 | $344.00 | $391.10 |
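For transparency, the arithmetic behind the two BigQuery rows can be reproduced as follows; the ~$6.25/TB on-demand rate is inferred from the table's totals and should be treated as an assumption rather than a quoted price.

```python
# Reproduce the BigQuery query-cost figures from the comparison table.
QUERIES_PER_DAY = 1000
DAYS = 30
TB_SCANNED_PER_QUERY = 1  # worst case: full scan of the 1 TB dataset

# On-demand: ~$6.25 per TB scanned (rate inferred from the table's totals).
on_demand_query_cost = QUERIES_PER_DAY * DAYS * TB_SCANNED_PER_QUERY * 6.25
print(on_demand_query_cost)  # 187500.0

# Capacity: 6 slots at $0.066/slot-hr; each ~3 s query is billed at the
# 1-minute minimum, so 30,000 queries -> 30,000 billed minutes = 500 hours.
billed_hours = QUERIES_PER_DAY * DAYS * (60 / 3600)
capacity_query_cost = billed_hours * 6 * 0.066
print(round(capacity_query_cost, 2))  # 198.0
```

Note how much the 1-minute minimum billing dominates the capacity estimate: the actual compute time (3 s/query) is 20x smaller than what gets billed.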

Scaling

At this time, scaling all of the systems is likely similar. There are various dimensions we can use for any given system but for sheer capacity all of the systems have ways for us to pay more to get more.

Interfaces

All of the systems provide a SQL or SQL-like interface. ClickHouse has the additional benefit of exposing a Postgres-compatible endpoint (potentially helpful in cases where we need to ingest dumps from other Postgres databases, like our own current one). Redshift and BigQuery also have good support for Postgres.

Open Source

Redshift and BigQuery are both closed-source services; we would be locked in if we relied on any service-specific feature of either.

ClickHouse is fully open source.

Performance

Sources

ravenac95 commented 11 months ago

Proposal

This is still a WIP.

Overview

Data warehouse architectural overview

Taking inspiration from both PostHog and Cloudflare, we will feed the pipeline of data bound for ClickHouse in batches. If the data we need is not already available as a BigQuery public dataset, it is generated by collectors and uploaded as CSV/JSON/Parquet files into GCS. Our process doesn't require real-time feeds, so we can load data directly from GCS without needing something like Kafka to manage periodic batching. At least at this time, our workloads are generally predictable, happening at regular cron intervals, and writes for all data should complete within the hour the data is uploaded to GCS.

Once data has landed in GCS, the next step in the pipeline is to load it into both ClickHouse and BigQuery. Data is duplicated into BigQuery to allow public querying of the data we collect; keeping the public dataset separate from our own internal ClickHouse server lets us ensure high-throughput scaling for the future.

Finally, once data has been loaded into ClickHouse, we can run any manner of dbt transformations on it for use by queries from any frontend (API, UI, etc.) clients.
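The batch flow described above can be sketched as follows. Everything here is a stand-in: a dict plays the GCS bucket, and in-memory tables play ClickHouse and BigQuery; none of this is our actual collector or loader code.

```python
import csv
import io

# Stand-in "GCS bucket": object name -> file bytes.
gcs: dict[str, bytes] = {}
# Stand-in destinations: table name -> list of rows.
clickhouse: dict[str, list] = {}
bigquery: dict[str, list] = {}

def collect_events() -> list[dict]:
    """A collector produces a batch of rows (here, fake on-chain events)."""
    return [{"block": 1, "tx": "0xaa"}, {"block": 2, "tx": "0xbb"}]

def upload_batch(bucket: dict, name: str, rows: list[dict]) -> None:
    """Serialize a batch as CSV and 'upload' it to the bucket."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
    bucket[name] = buf.getvalue().encode()

def load_into_warehouses(bucket: dict, name: str, table: str) -> None:
    """Load one uploaded object into BOTH destinations (data is duplicated)."""
    rows = list(csv.DictReader(io.StringIO(bucket[name].decode())))
    clickhouse.setdefault(table, []).extend(rows)
    bigquery.setdefault(table, []).extend(rows)

# One hourly batch: collect -> GCS -> ClickHouse + BigQuery
upload_batch(gcs, "events/2024-01-01T00.csv", collect_events())
load_into_warehouses(gcs, "events/2024-01-01T00.csv", "events")
print(len(clickhouse["events"]), len(bigquery["events"]))  # 2 2
```

The key structural point the sketch illustrates is that GCS is the single handoff: collectors only write objects, and both warehouses load from the same object, so the public BigQuery copy can never drift from what ClickHouse ingested.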

Seeding the open data collective

The data we load into BigQuery is the start of our open data collective. Along with the public dataset itself, a set of documentation will provide detailed descriptions of the columns available in the dataset. Both the documentation and the schemata should be versioned.
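As one possible shape for versioned, documented schemata, here is a hypothetical sketch; the table and column names are illustrative, not the actual public-dataset schema.

```python
# Hypothetical versioned schema: the version travels with the column docs,
# so documentation can be regenerated for any released schema version.
EVENTS_SCHEMA_V1 = {
    "version": "1.0.0",
    "table": "events",
    "columns": {
        "block_number": {"type": "INT64", "doc": "Block containing the event"},
        "tx_hash": {"type": "STRING", "doc": "Transaction hash"},
        "event_time": {"type": "TIMESTAMP", "doc": "Block timestamp (UTC)"},
    },
}

def render_docs(schema: dict) -> str:
    """Render a small markdown table of column docs from the schema."""
    lines = [
        f"## {schema['table']} (v{schema['version']})",
        "| column | type | description |",
        "| --- | --- | --- |",
    ]
    for name, meta in schema["columns"].items():
        lines.append(f"| {name} | {meta['type']} | {meta['doc']} |")
    return "\n".join(lines)

print(render_docs(EVENTS_SCHEMA_V1))
```

Generating the docs from the schema (rather than maintaining them by hand) keeps the two from drifting apart between versions.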

Cost Estimates

TBD

Alternatives

ravenac95 commented 10 months ago

So... I'm not going to edit the previous comment here, though my assertions there have changed fairly significantly.

Some new things discovered/learned/ideas changed:

Apologies; I've still been doing a lot of digging and will articulate more in a larger, separate message. This is just a braindump to help explain some of the decisions behind the architecture diagrams I'll propose.

ravenac95 commented 10 months ago

This just needs a slight readjustment of the architecture, but in general it is up to date. I will update it tomorrow.

ravenac95 commented 9 months ago

Here is the revised general architecture for what we envision:

[Revised architecture diagram]

The main revision is how we're thinking about how all of the data is moved around (cloudquery instead of airbyte) and the removal of clickhouse from the diagram. There might be a case to use clickhouse, but for simplicity's sake we are not going to build against an architecture dependent on it.

ravenac95 commented 9 months ago

Closing this issue for now but will include most of this in the documentation.

davidgasquez commented 9 months ago

Awesome!

The diagram looks on point, and there is always room to add ClickHouse or other destinations!