ryscheng opened this issue 3 months ago
Moving this up based on discussions today. Here's what we decided to experiment with for the time being:
Some trade-offs:
Gut feeling: CH is probably the better long-term solution for exploratory analysis, dynamic queries, data science, and data pipelines. We'd need to get sponsors to pay for data storage and set up user metering/billing for compute, which is more work compared to BQ, but cleaner in the long run. It may be the case that we only have access to intermediate models for this, and not all raw data.
For data sharing, I'm not aware of anything better than BigQuery for permissionless data export. You can reference any dataset you want from your own account and either run queries or export to wherever you want, without help from us.
Just to set some bounds, it's likely that we won't move 100% of everything to ClickHouse. Its only method for distributed queries is explicit sharding. It's not a flexible distributed query system with query optimization, so on the compute side you start to lose some of the benefits of putting all your storage in a flexible layer like GCS/S3. https://clickhouse.com/docs/en/engines/table-engines/special/distributed
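To make the trade-off concrete: "explicit sharding" means the row-to-shard assignment is fixed up front by a sharding key, rather than planned per query by an optimizer. A minimal sketch of that routing decision (shard count and key are hypothetical, not our actual setup):

```python
import hashlib

def shard_for(key: str, num_shards: int = 4) -> int:
    """Route a row to a shard by hashing its sharding key.

    A stable hash guarantees the same key always lands on the
    same shard -- which is exactly why cross-shard joins and
    rebalancing become the application's problem, not the
    query planner's.
    """
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Rows with the same key are co-located; everything else
# requires shuffling data between shards yourself.
rows = ["repo/a", "repo/b", "repo/a", "repo/c"]
placement = [(r, shard_for(r)) for r in rows]
```

This is the same idea as ClickHouse's `Distributed` table engine with a sharding key, in contrast to BigQuery, which plans distribution dynamically over storage it manages.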
BigQuery is much better for these wide data scans joining across large datasets from a performance perspective.
https://clickhouse.com/docs/en/guides/sizing-and-hardware-recommendations
"For use cases with long retention periods for your data or with high data volumes, we recommend a 1:100 to 1:130 memory to storage ratio. For example, 100GB of RAM per replica if you are storing 10TB of data."
After a call with Zeyad, I'm more confident that ClickHouse probably isn't the end-all-be-all; it's for snappy OLAP queries (e.g. to power a front-end). It'll scale surprisingly well, but probably not to the degree of all of our raw datasets.
I think we should stick with BigQuery for our data pipeline for now, and definitely to store the datasets. If we need cheaper compute in the future, we can explore Spark or Trino or something like that. ClickHouse is still a great option to power a front-end in some way, especially if we can design a good intermediate model (e.g. a generalized event table) that's smaller and sufficiently multi-dimensional for exploration.
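To sketch what a "generalized event table" could mean: project heterogeneous raw events onto one narrow, shared schema that's small enough to live in ClickHouse but still multi-dimensional enough to slice in a front-end. The field names below are hypothetical, not an agreed schema:

```python
def to_generic_event(raw: dict) -> dict:
    """Project a raw, source-specific event (here shaped like a
    GHArchive record) onto a hypothetical shared event schema."""
    return {
        "time": raw["created_at"],
        "event_type": raw["type"],                        # e.g. "PushEvent"
        "actor": raw.get("actor", {}).get("login"),
        "repo": raw.get("repo", {}).get("name"),
        # One generic numeric measure, defaulting to 1 per event.
        "amount": raw.get("payload", {}).get("size", 1),
    }

raw = {
    "created_at": "2015-01-01T00:00:00Z",
    "type": "PushEvent",
    "actor": {"login": "octocat"},
    "repo": {"name": "octo/repo"},
    "payload": {"size": 3},
}
event = to_generic_event(raw)
```

The win is that one table like this is orders of magnitude smaller than the raw data, and every query the front-end needs becomes a filter/group-by over the same few columns.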
Deprioritizing for now
FWIW, ClickHouse hosts GHArchive as a public dataset (https://ghe.clickhouse.tech/), and the website shows some pretty impressive benchmarks.
But keep in mind they also pre-process the original GHArchive dataset into a generic event table first, shown here: https://github.com/ClickHouse/github-explorer https://ghe.clickhouse.tech/#download-the-dataset-directly-into-clickhouse
Moving more queries to Clickhouse would be amazing. Here's a guide for the transition: https://clickhouse.com/blog/clickhouse-bigquery-migrating-data-for-realtime-queries
Open question: what data would be moved over there, and what queries would we be trying to support?