opensource-observer / oso

Measuring the impact of open source software
https://opensource.observer
Apache License 2.0

Clickhouse for OLAP queries #1002

Open ryscheng opened 3 months ago

ryscheng commented 3 months ago

What is it?

Moving more queries to Clickhouse would be amazing. Here's a guide for the transition: https://clickhouse.com/blog/clickhouse-bigquery-migrating-data-for-realtime-queries

Open questions: what data gets moved over there, and what queries would we be trying to support?

ryscheng commented 3 months ago

Moving this up based on discussions today. Here's what we decided to experiment with for the time being:

Some trade-offs:

Gut feeling: CH is probably the better long-term solution for exploratory analysis, dynamic queries, data science, and data pipelines. We'd need to get sponsors to pay for data storage and set up user metering/billing for compute, which is more work compared to BQ, but cleaner in the long run. It may be the case that we only have access to intermediate models for this, and not all the raw data.

For data sharing, I'm not aware of anything better than BigQuery for permissionless data export. You can reference any dataset you want from your own account and either run queries or export to wherever you want, without help from us.

ryscheng commented 3 months ago

Just to set some bounds, it's likely that we won't move everything to Clickhouse. Their only method for distributed queries is explicit sharding. It's not a flexible distributed query system with query optimization, so on the compute side you start to lose some of the benefits of putting all your storage in a flexible layer like GCS/S3. https://clickhouse.com/docs/en/engines/table-engines/special/distributed

From a performance perspective, BigQuery is much better for these wide data scans joining across large datasets.
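To make the "explicit sharding" point concrete, here's a minimal sketch (not ClickHouse's actual implementation, just the routing idea its Distributed engine is built on): each row lands on a fixed shard determined by hashing a sharding key, so cross-shard joins have to move data between nodes instead of relying on an optimizer to plan a flexible scan.

```python
import hashlib

# Hypothetical sketch of explicit sharding: a row is routed to a shard by
# hashing its sharding key. Rows with the same key always land on the same
# shard, which is what makes single-shard lookups fast and cross-shard
# joins expensive.

def shard_for(key: str, num_shards: int) -> int:
    """Pick a shard index for a row based on its sharding key."""
    digest = hashlib.sha256(key.encode()).hexdigest()
    return int(digest, 16) % num_shards

# Group some example repo names by the shard they would route to.
shards: dict[int, list[str]] = {}
for repo in ["oso", "gharchive", "oso", "clickhouse"]:
    shards.setdefault(shard_for(repo, 4), []).append(repo)
```

The routing is static: rebalancing or joining on a different key means reshuffling data, which is exactly the flexibility you give up relative to a query engine scanning a shared storage layer like GCS/S3.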

ryscheng commented 3 months ago

https://clickhouse.com/docs/en/guides/sizing-and-hardware-recommendations

"For use cases with long retention periods for your data or with high data volumes, we recommend a 1:100 to 1:130 memory to storage ratio. For example, 100GB of RAM per replica if you are storing 10TB of data."

After a call with Zeyad, I'm more confident that Clickhouse probably isn't the end-all-be-all; it's more for snappy OLAP queries (e.g. to power a front-end). It'll scale surprisingly well, but probably not to the degree of all of our raw datasets.

I think we should stick with BigQuery for our data pipeline for now, and definitely to store the datasets. If we need cheaper compute in the future, we can explore Spark or Trino or something like that. Clickhouse is still a great option to power a front-end in some way, especially if we can design a good intermediate model (e.g. a generalized event table) that's smaller and sufficiently multi-dimensional for exploration.
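A sketch of what a "generalized event table" row shape might look like; the field names here are illustrative assumptions, not OSO's actual schema. The idea is one narrow, multi-dimensional record that many front-end rollups can slice, so only this table (not the raw datasets) needs to live in ClickHouse:

```python
# Hypothetical generalized event table: one row shape for many event sources,
# with an additive measure so OLAP rollups reduce to sums and counts.
from dataclasses import dataclass
from datetime import date
from collections import Counter

@dataclass
class Event:
    time: date
    event_type: str    # e.g. "COMMIT", "ISSUE_OPENED" (illustrative)
    source: str        # e.g. "GITHUB"
    project: str
    contributor: str
    amount: float      # additive measure: simple to aggregate

events = [
    Event(date(2024, 5, 1), "COMMIT", "GITHUB", "oso", "alice", 1.0),
    Event(date(2024, 5, 2), "COMMIT", "GITHUB", "oso", "bob", 1.0),
    Event(date(2024, 5, 2), "ISSUE_OPENED", "GITHUB", "other", "alice", 1.0),
]

# The kind of snappy front-end rollup this model is meant to serve:
events_per_project = Counter(e.project for e in events)
```

Because every dimension is a plain column and the measure is additive, any slice (per project, per contributor, per month, per event type) is a cheap group-by, which is the workload ClickHouse is good at.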

Deprioritizing for now

ryscheng commented 3 days ago

FWIW, Clickhouse has GHArchive as a public dataset https://ghe.clickhouse.tech/

The website shows some pretty impressive benchmarks.

But keep in mind they also pre-process the original GHArchive dataset into a generic event table first, shown here: https://github.com/ClickHouse/github-explorer and https://ghe.clickhouse.tech/#download-the-dataset-directly-into-clickhouse