ryscheng opened this issue 3 months ago
Moving this up based on discussions today. Here's what we decided to experiment with for the time being:
Some trade-offs:
Gut feeling: CH is probably the better long-term solution for exploratory analysis, dynamic queries, data science, and data pipelines. We'd need to get sponsors to pay for data storage and set up user metering/billing for compute, which is more work compared to BQ, but cleaner in the long run. It may be the case that we only have access to intermediate models for this, and not all raw data.
For data sharing, I'm not aware of anything better than BigQuery for permissionless data export. You can reference any dataset you want from your own account and either run queries or export to wherever you want, without help from us.
Just to set some bounds, it's likely that we won't move 100% of everything to ClickHouse. Its only method for distributed queries is explicit sharding. It's not a flexible distributed query system with query optimization, so on the compute side you start to lose some of the benefits of putting all your storage in a flexible layer like GCS/S3. https://clickhouse.com/docs/en/engines/table-engines/special/distributed
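To make the trade-off concrete: "explicit sharding" means the row-to-shard assignment is fixed up front by a sharding key, rather than planned per query by an optimizer. A minimal sketch of that routing decision (shard count and key are hypothetical, not our actual setup):

```python
import hashlib

def shard_for(key: str, num_shards: int = 4) -> int:
    """Route a row to a shard by hashing its sharding key.

    A stable hash guarantees the same key always lands on the
    same shard -- which is exactly why cross-shard joins and
    rebalancing become the application's problem, not the
    query planner's.
    """
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Rows with the same key are co-located; everything else
# requires shuffling data between shards yourself.
rows = ["repo/a", "repo/b", "repo/a", "repo/c"]
placement = [(r, shard_for(r)) for r in rows]
```

This is the same idea as ClickHouse's `Distributed` table engine with a sharding key, in contrast to BigQuery, which plans distribution dynamically over storage it manages.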
BigQuery is much better for these wide data scans joining across large datasets from a performance perspective.
https://clickhouse.com/docs/en/guides/sizing-and-hardware-recommendations
"For use cases with long retention periods for your data or with high data volumes, we recommend a 1:100 to 1:130 memory to storage ratio. For example, 100GB of RAM per replica if you are storing 10TB of data."
After a call with Zeyad, I'm more confident that ClickHouse probably isn't the end-all-be-all; it's for snappy OLAP queries (e.g. to power a front-end). It'll scale surprisingly well, but probably not to the degree of all of our raw datasets.
I think we should stick with BigQuery for our data pipeline for now, and definitely to store the datasets. If we need cheaper compute in the future, we can explore Spark or Trino or something like that. ClickHouse is still a great option to power a front-end in some way, especially if we can design a good intermediate model (e.g. a generalized event table) that's smaller and sufficiently multi-dimensional for exploration.
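To sketch what a "generalized event table" could mean: project heterogeneous raw events onto one narrow, shared schema that's small enough to live in ClickHouse but still multi-dimensional enough to slice in a front-end. The field names below are hypothetical, not an agreed schema:

```python
def to_generic_event(raw: dict) -> dict:
    """Project a raw, source-specific event (here shaped like a
    GHArchive record) onto a hypothetical shared event schema."""
    return {
        "time": raw["created_at"],
        "event_type": raw["type"],                        # e.g. "PushEvent"
        "actor": raw.get("actor", {}).get("login"),
        "repo": raw.get("repo", {}).get("name"),
        # One generic numeric measure, defaulting to 1 per event.
        "amount": raw.get("payload", {}).get("size", 1),
    }

raw = {
    "created_at": "2015-01-01T00:00:00Z",
    "type": "PushEvent",
    "actor": {"login": "octocat"},
    "repo": {"name": "octo/repo"},
    "payload": {"size": 3},
}
event = to_generic_event(raw)
```

The win is that one table like this is orders of magnitude smaller than the raw data, and every query the front-end needs becomes a filter/group-by over the same few columns.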
Deprioritizing for now
FWIW, ClickHouse hosts GHArchive as a public dataset (https://ghe.clickhouse.tech/), and the website shows some pretty impressive benchmarks.
But keep in mind they also pre-process the original GHArchive dataset into a generic event table first, shown here: https://github.com/ClickHouse/github-explorer https://ghe.clickhouse.tech/#download-the-dataset-directly-into-clickhouse
Moving more queries to Clickhouse would be amazing. Here's a guide for the transition: https://clickhouse.com/blog/clickhouse-bigquery-migrating-data-for-realtime-queries
Open question: what data would be moved over there, and what queries would we be trying to support?