This is a meta issue for a few smaller issues. I wasn't sure how best to structure this, so it's all in a larger issue with the smaller issues attached. It provides a general overview of the design and the process to get us to a completed time series metrics system that runs continuously and efficiently.
Much of this is derived from the prototype work done in #2469.
Architecture
Background
Time series metrics are, in theory, very straightforward. The models that produce the metrics themselves are generally simple, but given the level of analysis we would like to achieve, computing them for many projects at a large scale (and continuously) has proven fairly difficult for some types of metrics. Metrics over a predefined rolling window have been particularly hard. For clarity, our definition of a rolling window metric is one that runs an aggregation over a specified time window at a regular interval. Our previous attempts with BigQuery resulted in queries that were too large to fit into BigQuery's memory, and they were exceedingly storage inefficient because every iteration of the window calculation creates additional rows, even for periods where the data would be best left out.
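To make the definition concrete, here is a minimal sketch of a rolling window metric using DuckDB from Python. The `events` table, its columns, and the 30-day window are hypothetical illustrations, not our actual models:

```python
import duckdb

con = duckdb.connect()

# Hypothetical source table: one row per (project, day) that had activity.
con.execute("""
    CREATE TABLE events AS
    SELECT * FROM (VALUES
        ('proj-a', DATE '2024-01-01', 3),
        ('proj-a', DATE '2024-01-05', 7),
        ('proj-b', DATE '2024-01-03', 2)
    ) AS t(project_id, bucket_day, amount)
""")

# A trailing 30-day sum evaluated at a daily interval: every project gets a
# row for every evaluation day, even days with no activity inside the window.
# This per-iteration row explosion is what made the BigQuery approach
# storage inefficient.
rows = con.execute("""
    WITH days AS (
        SELECT day::DATE AS day
        FROM range(DATE '2024-01-01', DATE '2024-02-01', INTERVAL 1 DAY) AS t(day)
    )
    SELECT d.day, p.project_id, COALESCE(SUM(e.amount), 0) AS amount_30d
    FROM days d
    CROSS JOIN (SELECT DISTINCT project_id FROM events) p
    LEFT JOIN events e
      ON e.project_id = p.project_id
     AND e.bucket_day > d.day - INTERVAL 30 DAY
     AND e.bucket_day <= d.day
    GROUP BY 1, 2
    ORDER BY 1, 2
""").fetchall()
```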
We've gone through a fairly large evolution in thinking on how to make this all work correctly, but we did attempt at least the following things:
Switching to SQLMesh for metrics calculation
While this was the correct decision, it was not, on its own, enough to solve the problem of executing these metrics.
Using ClickHouse
This also ended up being the wrong approach, as ClickHouse did not handle the way SQLMesh interacted with it, nor did it have any built-in mechanisms to deal with the rolling window calculation.
Using Trino
This worked better than ClickHouse (the process can actually complete), but the time required was far too long due to a variety of issues discussed later in this issue.
Using a local DuckDB
This works and can calculate the existing metrics (as of 2024-11-19) within ~12-13 hours for the entire backfill. It, however, isn't wired into our IaC setup and requires a lot of manual work to seed the local DuckDB with the correct data to run the calculations.
Clustered Dask + local DuckDB
This proved that, given enough systems working in parallel, we can calculate all rows for a given metric's backfill within minutes (see the sketch below). There were, however, issues; see #2469.
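A rough sketch of that approach, assuming each Dask worker already has a pre-warmed local DuckDB file containing the dependent tables. The function, query, file path, and scheduler address below are illustrative, not the actual prototype code from #2469:

```python
from dask.distributed import Client
import duckdb

def calculate_batch(db_path: str, metric_sql: str, start: str, end: str):
    """Run one metric query for a slice of the backfill against a local,
    pre-warmed DuckDB database and return the resulting rows."""
    con = duckdb.connect(db_path, read_only=True)
    return con.execute(metric_sql, [start, end]).fetchall()

# Hypothetical metric query with positional parameters for the batch bounds.
METRIC_SQL = """
    SELECT bucket_day, project_id, SUM(amount) AS amount
    FROM events_daily
    WHERE bucket_day >= ? AND bucket_day < ?
    GROUP BY 1, 2
"""

client = Client("tcp://dask-scheduler:8786")  # address is illustrative

# Fan the backfill out across workers, one future per date range.
batches = [("2024-01-01", "2024-02-01"), ("2024-02-01", "2024-03-01")]
futures = [
    client.submit(calculate_batch, "/data/metrics.duckdb", METRIC_SQL, s, e)
    for s, e in batches
]
results = client.gather(futures)
```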
After exploring these options we've arrived at the solution detailed in this issue, which combines some of the approaches above to provide a theoretically working solution for both the continuous calculation and deployment of these metrics with our SQLMesh setup.
Overview
As discussed in the background section, we are piecing together some of the researched options to provide an end-to-end experience that can execute quickly and continuously. The solution is to use SQLMesh, Iceberg, DuckDB, and Dask together as described in the following sequence diagram.
Sequence Diagram Components:
Dagster - Our Dagster infrastructure
SQLMesh - This is a running SQLMesh instance (only a single instance runs at once)
Metrics Frontend API - The frontend to a metrics calculation service.
Metrics Worker Cluster - A set of Dask-managed worker instances that use locally stored DuckDB databases to horizontally scale the metrics calculations. The local DuckDB instances contain dependent tables that are pre-warmed as described in the diagram (a sketch of this pre-warming step follows this list)
TrinoDB - This is our Trino infrastructure
Iceberg Data Lake (SDK) - This is just the interface to our cloud storage as an Iceberg data lake. Depending on the language used, this will have different performance characteristics
Cloud storage - This is our GCS bucket (or possibly some other cloud storage in the future)
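As an illustration of the pre-warming step mentioned for the worker cluster, here is a minimal sketch of how a worker might pull a dependent table out of the Iceberg data lake into its local DuckDB. The catalog URI, namespace, and table names are placeholders, and the use of pyiceberg is an assumption about the SDK choice:

```python
import duckdb
from pyiceberg.catalog import load_catalog

# Connect to the Iceberg catalog fronting our cloud storage (placeholder URI;
# the real configuration comes from the IaC setup).
catalog = load_catalog("default", uri="http://iceberg-rest:8181")

# Read a dependent table from the data lake as an Arrow table.
events = catalog.load_table("metrics.events_daily").scan().to_arrow()

# Materialize it into the worker's local DuckDB so metric queries run
# against fast local storage rather than the lake itself.
con = duckdb.connect("/data/metrics.duckdb")
con.register("events_arrow", events)
con.execute(
    "CREATE OR REPLACE TABLE events_daily AS SELECT * FROM events_arrow"
)
```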
Sequence diagram
Issues
#2470
#2472
#2473
#2474
#2479