opensource-observer / oso

Measuring the impact of open source software
https://opensource.observer
Apache License 2.0

SQLMesh Time series metrics #2471

Open ravenac95 opened 2 days ago

ravenac95 commented 2 days ago

This is a meta issue that groups a few smaller issues. I wasn't sure how best to structure this, so everything lives in one larger issue with the smaller issues attached. It gives a general overview of the design and of the process for getting us to a complete time series metrics system that runs consistently and efficiently.

Much of this is derived from the prototype work done in #2469.

Architecture

Background

Time series metrics are, in theory, very straightforward. The models that compute the metrics themselves are generally simple, but given the level of analysis we would like to achieve, computing them for many projects at a large scale (and continuously) has proven fairly difficult for some types of metrics. Metrics over a predefined rolling window have been especially difficult. For clarity, we define a rolling window metric as one that runs an aggregation over a specified time window at a regular interval. Our previous attempts with BigQuery resulted in queries that were too large to fit into memory and that were exceedingly storage inefficient: they created additional rows for data that would be best left out for some periods of time but that instead appeared at every iteration of the window calculation.
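To make the rolling window definition concrete, here is a minimal sketch in plain Python. It is not the implementation used in this repo: the event tuples, the `rolling_active_developers` name, the "distinct developers" aggregation, and the 3-day window are all hypothetical and chosen only for illustration. It evaluates a trailing window at a daily interval and emits a row only when a project actually has events inside the window.

```python
from datetime import date, timedelta

# Hypothetical events: (event_date, project, developer). Not real OSO data.
EVENTS = [
    (date(2024, 1, 1), "proj-a", "alice"),
    (date(2024, 1, 2), "proj-a", "bob"),
    (date(2024, 1, 2), "proj-a", "alice"),
    (date(2024, 1, 5), "proj-b", "carol"),
]


def rolling_active_developers(events, start, end, window_days, step_days=1):
    """Count distinct developers per project over a trailing window.

    The aggregation is evaluated every `step_days` from `start` to `end`
    (inclusive). Each evaluation covers the `window_days` days ending on the
    evaluation date. Only non-empty (date, project) pairs produce a row.
    """
    rows = []
    current = start
    while current <= end:
        window_start = current - timedelta(days=window_days - 1)
        developers_by_project = {}
        for event_date, project, developer in events:
            if window_start <= event_date <= current:
                developers_by_project.setdefault(project, set()).add(developer)
        for project in sorted(developers_by_project):
            rows.append((current, project, len(developers_by_project[project])))
        current += timedelta(days=step_days)
    return rows


ROWS = rolling_active_developers(
    EVENTS, date(2024, 1, 1), date(2024, 1, 7), window_days=3
)
for row in ROWS:
    print(row)
```

With this toy data the sketch produces 7 rows, whereas a dense grid of every date for every project would hold 14. Note that the loop rescans all events for every evaluation date, which is fine for a toy example but hints at why the same calculation gets expensive across many projects and long histories.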

We've gone through a fairly large evolution in thinking on how to make this all work correctly, but we did attempt at least the following things:

After exploring these options, we've arrived at the solution detailed in this issue, which combines some of the explored solutions above to provide a theoretically working solution for both the continuous calculation and the deployment of these metrics with our sqlmesh setup.

Overview

As discussed in the background section, we are piecing together some of the researched options to provide an end-to-end experience that can execute quickly and continuously. The solution is to use sqlmesh, iceberg, duckdb, and dask together, as described in the following sequence diagram.

Sequence Diagram Components:

Sequence diagram

Issues