opensource-observer / oso

Measuring the impact of open source software
https://opensource.observer
Apache License 2.0
73 stars 16 forks

Custom sqlmesh materialization for timeseries metrics on trino #2473

Open ravenac95 opened 2 days ago

ravenac95 commented 2 days ago

What is it?

To support the Metrics Calculation Service's workflow of writing its calculated metrics directly to GCS, we need a different sqlmesh materialization than the one currently available for trino.

During the factory's model generation, if the current engine is “trino”, we will use this new custom materialization. The custom materialization follows this general process:

  1. Receives a generated sql query from the model
  2. Generates a temporary destination path in gcs to receive parquet files
    • This should include a date and be something like {bucket}/{some_static_temp_path}/{date}/{random_id}. This will ensure that we can clean any files that may be left over from runs that don’t complete successfully.
  3. Submits the query to the Metrics Calculation Service
  4. Polls every 5 seconds until the calculation job has completed
  5. Triggers a trino query to import the parquet files into iceberg storage
  6. Deletes the temporary destination path in gcs
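
The steps above can be sketched roughly as follows. This is only an illustration of the control flow, not the actual implementation: `make_temp_path`, `run_materialization`, and the injected callables (`submit_job`, `is_job_done`, `import_parquet`, `delete_path`) are hypothetical names standing in for the Metrics Calculation Service client, the trino import query, and the GCS cleanup. Failure handling and timeouts are deliberately left out.

```python
import time
import uuid
from datetime import date, datetime, timezone
from typing import Callable, Optional


def make_temp_path(
    bucket: str,
    static_temp_path: str,
    run_date: Optional[date] = None,
    random_id: Optional[str] = None,
) -> str:
    """Build {bucket}/{some_static_temp_path}/{date}/{random_id} (step 2)."""
    run_date = run_date or datetime.now(timezone.utc).date()
    random_id = random_id or uuid.uuid4().hex
    # The date segment lets a separate cleanup job find leftovers from failed runs.
    return f"{bucket}/{static_temp_path}/{run_date.isoformat()}/{random_id}"


def run_materialization(
    query: str,
    temp_path: str,
    submit_job: Callable[[str, str], str],   # hypothetical: returns a job id
    is_job_done: Callable[[str], bool],      # hypothetical: job status check
    import_parquet: Callable[[str], None],   # hypothetical: trino import into iceberg
    delete_path: Callable[[str], None],      # hypothetical: GCS cleanup
    poll_interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    # Step 3: hand the generated query and destination to the calculation service.
    job_id = submit_job(query, temp_path)
    # Step 4: poll every `poll_interval` seconds until the job reports completion.
    while not is_job_done(job_id):
        sleep(poll_interval)
    # Step 5: import the parquet files into iceberg storage via trino.
    import_parquet(temp_path)
    # Step 6: remove the temporary files once the import has succeeded.
    delete_path(temp_path)
```

Passing the service calls in as callables keeps the sketch independent of any particular client library and makes the polling loop easy to test with fakes.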

NOTE: If using duckdb, we should just use the current workflow, as it works without issue and doesn't require the Metrics Calculation Service.
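
The engine check could look something like the sketch below. The function name, the materialization name, and the dict shape are all assumptions made for illustration; the real model kind would be whatever the factory already produces for the current workflow.

```python
from typing import Any, Dict

# Hypothetical name for the new custom materialization.
CUSTOM_MATERIALIZATION_NAME = "trino_metrics_gcs"


def select_model_kind(engine: str, current_kind: Dict[str, Any]) -> Dict[str, Any]:
    """Pick the model kind for a generated metrics model based on the engine."""
    if engine.lower() == "trino":
        # trino: route through the custom materialization described above.
        return {"name": "CUSTOM", "materialization": CUSTOM_MATERIALIZATION_NAME}
    # duckdb (and anything else): keep the current workflow unchanged.
    return current_kind
```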