opensource-observer / oso

Measuring the impact of open source software
https://opensource.observer
Apache License 2.0

Key metrics model in sqlmesh #2268

Open ryscheng opened 1 week ago

ryscheng commented 1 week ago

What is it?

Currently we have the legacy code_metrics_by_project and onchain_metrics_by_project models, computed in dbt.

With the new metrics factory on sqlmesh, we should create new models. Here's my proposal: We just create 3 marts:

The schema for each of these should look like this:

This is not a time series metric. Instead, each metric_id has a custom way to aggregate all data. For example, for commits it might be a sum; for first_commit, it might just grab the earliest date.
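The per-metric aggregation could be sketched like this. This is a toy illustration using sqlite3, with hypothetical table and column names (the real marts live in the warehouse, not SQLite); the point is only that each metric_id dispatches to its own aggregate function:

```python
import sqlite3

# Toy timeseries table; names are illustrative, not the real OSO schema.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE events (project_id TEXT, metric_id TEXT, sample_date TEXT, amount REAL);
    INSERT INTO events VALUES
        ('uniswap', 'commits', '2024-01-01', 10),
        ('uniswap', 'commits', '2024-02-01', 5),
        ('uniswap', 'first_commit', '2018-03-07', 1),
        ('uniswap', 'first_commit', '2019-06-10', 1);
""")

# Each metric_id gets its own aggregation:
# SUM(amount) for commits, MIN(sample_date) for first_commit.
rows = con.execute("""
    SELECT metric_id,
           CASE metric_id
               WHEN 'commits'      THEN SUM(amount)
               WHEN 'first_commit' THEN MIN(sample_date)
           END AS amount
    FROM events
    WHERE project_id = 'uniswap'
    GROUP BY metric_id
    ORDER BY metric_id
""").fetchall()
print(rows)  # [('commits', 15.0), ('first_commit', '2018-03-07')]
```

In a real mart the dispatch table (metric_id → aggregation) would presumably live alongside the metric definitions rather than in a hard-coded CASE expression.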

This can power the new MetricPage (https://github.com/opensource-observer/oso/issues/27) as well as the table in the CollectionPage.

After this exists, we can deprecate the older models in dbt.

ccerv1 commented 6 days ago

Yes! I call these "point-in-time" metrics in #2275.

We should also consider adding a sample_date field that represents the last date in the relevant event data.

For example, from this query we can see a lot of issues with the time intervals currently used in our models:

select
    m.metric_name,
    max(t.amount) as max_amount,
    min(t.sample_date) as first_date,
    max(t.sample_date) as last_date
from metrics.timeseries_metrics_by_project_v0 as t
join metrics.metrics_v0 as m
    on t.metric_id = m.metric_id
join default.projects_v1 as p
    on t.project_id = p.project_id
where p.project_name = 'uniswap'
group by 1
order by 4

[screenshot: query results]

ryscheng commented 6 days ago

I'm suggesting a new model because we might want to pre-compute aggregates (total number of contributors, total number of commits, total number of stars, etc.). This is different from "point-in-time" metrics because we want to aggregate over all time.

But yes, I agree we should store the date that we compute the aggregate on.

ryscheng commented 3 days ago

Summary from the conversation today.

The plan is to implement this as a sqlmesh model written by hand, outside of the metrics factory. It should just be a variation of the related timeseries metrics models.
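A hand-written model along these lines might look like the following. This is a rough sketch, not the actual implementation: the model name, `kind FULL`, the source table, and the use of SUM as a stand-in for the per-metric aggregation are all assumptions.

```sql
-- Hypothetical sketch of a hand-written sqlmesh model.
MODEL (
  name metrics.key_metrics_by_project_v0,  -- assumed name
  kind FULL                                -- recompute the all-time aggregate each run
);

SELECT
  project_id,
  metric_id,
  MAX(sample_date) AS sample_date,  -- last date in the underlying event data
  SUM(amount) AS amount             -- placeholder: real logic varies per metric_id
FROM metrics.timeseries_metrics_by_project_v0
GROUP BY project_id, metric_id
```

The `sample_date` column follows the suggestion above of recording the last date covered by the relevant event data.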

In the future we can file issues for generating this automatically from the metrics factory, but it's probably not worth it yet.