mozilla / metric-hub

Central hub for metric definitions that are considered the source of truth
https://mozilla.github.io/metric-hub
Mozilla Public License 2.0
13 stars 13 forks source link

Clarify source naming; update link to generated SQL #483

Closed fbertsch closed 3 months ago

fbertsch commented 3 months ago

We were using the source name active_users_aggregates_v1, which actually pointed to a view. This now clarifies that we're pointing to the view, and filtering to just desktop.

Also updated the link in the metric definition to point to the generated source query.

scholtzan commented 3 months ago

This file seems to reference that data source as well: https://github.com/mozilla/metric-hub/blob/a5da1569fa8f5df56c4b6c8e52e173232aedf1f5/looker/definitions/firefox_desktop.toml#L88

bochocki commented 3 months ago

Looks reasonable to me so I'll approve.

One thing I was planning to tweak in the near future is probably removing the second paragraph of a lot of those descriptions (i.e., the ones starting "For more information..." and then linking to confluence and then listing email addresses) since we already link to Confluence in the first part of every description edited here (I think) and if you use the owner field then those display nicely in the right hand sidebar of DataHub. Not sure if it's easier/cleaner to remove them now or I can do it later in a separate PR with some other tweaks.

For my own knowledge, where do the looker/definitions show up? Are they displayed or viewable in Looker somewhere? Or are they just used to reference metric definitions from Looker?

It's a good idea to make these descriptions more succinct. Since it sounds like these are mostly proposed changes to documentation and less likely to introduce breaking changes, let's follow-up on them afterwards.

jmsilverman commented 3 months ago

@bochocki with Frank out for a couple weeks do you think we're good to merge this? Or should we wait for @scholtzan to respond to Frank's comment from a few days ago?

bochocki commented 3 months ago

I think maybe it could help to wait for @scholtzan.

bochocki commented 3 months ago

I'm going to go ahead and merge; I wrote a short script for a recent PR that aims to validate specific Metric Hub definitions. I wrote a similar script for this case (below). I verified that it runs before the change, and will also check it after the merge.

The script doesn't test the looker and opmon directories, but those are less critical paths and the changes are all similar. We can also check the OpMon dashboards tomorrow to make sure they're still running as expected. I think those dashboards are:

!pip install mozanalysis -qq
import pandas as pd

from google.cloud import bigquery
from google.colab import auth
from mozanalysis.config import ConfigLoader
from textwrap import dedent

auth.authenticate_user()

def fetch_metric_hub_definition(start_date, end_date, metric_slug, app_name, bq_project):
    start_date = pd.to_datetime(start_date).date()
    end_date = pd.to_datetime(end_date).date()

    # Set useful attributes based on the Metric Hub definition
    metric = ConfigLoader.get_metric(metric_slug=metric_slug, app_name=app_name)
    submission_date_column = metric.data_source.submission_date_column

    # Modify the metric source table string so that it formats nicely in the query.
    from_expression = metric.data_source._from_expr.replace("\n", "\n" + " " * 15)

    query = dedent(
        f"""
        SELECT {submission_date_column},
               {metric.select_expr} AS value
          FROM {from_expression}
         WHERE {submission_date_column} BETWEEN '{start_date}' AND '{end_date}'
         GROUP BY {submission_date_column}
        """
    )

    print(
        "\n",
        "\n" + "-" * 110,
        "\n" + f" Querying for '{app_name}.{metric_slug}' ".center(110, "-"),
        f"\n> data_source: {metric.data_source.name}"
        f"\n{query}"
    )
    df = bigquery.Client(project=bq_project).query(query).to_dataframe()

    # ensure submission_date has type 'date'
    df[submission_date_column] = pd.to_datetime(df[submission_date_column]).dt.date

    return df
definitions = {
    "firefox_desktop": ["daily_active_users_v2", "desktop_dau_kpi_v2"],
    "multi_product": ["mobile_daily_active_users_v1", "mobile_dau_kpi_v1"],
    "fenix": ["daily_active_users_v2"],
    "firefox_ios": ["daily_active_users_v2"],
    "focus_android": ["daily_active_users_v2"],
    "focus_ios": ["daily_active_users_v2"],
}

for app_name, metric_slugs in definitions.items():
    for metric_slug in metric_slugs:
        df = fetch_metric_hub_definition(
            start_date="2024-03-14",
            end_date="2024-03-21",
            metric_slug=metric_slug,
            app_name=app_name,
            bq_project="moz-fx-data-bq-data-science"
        )
        display(df.set_index(df.columns[0]).transpose())
bochocki commented 3 months ago

I'm going to go ahead and merge; I wrote https://github.com/mozilla/metric-hub/pull/523#issuecomment-2237524578 that aims to validate specific Metric Hub definitions. I wrote a similar script for this case (below). I verified that it runs before the change, and will also check it after the merge.

This appears to work as expected; all of the definitions can still be queried and are using the updated data sources.

bochocki commented 3 months ago

I also triggered the Airflow KPI forecasting task, which is probably the highest-profile asset impacted by these changes. The task completed as expected.

bochocki commented 2 months ago

The OpMon dashboards are running the day after this merge. Seems like everything works as expected.