gouline / dbt-metabase

dbt + Metabase integration
https://pypi.org/project/dbt-metabase/
MIT License

Setting Metabase Metrics using dbt properties #25

Closed remigabillet closed 6 months ago

remigabillet commented 3 years ago

Metabase Metrics (and in particular custom aggregations) are very useful to our end users. I'm finding myself spending a lot of time in Metabase managing Metrics; the UI is very inconvenient for this.

I'd like to be able to define a list of Metabase Metrics on dbt models as meta properties in the schema files. I wonder if this fits this project's scope. The primary challenge is parsing the Metabase expressions, which are defined as strings (e.g. Distinct(case([step] = "hired", [candidate_user_id]))), into trees that the API can handle. I think it's easiest to maintain if I compile the JS parser from the Metabase source code into a small Node script that can be committed and called from this project.

This would add a good amount of complexity so it's probably beyond the scope of the project. I thought I would share for the sake of discussion.

z3z1ma commented 3 years ago
export function parseOperators(operands, operators) {
  let initial = operands[0];
  const stack = [];
  for (let i = 1; i < operands.length; i++) {
    const operator = operators[i - 1];
    const operand = operands[i];
    const top = stack[stack.length - 1];
    if (top) {
      if (top[0] === operator) {
        top.push(operand);
      } else {
        const a = OPERATOR_PRECEDENCE[operator];
        const b = OPERATOR_PRECEDENCE[top[0]];
        if (b < a) {
          top[top.length - 1] = [operator, top[top.length - 1], operand];
          stack.push(top[top.length - 1]);
        } else {
          do {
            stack.pop();
          } while (
            stack.length > 0 &&
            OPERATOR_PRECEDENCE[stack[stack.length - 1][0]] > a
          );
          if (stack.length > 0) {
            stack[stack.length - 1].push(operand);
          } else {
            initial = [operator, initial, operand];
            stack.push(initial);
          }
        }
      }
    } else {
      initial = [operator, initial, operand];
      stack.push(initial);
    }
  }
  return initial;
}

export const OPERATOR_PRECEDENCE = {
  not: 17,
  "*": 15,
  "/": 15,
  "+": 14,
  "-": 14,
  and: 6,
  or: 5,
};

So, looking at the above compiler, let's look at the output we are aiming for. Here we can introspect an existing metric I have created.

[
  {
    "description": "Unit uptime is understood as count of communicating non-impacted panels over communicating panels.",
    "archived": false,
    "table_id": 53,
    "definition": {
      "source-table": 53,
      "aggregation": [
        [
          "aggregation-options",
          [
            "/",
            [
              "sum",
              [
                "field",
                1674,
                null
              ]
            ],
            [
              "sum",
              [
                "field",
                1676,
                null
              ]
            ]
          ],
          {
            "display-name": "kpi_unit_uptime"
          }
        ]
      ]
    },
    "creator": {
      "email": "alex@...",
      "first_name": "Alex",
      "last_login": "...",
      "is_qbnewb": false,
      "is_superuser": true,
      "id": 2,
      "last_name": "...",
      "date_joined": "...",
      "common_name": "..."
    },
    "database_id": 2,
    "show_in_getting_started": false,
    "name": "Unit Uptime",
    "caveats": null,
    "creator_id": 2,
    "updated_at": "...",
    "query_description": {
      "table": "fct_core_unit_uptime_stats",
      "aggregation": [
        {
          "type": "aggregation",
          "arg": "kpi_unit_uptime"
        }
      ]
    },
    "id": 9,
    "how_is_this_calculated": null,
    "created_at": "...",
    "points_of_interest": null
  }
]

With a little deep diving, I am pretty confident we can compile the expected input without needing to use their exact source code. The above code can be translated one-to-one to Python pretty easily.
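
For illustration, a rough Python sketch of that one-to-one translation (my own port, not the project's code; it reuses the OPERATOR_PRECEDENCE table from above):

OPERATOR_PRECEDENCE = {
    "not": 17,
    "*": 15,
    "/": 15,
    "+": 14,
    "-": 14,
    "and": 6,
    "or": 5,
}

def parse_operators(operands, operators):
    """Fold flat operand/operator lists into a nested tree, honoring precedence."""
    initial = operands[0]
    stack = []
    for i in range(1, len(operands)):
        operator = operators[i - 1]
        operand = operands[i]
        top = stack[-1] if stack else None
        if top:
            if top[0] == operator:
                top.append(operand)
            else:
                a = OPERATOR_PRECEDENCE[operator]
                b = OPERATOR_PRECEDENCE[top[0]]
                if b < a:
                    top[-1] = [operator, top[-1], operand]
                    stack.append(top[-1])
                else:
                    stack.pop()
                    while stack and OPERATOR_PRECEDENCE[stack[-1][0]] > a:
                        stack.pop()
                    if stack:
                        stack[-1].append(operand)
                    else:
                        initial = [operator, initial, operand]
                        stack.append(initial)
        else:
            initial = [operator, initial, operand]
            stack.append(initial)
    return initial

# parse_operators(["a", "b", "c"], ["+", "*"]) -> ["+", "a", ["*", "b", "c"]]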

The challenge is translating the YAML meaningfully to the expected structure whilst keeping it simple and readable for the end user. Kind of spitballing below, but it can get a little complicated if we don't approach it from a readability-first perspective. So I guess my end opinion is that the compilation in Python isn't as much the constraint in my eyes as is the format for defining these in the YAML.

models:
- name: my_model
  # Model level meta tag for cross column metrics?
  meta:
    metrics:
    - name: kpi_profit
      aggregation:
        operands:
        # maybe operand[0] is a nested operation
        - operands:
          - price
          - quantity
          operators:
          - "*"
        # operand[0] _could_ just be constant / field name too?
        # - order_value
        operators:
        - sum
  columns:
  - name: price
    meta:
      metrics:
      - name: most_expensive
        aggregation: 
          type: max
      - name: inflated_price_sum
        aggregation:
          operands:
          - "self"
          - 2.54
          operators:
          - "*"
          type: sum

remigabillet commented 3 years ago

@z3z1ma Thanks for looking into this.

I think the simple column-level aggregations are easy to generate directly in the UI (in the Simple Question sidebar), so I wouldn't worry too much about them.

The main problem I would like to solve is implementing Metrics across multiple fields, such as the Distinct(case([step] = "hired", [candidate_user_id])) expression above.

I think it would be ideal if these expression strings were copy/pastable from the Metabase UI, so I would add them directly as strings in dbt's properties. For example:

models:
- name: my_model
  # Model level meta tag for cross column metrics?
  meta:
    metrics:
    - name: distinct_hired_users
      aggregation: 'Distinct(case([step] = "hired", [candidate_user_id]))'
    - name: pct_hired_users
      aggregation: 'Distinct(case([step] = "hired", [candidate_user_id])) / Distinct([candidate_user_id])'

The compiler you sent seems easy enough to convert to Python, but if we try to parse string expressions we would need to implement all of this: https://github.com/metabase/metabase/blob/master/frontend/src/metabase/lib/expressions/parser.js and its dependencies. We would need the lexer, parser, and grammar. It seems like too much work, right?

z3z1ma commented 3 years ago

https://github.com/metriql/metriql/issues/13

Hey @remigabillet. Let me know what you think of this project (not sure if you have seen it). They interface with BI tools as a REST service using Presto/Trino-configured integrations. The con is the need for an additional service, and it's still in its early stages. Interesting though.

Either way, I hear what you're saying on the string parsing. Convenient and cool, but it would definitely take an external dependency like Metabase's parser. Outside of that, my approach isn't as copy-paste easy since you define metrics in a different way; however, if you only need to define them in one place and we back out the expected JSON for API consumption on Metabase's side, then it shouldn't matter, provided it's still intuitive relative to the general flavor of dbt YAML files.

remigabillet commented 3 years ago

I agree with you that declaring metrics in a tree format in line with the Metabase API would definitely be a big step in the right direction. Next we would have to resolve the field references before hitting the Metabase API with updates.

Anyway, I tried packaging the Metabase expression parser into a JS package executable with Node, but after many hours I don't think it's going to work. I got stuck on browser dependencies that would need to be mocked, so I'll pass on that approach.

metriql is very cool. They point to a few really interesting articles in their docs, highlighting how critical it is to solve "Metrics" at the edges. I love it.

Another approach is to let metriql lead this work, but it would be in their project. They already have a way to define metrics in dbt, and a parser in Kotlin. Next they would need to pull the Metabase schema info like we do here, remap fields, and convert their metrics into the Metabase format.

Separately, I'm still curious to explore building a Python parser for Metabase expressions. Maybe it's not as hard as it seems 🤔
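
For a rough sense of scale, here is a deliberately minimal sketch (my own, nowhere near the real Metabase grammar, and without operator precedence) of a tokenizer plus recursive-descent parser that turns expressions like the one above into a nested, MBQL-ish list:

import re

TOKEN_RE = re.compile(
    r"""\s*(?:
        (?P<field>\[[^\]]+\])         |  # [field name]
        (?P<string>"[^"]*")           |  # "string literal"
        (?P<number>\d+(?:\.\d+)?)     |  # numeric literal
        (?P<name>[A-Za-z_]\w*)        |  # function name / identifier
        (?P<op>[-+*/=])               |  # binary operators
        (?P<lparen>\()                |
        (?P<rparen>\))                |
        (?P<comma>,)
    )""",
    re.VERBOSE,
)

def tokenize(text):
    text = text.rstrip()
    pos, tokens = 0, []
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if not m:
            raise ValueError(f"unexpected character at {pos}: {text[pos]!r}")
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens

def parse(tokens):
    # builds nested lists; field references are left as names to be resolved later
    def expr(i):
        node, i = atom(i)
        while i < len(tokens) and tokens[i][0] == "op":  # left-associative, no precedence
            rhs, j = atom(i + 1)
            node, i = [tokens[i][1], node, rhs], j
        return node, i

    def atom(i):
        kind, value = tokens[i]
        if kind == "name" and i + 1 < len(tokens) and tokens[i + 1][0] == "lparen":
            args, i = arglist(i + 2)
            return [value.lower()] + args, i
        if kind == "field":
            return ["field", value[1:-1], None], i + 1
        if kind == "string":
            return value.strip('"'), i + 1
        if kind in ("number", "name"):
            return value, i + 1
        if kind == "lparen":
            node, i = expr(i + 1)
            return node, i + 1  # skip the closing paren
        raise ValueError(f"unexpected token {tokens[i]}")

    def arglist(i):
        args = []
        while tokens[i][0] != "rparen":
            node, i = expr(i)
            args.append(node)
            if tokens[i][0] == "comma":
                i += 1
        return args, i + 1

    return expr(0)[0]

# parse(tokenize('Distinct(case([step] = "hired", [candidate_user_id]))'))
# -> ['distinct', ['case', ['=', ['field', 'step', None], 'hired'],
#                          ['field', 'candidate_user_id', None]]]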

buremba commented 3 years ago

Thanks for mentioning metriql, @z3z1ma!

We're working extensively on integrations with BI tools at the moment, and Metabase is on our short-term radar. However, rather than embedding native Metabase expressions, we want to use Jinja just like you already do in dbt:

models:
  - name: orders
    meta:
      metriql:
        measures:
          total_rows:
            aggregation: count
          custom_metric:
            sql: "{{measure.total_rows}} * 2"
          custom_metric2:
            dimension: country
            aggregation: count_unique

We actually use Python for the other BI tool integrations (Tableau, Looker); my plan for Metabase is to parse these Jinja expressions in Python and generate Metabase expressions. Here are the benefits of this approach:

  1. The metrics will be easy to read/write as we're already familiar with Jinja.
  2. There won't be any hard dependency on the Metabase source code.
  3. Jinja also exposes its parser for low-level operations, so it's easy (maybe not trivial) to build such a transpiler.
  4. You will be able to re-use dimensions and inject your dbt variables into the context for modularity.

If we manage to do that, the parser can also be used in dbt-metabase without metriql as it will just be a Python module. What do you think?
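
For what it's worth, a tiny illustrative sketch (my own, not metriql's actual implementation) of how jinja2's own parser and rendering could be used to find and expand measure references:

from jinja2 import Environment, meta

env = Environment()

def referenced_names(expression):
    """Return the top-level names a measure expression refers to, e.g. {'measure'}."""
    return meta.find_undeclared_variables(env.parse(expression))

def expand(expression, measures):
    """Render '{{measure.total_rows}} * 2' by substituting already-compiled measures."""
    return env.from_string(expression).render(measure=measures)

# expand("{{measure.total_rows}} * 2", {"total_rows": "Count"})  -> "Count * 2"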

buremba commented 3 years ago

Also, I agree that maintaining another service might be an overhead, but metriql tries to solve the following problems in addition to the metric definitions:

Join relations

While Metabase supports JOINs in data models, it's pretty limited. Metriql has MQL, a subset of Trino's SQL dialect. It exposes the semantic models to Metabase through the Trino interface as denormalized tables & columns and generates the JOINs in the engine.

Performance

If your tables are huge, the performance of ad-hoc queries in Metabase suffers. Looker has Aggregate Awareness to address this problem, and others such as Tableau & PowerBI pull the data into their system. In case your dbt models & sources are huge, metriql can automatically create roll-up tables inside your dbt project as dbt models and make use of them when Metabase queries the data via Aggregates.

Central metric definitions

If you have other data tools in your stack, you may want to use the data models inside those tools as well. Some of those tools (Jupyter notebooks, CLIs, etc.) don't have a UI; instead, they need an API (similar to Airbnb's Minerva) to query the data.

If you're heavily on Metabase and use dbt models extensively, you might not need these features, so I want to keep the BI tool integrations separate from metriql if possible. I would love to contribute to dbt-metabase, as I will also be using it in metriql for updating the column types via the Metabase API.

fabrice-etanchaud commented 3 years ago

Hi !

Concerning metrics, there is another dbt-related project called Lightdash that aims to provide dataviz directly on top of dbt's models. Lightdash reads dbt schemas' meta to infer measures, dimensions and metrics, and proposes a way to define ad hoc metrics directly in dbt schemas. I thought it worth mentioning here. Thank you for providing us with this dbt/Metabase bridge!

z3z1ma commented 3 years ago

@buremba

Thanks for the detailed reply!

my plan with Metabase is to parse these Jinja expressions in Python and generate Metabase expressions.

Can you elaborate further on what you mean here? What are Metabase expressions in this context? Do you mean to parse the above into a string like Distinct(case([step] = "hired", [candidate_user_id]))? If that is the case, there is still no way to consume that. If you had something else in mind, I'm very interested.

My original thought was to use the Metric feature in Metabase, since the metrics become available intuitively through the front end (see below) both to data analysts composing views for the company and to end users performing self-service analytics.

[screenshot attached]

The only hurdle to the implementation is compiling the expected JSON. Example: ["sum", [column_1, column_id, null]], or division of two aggregates: ["/", ["sum", [column_1, column_id, null]], ["sum", [column_2, column_id, null]]].

These are simple examples (which in all likelihood cover the 80% use case, since our users are using dbt to output nice clean models), but in a perfect world we would include Metabase expressions like case. Aggregates like Distinct/Average/Count/Sum are all very straightforward. I haven't looked into non-aggregates like case yet and what their JSON looks like.

It's really not too bad to compile the above once you analyze it; I'm just wrapping my head around how we can define these intuitively in dbt YAML files. The dev cycle for the above could actually be pretty short if we find direction on acceptable meta tag YAML definitions. That being said, I'm open to suggestions, pros, and cons.
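
As a rough sketch of that compilation step (my own illustration using the MBQL shapes from the introspected metric earlier in the thread; field ids are assumed to already be resolved):

def aggregation_clause(kind, field_id=None):
    """Emit a simple aggregate clause, e.g. ["sum", ["field", 1674, None]]."""
    clause = [kind]
    if field_id is not None:
        clause.append(["field", field_id, None])
    return clause

def named(clause, display_name):
    """Wrap a clause with a display name, the way the introspected metric stores it."""
    return ["aggregation-options", clause, {"display-name": display_name}]

# the kpi_unit_uptime example from earlier in the thread:
# named(["/", aggregation_clause("sum", 1674), aggregation_clause("sum", 1676)], "kpi_unit_uptime")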

Very interested in what you have in mind though.

remigabillet commented 3 years ago

@buremba what you're proposing sounds amazing.

If I understand correctly, you would first build a metriql-metabase Python module which would parse metriql measures as defined here and convert them into Metabase expressions that are compatible with the Metabase API?

https://www.metabase.com/docs/latest/users-guide/expressions.html https://www.metabase.com/docs/latest/api-documentation.html#post-apimetric

As part of that Python module, do you also plan on converting field & table references into their Metabase internal IDs? That would require pulling the schema from Metabase (like this project does).
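
For reference, the lookup step could look roughly like this (a sketch against the public Metabase API; the host/session_id variables are placeholders and the exact payload fields may differ between Metabase versions):

import requests

def field_id_lookup(host, session_id, database_id):
    """Map (table name, column name) -> Metabase field id using the schema endpoint."""
    resp = requests.get(
        f"{host}/api/database/{database_id}/metadata",
        headers={"X-Metabase-Session": session_id},
        timeout=30,
    )
    resp.raise_for_status()
    lookup = {}
    for table in resp.json().get("tables", []):
        for field in table.get("fields", []):
            lookup[(table["name"].lower(), field["name"].lower())] = field["id"]
    return lookup

# ids = field_id_lookup("https://metabase.example.com", session_id, database_id=2)
# ids[("fct_core_unit_uptime_stats", "some_column")]  # -> a numeric field id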

remigabillet commented 3 years ago

@buremba I feel that your work on metriql is super relevant to this issue. Do you mind giving us an update and letting me know if I'm understanding it correctly?

z3z1ma commented 3 years ago

@remigabillet

I think I am getting somewhere. Perhaps we can transcribe what you envisioned. Will keep building this out.

[screenshot attached]

remigabillet commented 3 years ago

Very cool @z3z1ma! So you actually started writing a parser! Very cool!

Do you want to open a draft PR? I'm happy to have a look, play with it, and maybe contribute.

z3z1ma commented 3 years ago

Ok, after a nice chunk of work, I think I have the "money shot" here @remigabillet @gouline. I can open a draft PR later today to start working through and testing further, as well as allowing introspection into the code.

[screenshot attached]

buremba commented 3 years ago

Sorry for the delay, I was away from work for a 10-day trip with limited internet access.

@z3z1ma:

Can you elaborate further on what you mean here? What are Metabase expressions in this context? Do you mean to parse the above into a string like Distinct(case([step] = "hired", [candidate_user_id]))? If that is the case, there is still no way to consume that. If you had something else in mind, I'm very interested.

In this context, Metabase expressions are the expression syntax the Metabase API can understand. metriql requires either a column or a sql field to be set for measures; aggregation is required except when it's a non-aggregate measure using sql. There is also an additional filters property inside the measure definitions so you don't need to write CASE expressions. Here are a few examples:

measures:
  total_rows:
    aggregation: count
  filtered_measure:
    aggregation: count_unique
    column: candidate_user_id
    filters:
      - {column: step, operator: equals, value: 'hired'}
  non_aggregate_measure:
    sql: '{{measure.filtered_measure}} / 2 * 10'

For the metric definitions, we need to know the aggregation type and the aggregation filters separately in metriql, as seen above. The Python project that we will develop needs to convert the expressions above into native Metabase expressions and synchronize them via the API (a rough sketch of the filter translation follows the examples below).

For example:

  1. The total_rows measure will be compiled as:

    {
      "source-table": TABLE_ID,
      "aggregation": [
        [
          "aggregation-options",
            [
              "count"
            ]
        ]
      ]
    }

    The JSON blob above will be used as part of the PUT METABASE_URL/api/metric HTTP request.

  2. The filtered_measure above will be compiled to @z3z1ma's Distinct(case([step] = "hired", [candidate_user_id])) example using native Metabase filters as follows:

    {
      "source-table": TABLE_ID,
      "aggregation": [
        [
          "aggregation-options",
            [
              "distinct",
              [
                "field-id",
                CANDIDATE_USER_COLUMN_ID
              ]
            ]
        ]
      ],
      "filter": [
        "=",
        [
          "field-id",
          STEP_COLUMN_ID
        ],
        "hired"
      ]
    }
  3. non_aggregate_measure might be a bit hard to implement, since metriql allows raw SQL expressions whereas Metabase has its own syntax. Luckily, as @z3z1ma mentioned, cases 1 and 2 cover 80% of the use-cases. Personally, I'm in favor of SQL rather than Metabase's expressions, but it won't be possible to cover all the use-cases in dbt-metabase without metriql. For the record, we expose non-aggregate measures as database columns in metriql, which bypasses the BI-tool-specific expressions. Luckily, @z3z1ma's work can help here. Rather than raw SQL expressions, we can use Metabase expressions and use @z3z1ma's parser to build them up. I was thinking of using the sqlparse library and the Jinja parser to generate the Metabase expression, in case you want to stick to the SQL approach. Metabase expressions are a limited subset of SQL with some minor changes, such as field references. However, as @z3z1ma has already started working on this, we can go with his approach.
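
A minimal sketch of the filter translation mentioned above (illustrative only; the metriql operator names and the field-id lookup are assumptions on my part):

# maps metriql-style filter operators to MBQL; only a few shown, names are assumptions
FILTER_OPERATORS = {"equals": "=", "not_equals": "!=", "greater_than": ">", "less_than": "<"}

def compile_filters(filters, field_ids):
    """Turn [{column, operator, value}, ...] into an MBQL "filter" clause."""
    clauses = [
        [FILTER_OPERATORS[f["operator"]], ["field-id", field_ids[f["column"]]], f["value"]]
        for f in filters
    ]
    return clauses[0] if len(clauses) == 1 else ["and"] + clauses

# compile_filters([{"column": "step", "operator": "equals", "value": "hired"}], {"step": 123})
# -> ["=", ["field-id", 123], "hired"]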

Our approach is similar to LookML from Looker, because the engine needs to understand the metric for better composability. If you have a filtered measure (your case example), it's likely that you want to use the same filter for other measures.

@remigabillet, yes, that's the plan! I will be using dbt-metabase in metriql for those use-cases, implement the metrics, and hopefully contribute back to dbt-metabase if there is any interest.

@z3z1ma I'm assuming that you're planning to let people embed Metabase expressions into YAML files, right? Let me know if you think this approach is complementary or out of your scope. Also, this compilation will only be one-way, so you won't be able to use existing Metabase expressions. Would that be a problem? What do you think?

jankatins commented 2 years ago

dbt now has metrics as well: https://github.com/dbt-labs/dbt-core/issues/4071 (discussion with link to the implementation PR)

willbryant commented 1 year ago

Yeah. Had a look into that; dbt metrics are sort of something different, as they result in building pre-aggregated tables in the DW, not just defining metric expressions.

IMHO a more limited scope version of the requested feature is valuable - simply being able to put a meta: metabase/metrics list with the Metabase native expressions in the manifest.

Although this does not attack the much harder problems discussed above, it would be enough to provide a test -> production workflow and a "keep metrics in sync with dbt models" mechanism, which, for me at least, would be very valuable and is directly in line with the other functionality this project offers.

One awkward thing is that since there is no persistence mechanism from dbt-metabase runs, we can't store the key of the created metric. So we would have to just look for them by name and update or create if missing. This implies renaming or removing metrics can't really be supported. Not the end of the world but thought I'd flag it.
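
Roughly, that upsert-by-name mechanism could look like this (a sketch, not project code; it assumes the GET/POST /api/metric and PUT /api/metric/:id endpoints referenced above and a pre-built MBQL definition):

import requests

def sync_metric(host, session_id, name, table_id, definition, description=None):
    """Create the metric if no metric with this name exists, otherwise update it."""
    headers = {"X-Metabase-Session": session_id}
    existing = requests.get(f"{host}/api/metric", headers=headers, timeout=30).json()
    payload = {
        "name": name,
        "table_id": table_id,
        "definition": definition,
        "description": description,
    }
    match = next((m for m in existing if m["name"] == name), None)
    if match:
        # metric updates expect a revision message alongside the changed fields
        payload["revision_message"] = "Synced from dbt by dbt-metabase"
        resp = requests.put(f"{host}/api/metric/{match['id']}", json=payload, headers=headers, timeout=30)
    else:
        resp = requests.post(f"{host}/api/metric", json=payload, headers=headers, timeout=30)
    resp.raise_for_status()
    return resp.json()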

ValdarT commented 1 year ago

Significant changes are coming on the dbt side, so it might be worth revisiting: https://www.getdbt.com/blog/dbt-semantic-layer-whats-next/

janmechtel commented 10 months ago

I was very excited to have found metriql through this thread, thank you.

However, it seems the project is not very active.

Are there any similar solutions?

gouline commented 6 months ago

No current plans to implement this, closing for now. If anyone is interested in contributing an implementation after 1.0 is released, we can re-open and continue the discussion.