malloydata / malloy

Malloy is an experimental language for describing data relationships and transformations.
http://www.malloydata.dev
MIT License
1.96k stars 76 forks source link

Write column lineage to query metadata #1141

Open carlineng opened 1 year ago

carlineng commented 1 year ago

From a conversation today with @lloydtabb --

For any given field in a query, Malloy understands what were all the inputs to that field, and thus we can fully understand the lineage of every column. We should write this lineage to query metadata.

For example, if we have a source:

source: order_items is order_items_table + {
  primary_key: id

  join_one:
    users with user_id

  measure:
    revenue_per_user is sale_price.sum() / count(distinct users.id)

and a query:

query: order_items -> {
  group_by: created_month is created_at.month
  aggregate: revenue_per_user
}

there should be metadata associated with each column in the resultset that describes the lineage of that column.

In this example, created_month comes from order_items.created_at, and revenue_per_user comes from order_items.sale_price and order_items.users.id. This is a simple example, but if one or more of those fields was another complex field that had multiple inputs, then this lineage forms a tree structure.

A few examples of tools that support this:

kylelundstedt commented 7 months ago

I'm not sure @carlineng, but it seems like many databases (e.g., DuckDB, Snowflake) have a COMMENT ON SQL statement that would allow this column lineage information to be added to the metadata for the column, too.