pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Column level lineage #11031

Open Haeri opened 1 year ago

Haeri commented 1 year ago

Problem description

Dear polars team,

It would be very useful to introduce a feature that allows tracing back columns within dataframes and exposing their composition in the Polars library. Such functionality would have a significant impact on data governance and pipeline validation.

On the user side, implementing such a feature is challenging due to the unpredictability of transformations. However, within the library itself, capturing the inputs used to generate new outputs should be a straightforward process.

For the implementation, my naive suggestion would be to introduce a UUID list per column and have every transformation inherit the IDs of its input columns.

For example:

DF_A:

Concatenating name and last_name to create full_name could then give

DF_B:

Therefore, we would know from which columns full_name was constructed without looking into the code.
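
To make the suggestion concrete, here is a rough sketch of the ID-inheritance idea in plain Python rather than Polars internals. The `LineageTable` class and its `derive` and `sources` methods are hypothetical names made up for illustration; nothing like them exists in Polars today.

```python
import uuid


class LineageTable:
    """Toy model (hypothetical): tracks which source-column IDs each column carries."""

    def __init__(self, columns):
        self.ids = {}     # column name -> set of inherited IDs
        self.origin = {}  # ID -> name of the source column it was created for
        for name in columns:
            col_id = uuid.uuid4().hex
            self.ids[name] = {col_id}
            self.origin[col_id] = name

    def derive(self, output, inputs):
        """Register `output` as computed from `inputs`, inheriting all of their IDs."""
        self.ids[output] = set().union(*(self.ids[name] for name in inputs))
        return self

    def sources(self, column):
        """Return the names of the source columns whose IDs `column` carries."""
        return sorted(self.origin[col_id] for col_id in self.ids[column])


table = LineageTable(["id", "name", "last_name"])
table.derive("full_name", ["name", "last_name"])
print(table.sources("full_name"))  # ['last_name', 'name']
```

Because IDs are inherited as a set, a column derived from full_name later on would still resolve back to name and last_name.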

Best regards, Haeri

MarcoGorelli commented 1 year ago

If I've understood the request, this is already possible with Expr.meta:

In [19]: full_name = (pl.col('name') + pl.col('last_name')).alias('full_name')

In [20]: full_name.meta.root_names()
Out[20]: ['name', 'last_name']

In [21]: full_name.meta.output_name()
Out[21]: 'full_name'

stinodego commented 1 year ago

Indeed, as Marco showed, you can do this with the meta namespace.

If this does not address your needs, please re-open this and be more specific on the exact functionality you're looking for.

Haeri commented 1 year ago

Thanks for the quick response!

The meta suggestion seems to be very close to what I need, but unfortunately I was not able to get it fully working. It seems that meta is only available on an expression, not on a dataframe. Is it possible to extract meta for a dataframe as well?

Here is my expected code:

import polars as pl

df = pl.DataFrame({
    'id': [1, 2, 3],
    'name': ['peter', 'andreas', 'alex'],
    'last_name': ['parker', 'ânderson', 'kordz'],
})

concat_df = df.with_columns(
    pl.concat_str(
        [
            pl.col("name"),
            pl.col("last_name"),
        ],
        separator=" ",
    ).alias("full_name"),
)

print(concat_df["full_name"].meta.root_names()) # AttributeError: 'Series' object has no attribute 'meta'

cmdlineluser commented 1 year ago

I think your updated example warrants re-opening this, as @stinodego suggested:

"If this does not address your needs, please re-open this and be more specific on the exact functionality you're looking for."

Haeri commented 1 year ago

Unfortunately, I am unable to re-open my own issue, since it was closed by a repo collaborator.

ldacey commented 1 year ago

I had no idea about the meta namespace, that is neat. I wonder whether I could use it to add column lineage in OpenLineage. Is there a description of the Expr available, or is it serializable? For example, is there any meaningful data I could put in "transformationType" or "transformationDescription"?

https://github.com/OpenLineage/OpenLineage/blob/main/spec/facets/ColumnLineageDatasetFacet.md

  "inputs": [
    {
      "namespace": "my-datasource-namespace",
      "name": "instance.schema.table",
      "facets": {
        "schema": {
          "fields": [
            { "name": "ia", "type": "INT"},
            { "name": "ib", "type": "INT"}
          ]
        }
      }
    }
  ],
  "outputs": [
    {
      "namespace": "my-datasource-namespace",
      "name": "instance.schema.output_table",
      "facets": {
        "schema": {
          "fields": [
            { "name": "a", "type": "INT"},
            { "name": "b", "type": "INT"}
          ]
        },
        "columnLineage": {
          "fields": {
            "a": {
              "inputFields": [
                { "namespace": "my-datasource-namespace", "name": "instance.schema.table", "field": "ia" },
                ... other inputs
              ],
              "transformationDescription": "identical",
              "transformationType": "IDENTITY"
            },
            "b": ... other output fields
          }
        }
      }
    }
  ],

cmdlineluser commented 1 year ago

@ldacey There is meta.write_json() (just using json.loads here to "pretty print" the result)

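import json

import polars as pl

# Serialize the expression and parse the JSON so it prints as a nested dict.
json.loads(
    pl.concat_str(
        [
            pl.col("name"),
            pl.col("last_name"),
        ],
        separator=" ",
    ).alias("full_name").meta.write_json()
)

{'Alias': [{'Function': {'input': [{'Column': 'name'},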
     {'Column': 'last_name'}],
    'function': {'StringExpr': {'ConcatHorizontal': ' '}},
    'options': {'collect_groups': 'ApplyFlat',
     'fmt_str': '',
     'input_wildcard_expansion': True,
     'auto_explode': True,
     'cast_to_supertypes': False,
     'allow_rename': False,
     'pass_name_to_apply': False,
     'changes_length': False,
     'check_lengths': True,
     'allow_group_aware': True}}},
  'full_name']}
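
If you want the input columns out of that structure, a small recursive walk is enough. This is only a sketch against the dict printed above (with the options trimmed); `collect_columns` is a hypothetical helper, and the serialized layout is an assumption that may differ between Polars versions.

```python
def collect_columns(node):
    """Recursively gather every {'Column': name} leaf in a serialized expression."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "Column" and isinstance(value, str):
                found.append(value)
            else:
                found.extend(collect_columns(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_columns(item))
    return found


# Trimmed copy of the output shown above; the 'options' entry is left out.
serialized = {
    "Alias": [
        {
            "Function": {
                "input": [{"Column": "name"}, {"Column": "last_name"}],
                "function": {"StringExpr": {"ConcatHorizontal": " "}},
            }
        },
        "full_name",
    ]
}

print(collect_columns(serialized))  # ['name', 'last_name']
```

The same walk could pick up the 'function' entry if you need something to put in a transformation description.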

Haeri commented 11 months ago

Any chance for this to be considered in the future?