Open Haeri opened 1 year ago
If I've understood the request, this is already possible with Expr.meta:
In [19]: full_name = (pl.col('name') + pl.col('last_name')).alias('full_name')
In [20]: full_name.meta.root_names()
Out[20]: ['name', 'last_name']
In [21]: full_name.meta.output_name()
Out[21]: 'full_name'
Indeed, as Marco showed, you can do this with the meta namespace.
If this does not address your needs, please re-open this and be more specific on the exact functionality you're looking for.
Thanks for the quick response!
The meta suggestion seems very close to what I need, but unfortunately I was not able to get it fully working: meta seems to be available only on an expression, not on a dataframe. Is it possible to extract meta for a dataframe as well?
Here is my expected code:
import polars as pl

df = pl.DataFrame({
    'id': [1, 2, 3],
    'name': ['peter', 'andreas', 'alex'],
    'last_name': ['parker', 'ânderson', 'kordz'],
})

concat_df = df.with_columns(
    pl.concat_str(
        [
            pl.col("name"),
            pl.col("last_name"),
        ],
        separator=" ",
    ).alias("full_name"),
)

print(concat_df["full_name"].meta.root_names())  # AttributeError: 'Series' object has no attribute 'meta'
I think your updated example warrants re-opening this as @stinodego suggested.
"If this does not address your needs, please re-open this and be more specific on the exact functionality you're looking for."
Unfortunately, I am unable to re-open my own issue, since it was closed by a repo collaborator.
I had no idea about the meta namespace, that is neat. I wonder if I can add column lineage in OpenLineage using that. Is there some description of the Expr, or is it serializable? For example, is there any meaningful data I could add in "transformationType" or "transformationDescription"?
https://github.com/OpenLineage/OpenLineage/blob/main/spec/facets/ColumnLineageDatasetFacet.md
"inputs": [
  {
    "namespace": "my-datasource-namespace",
    "name": "instance.schema.table",
    "facets": {
      "schema": {
        "fields": [
          { "name": "ia", "type": "INT"},
          { "name": "ib", "type": "INT"}
        ]
      },
    }
  }
],
"outputs": [
  {
    "namespace": "my-datasource-namespace",
    "name": "instance.schema.output_table",
    "facets": {
      "schema": {
        "fields": [
          { "name": "a", "type": "INT"},
          { "name": "b", "type": "INT"}
        ]
      },
      "columnLineage": {
        "fields": {
          "a": {
            "inputFields": [
              {namespace: "my-datasource-namespace", name: "instance.schema.table", "field": "ia"},
              ... other inputs
            ],
            transformationDescription: "identical",
            transformationType: "IDENTITY"
          },
          "b": ... other output fields
        }
      }
    }
  }
],
@ldacey There is meta.write_json() (just using json.loads here to "pretty print" the result):
import json
import polars as pl

json.loads(
    pl.concat_str(
        [
            pl.col("name"),
            pl.col("last_name"),
        ],
        separator=" ",
    ).alias("full_name").meta.write_json()
)
{'Alias': [{'Function': {'input': [{'Column': 'name'},
{'Column': 'last_name'}],
'function': {'StringExpr': {'ConcatHorizontal': ' '}},
'options': {'collect_groups': 'ApplyFlat',
'fmt_str': '',
'input_wildcard_expansion': True,
'auto_explode': True,
'cast_to_supertypes': False,
'allow_rename': False,
'pass_name_to_apply': False,
'changes_length': False,
'check_lengths': True,
'allow_group_aware': True}}},
'full_name']}
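For the OpenLineage question, the root and output names may already be enough to fill inputFields. Here is a minimal sketch, assuming the lineage has been collected as a plain mapping from output column to input columns (for instance from meta.output_name() and meta.root_names()). The function name column_lineage_facet and the empty defaults for the two transformation fields are illustrative choices, not something either library prescribes.

```python
import json


def column_lineage_facet(lineage, namespace, dataset,
                         description="", transformation_type=""):
    """Build a dict shaped like the columnLineage facet quoted earlier.

    lineage: {output_column: [input_column, ...]}
    namespace / dataset: identify the input dataset (values are up to the caller).
    """
    fields = {}
    for output_col, input_cols in lineage.items():
        fields[output_col] = {
            "inputFields": [
                {"namespace": namespace, "name": dataset, "field": col}
                for col in input_cols
            ],
            "transformationDescription": description,
            "transformationType": transformation_type,
        }
    return {"columnLineage": {"fields": fields}}


facet = column_lineage_facet(
    {"full_name": ["name", "last_name"]},
    namespace="my-datasource-namespace",
    dataset="instance.schema.table",
)
print(json.dumps(facet, indent=2))
```

The serialized expression could in principle be passed as the description string, but whether OpenLineage consumers can do anything useful with it is an open question.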
Any chance for this to be considered in the future?
Problem description
Dear polars team,
It would be very useful to introduce a feature that allows tracing back columns within dataframes and exposing their composition in the Polars library. Such functionality would have a significant impact on data governance and pipeline validation.
On the user side, implementing such a feature is challenging due to the unpredictability of transformations. However, within the library itself, capturing the inputs used to generate new outputs should be a straightforward process.
For the implementation, my naive suggestion would be to introduce a UUID list per column and have every transformation pass the IDs of its inputs on to its output.
For example:
DF_A:
Now, concatenating name and last_name to create full_name could give
DF_B:
Therefore, we would know from which columns full_name was constructed without looking into the code.
Best regards, Haeri
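The UUID-inheritance idea from the problem description can be modelled outside of Polars in a few lines. This is only a toy sketch to make the proposal concrete: the class LineageTable and its methods are hypothetical, it tracks column names and IDs but no data, and it says nothing about how Polars would implement this internally.

```python
import uuid


class LineageTable:
    """Toy model: column name -> set of source UUIDs. Not Polars code."""

    def __init__(self, columns):
        # Every original column starts with its own fresh ID.
        self.ids = {name: {uuid.uuid4()} for name in columns}

    def derive(self, new_name, inputs):
        """Return a new table where new_name inherits the IDs of its inputs."""
        out = LineageTable([])
        out.ids = {name: set(ids) for name, ids in self.ids.items()}
        out.ids[new_name] = set().union(*(self.ids[i] for i in inputs))
        return out

    def sources(self, column, origin):
        """Names of columns in origin whose IDs the given column inherited."""
        return sorted(
            name for name, ids in origin.ids.items() if ids & self.ids[column]
        )


df_a = LineageTable(["id", "name", "last_name"])
df_b = df_a.derive("full_name", ["name", "last_name"])
print(df_b.sources("full_name", df_a))  # ['last_name', 'name']
```

Because IDs are unioned at every step, a column derived later from full_name would still trace back to name and last_name in the original table.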