pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Support for zstacking dataframes #8802

Open mkleinbort-ic opened 1 year ago

mkleinbort-ic commented 1 year ago

Problem description

The ask is for a way to stack polars dataframes along the z-direction

Example:

df1 = pl.DataFrame({'x1': [1,2,3], 'x2': ['a','b','c']})

shape: (3, 2)
┌─────┬─────┐
│ x1  ┆ x2  │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 1   ┆ a   │
│ 2   ┆ b   │
│ 3   ┆ c   │
└─────┴─────┘

df2 = pl.DataFrame({'x1': [True,False,False], 'x2': ['A','B','C']})

shape: (3, 2)
┌───────┬─────┐
│ x1    ┆ x2  │
│ ---   ┆ --- │
│ bool  ┆ str │
╞═══════╪═════╡
│ true  ┆ A   │
│ false ┆ B   │
│ false ┆ C   │
└───────┴─────┘

>>> df1.zstack(df2)

shape: (3, 2)
┌───────────┬───────────┐
│ x1        ┆ x2        │
│ ---       ┆ ---       │
│ struct[2] ┆ struct[2] │
╞═══════════╪═══════════╡
│ {1,true}  ┆ {"a","A"} │
│ {2,false} ┆ {"b","B"} │
│ {3,false} ┆ {"c","C"} │
└───────────┴───────────┘

The output is equivalent to

pl.DataFrame({
    'x1': [
        {'v1': 1, 'v2':True},
        {'v1': 2, 'v2':False},
        {'v1': 3, 'v2':False},
    ],
    'x2': [
        {'v1': 'a', 'v2':'A'},
        {'v1': 'b', 'v2':'B'},
        {'v1': 'c', 'v2':'C'},
    ]
})

Note that using 'v1', 'v2' as the keys is arbitrary - maybe this should be a parameter field_names of type list[str] on the zstack method.

stinodego commented 1 year ago

You can write the following:

df = (
    pl.concat([df1, df2.select(pl.all().suffix('_right'))], how='horizontal')
    .select([pl.struct(v1=c, v2=c+'_right').alias(c) for c in df1.columns])
)

Not sure if there should be functionality built in for this. It would probably be nice to have.

mkleinbort-ic commented 1 year ago

That's fair

I ended up with this implementation - I leave it here for reference in case someone else has a similar ask:

def zstack(
    dfs:dict[str,pl.DataFrame]|list[pl.DataFrame], 
    columns:list[str]|pl.Expr|None=None, 
    keep_original:bool=False)->pl.DataFrame:
    '''Zips list of dataframes so that the cells contain structs.

    Parameters
    ----------
    dfs: dict or list
        If a dict is provided, the keys of the dict will be used as the field names in the structs.
        If a list is provided, this is cast to a dict with key names 'df0', 'df1', ..., 'df{n}'.

    columns: list[str] or None (default None)
        List of the columns to be zipped into structs.
        If None it defaults to all the columns.

    keep_original: bool (default False)
        Keep the original columns with the prefixes.

    Examples
    --------
    >>> df_ans = zstack({
    ...     'A':  df1, 
    ...     'B':  df2
    ... }, columns=pl.all().exclude('userId'), keep_original=False)

    '''

    if not isinstance(dfs, dict): # Convert a list of dataframes into a dict
        dfs = {f'df{i}':df for i, df in enumerate(dfs)}  

    assert len({df.shape for df in dfs.values()}) == 1, 'The input dataframes are not all the same shape'
    assert len({tuple(df.columns) for df in dfs.values()}) == 1, 'The input dataframes do not have consistent column names'

    struct_field_names = list(dfs.keys())
    df0 = dfs[struct_field_names[0]] # Used for referencing column names
    all_original_columns = df0.columns.copy()

    # Figure out which columns will be converted into structs
    if columns is None:
        columns = all_original_columns
    elif isinstance(columns, pl.Expr):
        columns = df0.select(columns).columns
    else: # An explicit list of columns was provided
        assert all(c in all_original_columns for c in columns), 'The provided columns are not a subset of the columns in the dataframes'

    other_columns = [c for c in all_original_columns if c not in columns]

    dfs_suffixed = [
            df.select(pl.col(columns).suffix(f'_{key}')) for key, df in dfs.items()
        ]

    col_tuples = [
        [f'{c}_{key}' for key in dfs.keys()] for c in columns
    ]

    df_combined = pl.concat(dfs_suffixed, how='horizontal')

    # This defines the struct columns
    expressions = [pl.struct(**{k:ci for k,ci in zip(struct_field_names, cols)}).alias(c) for c, cols in zip(columns, col_tuples)]

    df = df_combined.with_columns(
        *expressions
    )

    df = pl.concat((df0.select(other_columns), df), how='horizontal')

    if keep_original is False:
        df = df.select(all_original_columns)

    return df 
stinodego commented 1 year ago

Yep, that's about how I would do it.

I'll leave this issue open in case we want to implement this as part of Polars.

mcrumiller commented 1 year ago

Is this called z-stacking elsewhere? If this was implemented, I wonder if we could also have an easy way to reference each stack. Like df.select_stack(0).select("...").

It's sort of a cover-up way of having 3D frames, represented internally by structs. It might end up adding a ton of complexity, unless we didn't allow many operations on the stacked frames.

mkleinbort-ic commented 1 year ago

Two thoughts on the above:

Is it called z-stacking?

Not sure, I was trying to keep the terminology consistent with h-stacking, and v-stacking (though then should it have been d(epthwise)-stacking?)

I've seen it here

Regarding select_stack and 3D frames in general...

I don't support this - polars is a great library for tabular data analysis, and the API would get horribly complex if it started to support n-D frames.

That said, I'd love for the xarray team to be inspired by polars and build something a bit more modern.

mcrumiller commented 1 year ago

That was my thought as well: if you're going to use 3D frames, use numpy, as you're probably not dealing with dataframes anymore.

ritchie46 commented 1 year ago

I definitely wouldn't call this z-stack. The name is really unclear to me and I don't want to think in any other dimension than horizontally or vertically with DataFrames.

However it looks to me like a zip operation. We might have a zip_columns operation. I do think it should be constrained by equal schema and size.

mkleinbort-ic commented 1 year ago

Yes - to be fair, in my code it's called zip_with

e.g.

df.my_namespace.zip_with(df2)

One challenge I had was that

df.zip_with(df2).zip_with(df3)

is obviously not the same as

df.zip_with([df2, df3])

Here is my latest implementation:

def zip_polar_frames(dfs:dict[str,pl.DataFrame]|list[pl.DataFrame], columns:list[str]|pl.Expr|None=None, keep_original=False)->pl.DataFrame:
    '''Zips list of dataframes so that the cells contain structs.

    Parameters
    ----------
    dfs: dict or list
        If a dict is provided, the keys of the dict will be used as the field names in the structs.
        If a list is provided, this is cast to a dict with key names 'df0', 'df1', ..., 'df{n}'.

    columns: list[str] or None (default None)
        List of the columns to be zipped into structs.
        If None it defaults to all the columns.

    keep_original: bool (default False)
        Keep the original columns with the prefixes.

    '''

    if not isinstance(dfs, dict):
        dfs = {f'df{i}':df for i, df in enumerate(dfs)}  

    assert len({df.shape for df in dfs.values()}) == 1, 'The input dataframes are not all the same shape'
    assert len({tuple(df.columns) for df in dfs.values()}) == 1, 'The input dataframes do not have consistent column names'

    struct_field_names = list(dfs.keys())
    df0 = dfs[struct_field_names[0]] # Used for referencing column names
    all_original_columns = df0.columns.copy()

    if columns is None:
        columns = all_original_columns
    elif isinstance(columns, pl.Expr):
        columns = df0.select(columns).columns
    else:
        assert all(c in all_original_columns for c in columns), 'The provided columns are not a subset of the columns in the dataframes'

    other_columns = [c for c in all_original_columns if c not in columns]

    dfs_suffixed = [
            df.select(pl.col(columns).suffix(f'_{key}')) for key, df in dfs.items()
        ]

    col_tuples = [
        [f'{c}_{key}' for key in dfs.keys()] for c in columns
    ]

    df_combined = pl.concat(dfs_suffixed, how='horizontal')

    # This defines the struct columns
    expressions = [pl.struct(**{k:ci for k,ci in zip(struct_field_names, cols)}).alias(c) for c, cols in zip(columns, col_tuples)]

    df = df_combined.with_columns(
        *expressions
    )

    df = pl.concat((df0.select(other_columns), df), how='horizontal')

    if keep_original is False:
        df = df.select(all_original_columns)

    return df 

def _zip_with(self, dfs:pl.DataFrame|list[pl.DataFrame], keys:list[str]|None=None, columns=pl.all(), keep_original:bool=False):

    if isinstance(dfs, list):
        dfs = [self] + list(dfs)
    else:
        dfs = [self, dfs]

    assert keys is None or len(keys) == len(dfs), 'keys must match the number of dataframes (including self)'
    if keys is not None:
        dfs = {k:df for k,df in zip(keys, dfs)}

    return zip_polar_frames(dfs, columns=columns, keep_original=keep_original)

All that said, I don't think this should be in the core API (it's doing too much stuff)