Open mkleinbort-ic opened 1 year ago
You can write the following:
```python
df = (
    pl.concat([df1, df2.select(pl.all().suffix('_right'))], how='horizontal')
    .select([pl.struct(v1=c, v2=c + '_right').alias(c) for c in df1.columns])
)
```
Not sure if there should be functionality built in for this. It would probably be nice to have.
That's fair
I ended up with this implementation - I leave it here for reference in case someone else has a similar ask:
```python
def zstack(
    dfs: dict[str, pl.DataFrame] | list[pl.DataFrame],
    columns: list[str] | pl.Expr | None = None,
    keep_original: bool = False,
) -> pl.DataFrame:
    '''Zips a list of dataframes so that the cells contain structs.

    Parameters
    ----------
    dfs: dict or list
        If a dict is provided, the keys of the dict will be used as the field names in the structs.
        If a list is provided, it is cast to a dict with key names 'df0', 'df1', ..., 'df{n}'.
    columns: list[str], pl.Expr or None (default None)
        List of the columns to be zipped into structs.
        If None, it defaults to all the columns.
    keep_original: bool (default False)
        Keep the original columns with the suffixes.

    Examples
    --------
    >>> df_ans = zstack({
    ...     'A': df1,
    ...     'B': df2
    ... }, columns=pl.all().exclude('userId'), keep_original=False)
    '''
    if not isinstance(dfs, dict):  # Convert a list of dataframes into a dict
        dfs = {f'df{i}': df for i, df in enumerate(dfs)}

    assert len({df.shape for df in dfs.values()}) == 1, 'The input dataframes are not all the same shape'
    assert len({tuple(df.columns) for df in dfs.values()}) == 1, 'The input dataframes do not have consistent column names'

    struct_field_names = list(dfs.keys())
    df0 = dfs[struct_field_names[0]]  # Used for referencing column names
    all_original_columns = df0.columns.copy()

    # Figure out which columns will be converted into structs
    if columns is None:
        columns = all_original_columns
    elif isinstance(columns, pl.Expr):
        columns = df0.select(columns).columns
    else:  # An explicit list of columns was provided
        assert all(c in all_original_columns for c in columns), 'The provided columns are not a subset of the columns in the dataframes'

    other_columns = [c for c in all_original_columns if c not in columns]

    dfs_suffixed = [
        df.select(pl.col(columns).suffix(f'_{key}')) for key, df in dfs.items()
    ]
    col_tuples = [
        [f'{c}_{key}' for key in dfs.keys()] for c in columns
    ]

    df_combined = pl.concat(dfs_suffixed, how='horizontal')

    # This defines the struct columns
    expressions = [
        pl.struct(**{k: ci for k, ci in zip(struct_field_names, cols)}).alias(c)
        for c, cols in zip(columns, col_tuples)
    ]

    df = df_combined.with_columns(*expressions)
    df = pl.concat((df0.select(other_columns), df), how='horizontal')

    if keep_original is False:
        df = df.select(all_original_columns)

    return df
```
Yep, that's about how I would do it.
I'll leave this issue open in case we want to implement this as part of Polars.
Is this called z-stacking elsewhere? If this was implemented, I wonder if we could also have an easy way to reference each stack, like df.select_stack(0).select("...").
It's sort of a cover-up way of having 3D frames, represented internally by structs. It might end up adding a ton of complexity, unless we didn't allow many operations on the stacked frames.
Two thoughts on the above:
1. Not sure; I was trying to keep the terminology consistent with h-stacking and v-stacking (though then should it have been d(epthwise)-stacking?)
2. I've seen it here
I don't support this, polars is a great library for tabular data analysis, and the API would get horribly complex if you start to support n-D frames
That said, I'd love for the xarray team to be inspired by polars and build something a bit more modern.
That was my thought as well: if you're going to use 3D frames, use numpy, as you're probably not dealing with dataframes anymore.
I definitely wouldn't call this z-stack. The name is really unclear to me and I don't want to think in any other dimension than horizontally or vertically with DataFrames.
However, it looks to me like a zip operation. We might have a zip_columns operation. I do think it should be constrained by equal schema and size.
Yes, to be fair in my code it's called zip_with
e.g.
df.my_namespace.zip_with(df2)
One challenge I had was that
```python
df.zip_with(df2).zip_with(df3)
```
is obviously not the same as
```python
df.zip_with([df2, df3])
```
Here is my latest implementation:
```python
def zip_polar_frames(
    dfs: dict[str, pl.DataFrame] | list[pl.DataFrame],
    columns: list[str] | pl.Expr | None = None,
    keep_original: bool = False,
) -> pl.DataFrame:
    '''Zips a list of dataframes so that the cells contain structs.

    Parameters
    ----------
    dfs: dict or list
        If a dict is provided, the keys of the dict will be used as the field names in the structs.
        If a list is provided, it is cast to a dict with key names 'df0', 'df1', ..., 'df{n}'.
    columns: list[str], pl.Expr or None (default None)
        List of the columns to be zipped into structs.
        If None, it defaults to all the columns.
    keep_original: bool (default False)
        Keep the original columns with the suffixes.
    '''
    if not isinstance(dfs, dict):
        dfs = {f'df{i}': df for i, df in enumerate(dfs)}

    assert len({df.shape for df in dfs.values()}) == 1, 'The input dataframes are not all the same shape'
    assert len({tuple(df.columns) for df in dfs.values()}) == 1, 'The input dataframes do not have consistent column names'

    struct_field_names = list(dfs.keys())
    df0 = dfs[struct_field_names[0]]  # Used for referencing column names
    all_original_columns = df0.columns.copy()

    if columns is None:
        columns = all_original_columns
    elif isinstance(columns, pl.Expr):
        columns = df0.select(columns).columns
    else:
        assert all(c in all_original_columns for c in columns), 'The provided columns are not a subset of the columns in the dataframes'

    other_columns = [c for c in all_original_columns if c not in columns]

    dfs_suffixed = [
        df.select(pl.col(columns).suffix(f'_{key}')) for key, df in dfs.items()
    ]
    col_tuples = [
        [f'{c}_{key}' for key in dfs.keys()] for c in columns
    ]

    df_combined = pl.concat(dfs_suffixed, how='horizontal')

    # This defines the struct columns
    expressions = [
        pl.struct(**{k: ci for k, ci in zip(struct_field_names, cols)}).alias(c)
        for c, cols in zip(columns, col_tuples)
    ]

    df = df_combined.with_columns(*expressions)
    df = pl.concat((df0.select(other_columns), df), how='horizontal')

    if keep_original is False:
        df = df.select(all_original_columns)

    return df
```
```python
def _zip_with(
    self,
    dfs: pl.DataFrame | list[pl.DataFrame],
    keys: list[str] | None = None,
    columns=pl.all(),
    keep_original: bool = False,
):
    if isinstance(dfs, list):
        dfs = [self] + list(dfs)
    else:
        dfs = [self, dfs]

    assert keys is None or len(keys) == len(dfs)

    if keys is not None:
        dfs = {k: df for k, df in zip(keys, dfs)}

    return zip_polar_frames(dfs, columns=columns, keep_original=keep_original)
```
All that said, I don't think this should be in the core API (it's doing too much stuff)
Problem description
The ask is for a way to stack polars dataframes along the z-direction.
Example:
The output is equivalent to
Note that 'v1', 'v2' as keys is arbitrary - maybe this should be a parameter field_names of type list[str] to the zstack method.