ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0
5.23k stars 592 forks source link

feat: support unnest inside aggregate #9048

Open amoeba opened 6 months ago

amoeba commented 6 months ago

Is your feature request related to a problem?

No response

What is the motivation behind your request?

In older versions of Ibis (I think before the sqlglot refactor?), I was able to use unnest inside my aggregate calls. Something this once worked:

t = ibis.memtable({"id": [1, 1,2], "name":["a", "a", "b"]})
t
┏━━━━━━━┳━━━━━━━━┓
┃ id    ┃ name   ┃
┡━━━━━━━╇━━━━━━━━┩
│ int64 │ string │
├───────┼────────┤
│     1 │ a      │
│     1 │ a      │
│     2 │ b      │
└───────┴────────┘
t.group_by(t.id).aggregate(name=t.name.collect().unique().unnest())

At least with Ibis 8.0.0, it gives an error which includes text like "is not a tuple of coercibles to a Value[+T, Scalar]" presumably because it's getting the wrong shape. The direct solution today is to move the unnest operation into a mutate call,

t.group_by(t.id).aggregate(name=t.name.collect().unique()).mutate(name=_.name.unnest())

However, this type of operation composes pretty naturally with DuckDB and I don't have to think about breaking the unnest call out from the rest of my aggregation:

CREATE TABLE t (id INTEGER, name TEXT);
INSERT INTO t (id, name) VALUES (1, 'a'), (1, 'a'), (2, 'b');
SELECT id, unnest(list_distinct(list(name))) as name FROM t GROUP BY id;
┌───────┬─────────┐
│  id   │  name   │
│ int32 │ varchar │
├───────┼─────────┤
│     1 │ a       │
│     2 │ b       │
└───────┴─────────┘

Describe the solution you'd like

It would be nice if a more actionable error were produced or if this was just possible without any changes.

What version of ibis are you running?

8.0.0

What backend(s) are you using, if any?

DuckDB, memtable

Code of Conduct

ncclementi commented 6 months ago

@amoeba thanks for the report, I can reproduce this on main too. Not sure what's happening yet, but leaving the traceback here for future reference

SignatureValidationError                  Traceback (most recent call last)
Cell In[4], line 1
----> 1 t.group_by(t.id).aggregate(name=t.name.collect().unique().unnest())

File ~/Documents/git/my_forks/ibis/ibis/expr/types/groupby.py:68, in GroupedTable.aggregate(self, *metrics, **kwds)
     66 def aggregate(self, *metrics, **kwds) -> ir.Table:
     67     """Compute aggregates over a group by."""
---> 68     return self.table.to_expr().aggregate(
     69         metrics, by=self.groupings, having=self.havings, **kwds
     70     )

File ~/Documents/git/my_forks/ibis/ibis/expr/types/relations.py:1159, in Table.aggregate(self, metrics, by, having, **kwargs)
   1156             metrics[metric.name] = metric
   1158 # construct the aggregate node
-> 1159 agg = ops.Aggregate(node, groups, metrics).to_expr()
   1161 if having:
   1162     # apply the having clause
   1163     agg = agg.filter(*having)

File ~/Documents/git/my_forks/ibis/ibis/common/bases.py:72, in AbstractMeta.__call__(cls, *args, **kwargs)
     52 def __call__(cls, *args, **kwargs):
     53     """Create a new instance of the class.
     54 
     55     The subclass may override the `__create__` classmethod to change the
   (...)
     70 
     71     """
---> 72     return cls.__create__(*args, **kwargs)

File ~/Documents/git/my_forks/ibis/ibis/common/grounds.py:119, in Annotable.__create__(cls, *args, **kwargs)
    116 @classmethod
    117 def __create__(cls, *args: Any, **kwargs: Any) -> Self:
    118     # construct the instance by passing only validated keyword arguments
--> 119     kwargs = cls.__signature__.validate(cls, args, kwargs)
    120     return super().__create__(**kwargs)

File ~/Documents/git/my_forks/ibis/ibis/common/annotations.py:497, in Signature.validate(self, func, args, kwargs)
    494         this[name] = result
    496 if errors:
--> 497     raise SignatureValidationError(
    498         "{call} has failed due to the following errors:{errors}\n\nExpected signature: {sig}",
    499         sig=self,
    500         func=func,
    501         args=args,
    502         kwargs=kwargs,
    503         errors=errors,
    504     )
    506 return this

SignatureValidationError: Aggregate(<ibis.expr.operations.relations.InMemoryTable object at 0x11fff1d50>, {'id': <ibis.expr.operations.relations.Field object at 0x1058a3880>}, {'name': <ibis.expr.operations.arrays.Unnest object at 0x12c1c32e0>}) has failed due to the following errors:
  `metrics`: {'name': <ibis.expr.operations.arrays.Unnest object at 0x12c1c32e0>} is not matching GenericMappingOf(key=InstanceOf(type=<class 'str'>), value=AllOf(patterns=(GenericCoercedTo(origin=<class 'ibis.expr.operations.core.Value'>, params={'T': +T, 'S': <class 'ibis.expr.datashape.Scalar'>}, checker=GenericInstanceOf(type=ibis.expr.operations.core.Value[+T, ibis.expr.datashape.Scalar], origin=<class 'ibis.expr.operations.core.Value'>, fields={'dtype': CoercedTo(type=<class 'ibis.expr.datatypes.core.DataType'>, func=<bound method DataType.__coerce__ of <class 'ibis.expr.datatypes.core.DataType'>>), 'shape': InstanceOf(type=<class 'ibis.expr.datashape.Scalar'>)})), Not(pattern=InstanceOf(type=<class 'ibis.expr.operations.core.Alias'>)))), type=CoercedTo(type=<class 'ibis.common.collections.FrozenDict'>, func=<class 'ibis.common.collections.FrozenDict'>))

Expected signature: Aggregate(parent: Relation, groups: FrozenDict[str, Annotated[Value[+T, Columnar], Not(pattern=InstanceOf(type=<class 'Alias'>))]], metrics: FrozenDict[str, Annotated[Value[+T, Scalar], Not(pattern=InstanceOf(type=<class 'Alias'>))]])