ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0

feat(Impala): support .distinct() for a subset of columns #10370

Open contang0 opened 1 day ago

contang0 commented 1 day ago

Is your feature request related to a problem?

At the moment the Impala backend only supports .distinct() on a full table.

This works:

table.distinct()

This does not:

table.distinct(on=['col1', 'col2'])

Translation to backend failed
Error message: OperationNotDefinedError("Compilation rule for 'First' operation is not defined")
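
For readers unfamiliar with the requested behaviour, here is a minimal plain-Python sketch of what distinct on a subset of columns is expected to return: one row per unique combination of the key columns. It assumes "keep the first row seen per key" semantics, which the 'First' operation named in the error message suggests. The helper name `distinct_on` and the sample rows are hypothetical and are not part of the ibis API.

```python
def distinct_on(rows, on):
    """Return one row per unique combination of the `on` columns.

    Illustrative only: keeps the first row encountered for each key.
    """
    seen = {}
    for row in rows:
        key = tuple(row[col] for col in on)
        # dicts preserve insertion order; setdefault keeps the first row per key
        seen.setdefault(key, row)
    return list(seen.values())


# Hypothetical sample data: the first two rows share the same (col1, col2) key.
rows = [
    {"col1": 1, "col2": "a", "col3": 10},
    {"col1": 1, "col2": "a", "col3": 20},
    {"col1": 2, "col2": "b", "col3": 30},
]
result = distinct_on(rows, on=["col1", "col2"])
```

In this sketch the non-key column (col3) takes its value from whichever row is kept, which is why a backend needs some per-group "pick one value" aggregation to support the on argument.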

What is the motivation behind your request?

This forces me to write verbose workarounds.

.distinct() on a subset of a table is pretty fundamental, in my view.

Describe the solution you'd like

The on clause in .distinct() should work.

What version of ibis are you running?

10.5

What backend(s) are you using, if any?

Impala

NickCrews commented 1 day ago

A workaround in the meantime should be something like:

def distinct(t, on):
    # Keep one arbitrary value per non-key column within each group of `on`.
    aggs = {col: t[col].arbitrary() for col in t.columns if col not in on}
    return t.group_by(on).agg(**aggs)

contang0 commented 1 day ago

Thank you, will give it a try!