pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Unique / Duplicate Enhancements #6125

Open Julian-J-S opened 1 year ago

Julian-J-S commented 1 year ago

Problem description

This is an extension to #5590 with some examples and explanations.

Problems / Issues

  1. unique: There is no way to remove ALL duplicates, as in pandas (currently it is only possible to keep the first/last row of each group of duplicates)
  2. unique / is_unique: When duplicates are present, there is no way to get the same result from unique as from filtering with is_unique, which is really awkward
  3. duplicate: There is no concise/semantic way to get the duplicate rows. duplicate should be added for completeness (is_unique and unique are available, but only is_duplicated exists)
  4. is_duplicated: There is no keep argument, as in pandas duplicated, to specify which duplicates to mark
  5. There is no fast/elegant way to get a boolean mask for a column of lists that contain duplicates. has_duplicates would be nice on Expr.arr (currently this requires a workaround comparing the unique length to the original length)

1. unique: add another keep option
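As a sketch of the proposed semantics, a stand-alone plain-Python helper (hypothetical; not the actual Polars API) shows what a keep="none" option would mean: every member of a duplicated group is dropped, as in pandas drop_duplicates(keep=False).

```python
from collections import Counter

def unique(values, keep="first"):
    """Hypothetical helper sketching the proposed `keep` options."""
    counts = Counter(values)
    if keep == "none":
        # drop ALL members of duplicated groups
        return [v for v in values if counts[v] == 1]
    if keep == "last":
        # keep the last occurrence of each value, preserving original order
        return list(reversed(unique(list(reversed(values)), keep="first")))
    # keep == "first": keep the first occurrence of each value
    seen, out = set(), []
    for v in values:
        if v not in seen:
            seen.add(v)
            out.append(v)
    return out

print(unique([1, 2, 3, 2, 1], keep="none"))   # [3]
print(unique([1, 2, 3, 2, 1], keep="first"))  # [1, 2, 3]
print(unique([1, 2, 3, 2, 1], keep="last"))   # [3, 2, 1]
```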

2. unique and is_unique: inconsistent results

df.is_unique(keep="none")  # default / current behavior
> [False, True, True, False]
df.is_unique(keep="first")
> [True, True, True, False]
df.is_unique(keep="last")
> [False, True, True, True]
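The proposed keep-aware mask can be sketched in plain Python (a hypothetical stand-alone helper, not the actual Polars API, using [1, 2, 3, 1] as illustrative data consistent with the outputs above):

```python
from collections import Counter

def is_unique(values, keep="none"):
    """Hypothetical keep-aware is_unique mask."""
    counts = Counter(values)
    if keep == "none":
        # current behavior: True only for values that occur exactly once
        return [counts[v] == 1 for v in values]
    if keep == "first":
        # additionally treat the first occurrence of a duplicate as unique
        seen, out = set(), []
        for v in values:
            out.append(counts[v] == 1 or v not in seen)
            seen.add(v)
        return out
    # keep == "last": mirror image of keep="first"
    return list(reversed(is_unique(list(reversed(values)), keep="first")))

print(is_unique([1, 2, 3, 1], keep="none"))   # [False, True, True, False]
print(is_unique([1, 2, 3, 1], keep="first"))  # [True, True, True, False]
print(is_unique([1, 2, 3, 1], keep="last"))   # [False, True, True, True]
```

With keep="first", filtering by this mask would yield the same rows as unique keeping the first occurrence, which is the consistency the request is after.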

3. duplicate method

df = pl.DataFrame({"a": [1, 2, 3, 2, 1]})
df.duplicate(keep="all")
> [1, 2, 2, 1]
df.duplicate(keep="first")
> [1, 2]
df.duplicate(keep="last")
> [2, 1]
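The duplicate outputs above can be modeled with a plain-Python sketch (hypothetical helper, not the actual Polars API):

```python
from collections import Counter

def duplicate(values, keep="all"):
    """Hypothetical helper returning the duplicated values."""
    counts = Counter(values)
    if keep == "all":
        # every occurrence of every duplicated value
        return [v for v in values if counts[v] > 1]
    if keep == "first":
        # only the first occurrence of each duplicated value
        seen, out = set(), []
        for v in values:
            if counts[v] > 1 and v not in seen:
                out.append(v)
            seen.add(v)
        return out
    # keep == "last": the last occurrence of each duplicated value,
    # reported in original order
    return list(reversed(duplicate(list(reversed(values)), keep="first")))

print(duplicate([1, 2, 3, 2, 1], keep="all"))    # [1, 2, 2, 1]
print(duplicate([1, 2, 3, 2, 1], keep="first"))  # [1, 2]
print(duplicate([1, 2, 3, 2, 1], keep="last"))   # [2, 1]
```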

4. is_duplicated: add keep argument

df = pl.DataFrame({"a": [1, 2, 3, 2, 1]})
df.is_duplicated()
> [True, True, False, True, True]
df.is_duplicated(keep="all")  # default
> [True, True, False, True, True]
df.is_duplicated(keep="first")
> [True, True, False, False, False]
df.is_duplicated(keep="last")
> [False, False, False, True, True]
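Note that in this proposal keep="first" marks only the first occurrence of each duplicated value as True (it selects which duplicates to mark, rather than which to exempt as pandas does). A plain-Python sketch of the masks above (hypothetical helper, not the actual Polars API):

```python
from collections import Counter

def is_duplicated(values, keep="all"):
    """Hypothetical keep-aware is_duplicated mask."""
    counts = Counter(values)
    if keep == "all":
        # current behavior: True for every occurrence of a duplicated value
        return [counts[v] > 1 for v in values]
    if keep == "first":
        # True only for the first occurrence of each duplicated value
        seen, out = set(), []
        for v in values:
            out.append(counts[v] > 1 and v not in seen)
            seen.add(v)
        return out
    # keep == "last": mirror image of keep="first"
    return list(reversed(is_duplicated(list(reversed(values)), keep="first")))

print(is_duplicated([1, 2, 3, 2, 1]))                # [True, True, False, True, True]
print(is_duplicated([1, 2, 3, 2, 1], keep="first"))  # [True, True, False, False, False]
print(is_duplicated([1, 2, 3, 2, 1], keep="last"))   # [False, False, False, True, True]
```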

5. has_duplicates on Expr.arr

df = pl.DataFrame({'a': [[1, 2, 1], [3, 4, 5], [6, 6]]})
df.filter(
    pl.col('a').arr.lengths()
    != pl.col('a').arr.unique().arr.lengths()
)
> [[1, 2, 1], [6, 6]]
df.filter(
    pl.col('a').arr.has_duplicates()
)
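The workaround compares the list length with the length after deduplication; in plain Python the proposed has_duplicates check reduces to a single length comparison against a set (hypothetical helper, not the actual Polars API):

```python
def has_duplicates(lst):
    # a list contains duplicates iff deduplicating it shrinks it
    return len(lst) != len(set(lst))

rows = [[1, 2, 1], [3, 4, 5], [6, 6]]
print([row for row in rows if has_duplicates(row)])  # [[1, 2, 1], [6, 6]]
```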

Summary: Feature Requests

  1. unique: add keep="none" (or None/False) to remove all duplicates (available in pandas drop_duplicates)
  2. is_unique: add keep argument to match unique and get consistent results
  3. duplicate: add this method to complement is_unique, unique and is_duplicated
  4. is_duplicated: add keep argument to match unique (available in pandas duplicated)
  5. Expr.arr.has_duplicates: add this to get fast/efficient boolean mask if list contains duplicates

ritchie46 commented 1 year ago

We can filter by columns that are unique: is_unique.

df = pl.DataFrame({"a": [1, 2, 3, 1]})
df.filter(pl.col("a").is_unique())

We changed the name from distinct to unique. This discussion has been had before, maybe distinct is a better name. I don't think is_unique and distinct should have the same results. Maybe we should name the method distinct to prevent this semantic confusion.

is_unique should only have one answer: whether a column value is unique, meaning there is only one of it in that set. If you want the first occurrence, you combine is_unique and is_first; that's the whole idea of expressions: reduce the API surface so that you can combine and cherry-pick the logic you need.
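The combination described here can be sketched in plain Python: OR-ing an is_unique mask with an is_first mask keeps the unique values plus the first occurrence of each duplicate (stand-alone helpers for illustration, not the Polars API):

```python
from collections import Counter

def is_unique_mask(values):
    # True only for values that occur exactly once
    counts = Counter(values)
    return [counts[v] == 1 for v in values]

def is_first_mask(values):
    # True for the first occurrence of each value
    seen, out = set(), []
    for v in values:
        out.append(v not in seen)
        seen.add(v)
    return out

values = [1, 2, 3, 1]
combined = [u or f for u, f in zip(is_unique_mask(values), is_first_mask(values))]
print(combined)  # [True, True, True, False] -- filtering by this keeps [1, 2, 3]
```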

The same logic applies to is_duplicated, it should only have a yes or no answer.

Could you make a separate issue for this one? I think we should add this.

I think we can add these options :+1:

DrMaphuse commented 1 year ago

is_unique should only have one answer, if a column value is unique

How can I replicate the subset= argument of unique() using is_first, i.e. evaluate uniqueness across multiple columns? As of now, when I pass multiple columns into is_first, it evaluates them independently of each other and returns multiple columns. I could shoehorn something together using concat_str, but that is obviously not as efficient as .unique().

I've been scratching my head about this one.

stinodego commented 1 year ago

So have I, to be honest. It's been on my to-do to figure this out. Maybe @ritchie46 can give some clarity about that specific comment?

ritchie46 commented 1 year ago

We should add is_first support to struct dtypes. Then you can simply wrap the columns in a struct and call is_first.

This is also consistent with what we tell everybody to do: if you want to evaluate logic on multiple columns -> wrap them in a struct.

I can pick this up.
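The struct idea can be sketched in plain Python by treating each row as a single composite key, with tuples standing in for struct values (the Polars-side call, is_first on a struct column, is the feature proposed here, so the helper below is purely illustrative):

```python
def is_first(rows):
    # True for the first occurrence of each composite key; tuples here
    # play the role of struct values spanning multiple columns
    seen, out = set(), []
    for row in rows:
        out.append(row not in seen)
        seen.add(row)
    return out

# rows built from two "columns", analogous to subset=["a", "b"]
rows = list(zip([1, 1, 2], ["x", "x", "y"]))
print(is_first(rows))  # [True, False, True]
```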

DrMaphuse commented 1 year ago

In my head, it would make sense to add pl.is_first(), analogous to pl.sum().

Implementing is_first for the pl.List dtype could also give us a potential solution.