pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Unique / Duplicate Enhancements #6125

Open Julian-J-S opened 1 year ago

Julian-J-S commented 1 year ago

Problem description

This is an extension to #5590 with some examples and explanations.

Problems / Issues

  1. unique: There is no way to remove ALL duplicates, as in pandas (currently it is only possible to keep the first/last row of each group of duplicates)
  2. unique / is_unique: When duplicates are present, there is no way to get the same result from unique as from filtering with is_unique, which is really awkward
  3. duplicate: There is no concise/semantic way to get the duplicate rows. duplicate should be added for completeness (is_unique and unique are available, but only is_duplicated exists)
  4. is_duplicated: There is no keep argument, as in pandas duplicated, to specify which duplicates to mark
  5. There is no fast/elegant way to get a boolean mask for a column of lists that contain duplicates. has_duplicates would be nice on Expr.arr (currently this requires a workaround comparing the unique length to the original length)

1. unique: add another keep option
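As a sketch of the proposed semantics, a stand-alone plain-Python helper (hypothetical; not the actual Polars API) shows what a keep="none" option would mean: every member of a duplicated group is dropped, as in pandas drop_duplicates(keep=False).

```python
from collections import Counter

def unique(values, keep="first"):
    """Hypothetical helper sketching the proposed `keep` options."""
    counts = Counter(values)
    if keep == "none":
        # drop ALL members of duplicated groups
        return [v for v in values if counts[v] == 1]
    if keep == "last":
        # keep the last occurrence of each value, preserving original order
        return list(reversed(unique(list(reversed(values)), keep="first")))
    # keep == "first": keep the first occurrence of each value
    seen, out = set(), []
    for v in values:
        if v not in seen:
            seen.add(v)
            out.append(v)
    return out

print(unique([1, 2, 3, 2, 1], keep="none"))   # [3]
print(unique([1, 2, 3, 2, 1], keep="first"))  # [1, 2, 3]
print(unique([1, 2, 3, 2, 1], keep="last"))   # [3, 2, 1]
```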

2. unique and is_unique: inconsistent results

df.is_unique(keep="none")  # default / current behavior
> [False, True, True, False]
df.is_unique(keep="first")
> [True, True, True, False]
df.is_unique(keep="last")
> [False, True, True, True]
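The proposed keep-aware mask can be sketched in plain Python (a hypothetical stand-alone helper, not the actual Polars API, using [1, 2, 3, 1] as illustrative data consistent with the outputs above):

```python
from collections import Counter

def is_unique(values, keep="none"):
    """Hypothetical keep-aware is_unique mask."""
    counts = Counter(values)
    if keep == "none":
        # current behavior: True only for values that occur exactly once
        return [counts[v] == 1 for v in values]
    if keep == "first":
        # additionally treat the first occurrence of a duplicate as unique
        seen, out = set(), []
        for v in values:
            out.append(counts[v] == 1 or v not in seen)
            seen.add(v)
        return out
    # keep == "last": mirror image of keep="first"
    return list(reversed(is_unique(list(reversed(values)), keep="first")))

print(is_unique([1, 2, 3, 1], keep="none"))   # [False, True, True, False]
print(is_unique([1, 2, 3, 1], keep="first"))  # [True, True, True, False]
print(is_unique([1, 2, 3, 1], keep="last"))   # [False, True, True, True]
```

With keep="first", filtering by this mask would yield the same rows as unique keeping the first occurrence, which is the consistency the request is after.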

3. duplicate method

df = pl.DataFrame({"a": [1, 2, 3, 2, 1]})
df.duplicate(keep="all")
> [1, 2, 2, 1]
df.duplicate(keep="first")
> [1, 2]
df.duplicate(keep="last")
> [2, 1]
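The duplicate outputs above can be modeled with a plain-Python sketch (hypothetical helper, not the actual Polars API):

```python
from collections import Counter

def duplicate(values, keep="all"):
    """Hypothetical helper returning the duplicated values."""
    counts = Counter(values)
    if keep == "all":
        # every occurrence of every duplicated value
        return [v for v in values if counts[v] > 1]
    if keep == "first":
        # only the first occurrence of each duplicated value
        seen, out = set(), []
        for v in values:
            if counts[v] > 1 and v not in seen:
                out.append(v)
            seen.add(v)
        return out
    # keep == "last": the last occurrence of each duplicated value,
    # reported in original order
    return list(reversed(duplicate(list(reversed(values)), keep="first")))

print(duplicate([1, 2, 3, 2, 1], keep="all"))    # [1, 2, 2, 1]
print(duplicate([1, 2, 3, 2, 1], keep="first"))  # [1, 2]
print(duplicate([1, 2, 3, 2, 1], keep="last"))   # [2, 1]
```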

4. is_duplicated: add keep argument

df = pl.DataFrame({"a": [1, 2, 3, 2, 1]})
df.is_duplicated()
> [True, True, False, True, True]
df.is_duplicated(keep="all")  # default
> [True, True, False, True, True]
df.is_duplicated(keep="first")
> [True, True, False, False, False]
df.is_duplicated(keep="last")
> [False, False, False, True, True]
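Note that in this proposal keep="first" marks only the first occurrence of each duplicated value as True (it selects which duplicates to mark, rather than which to exempt as pandas does). A plain-Python sketch of the masks above (hypothetical helper, not the actual Polars API):

```python
from collections import Counter

def is_duplicated(values, keep="all"):
    """Hypothetical keep-aware is_duplicated mask."""
    counts = Counter(values)
    if keep == "all":
        # current behavior: True for every occurrence of a duplicated value
        return [counts[v] > 1 for v in values]
    if keep == "first":
        # True only for the first occurrence of each duplicated value
        seen, out = set(), []
        for v in values:
            out.append(counts[v] > 1 and v not in seen)
            seen.add(v)
        return out
    # keep == "last": mirror image of keep="first"
    return list(reversed(is_duplicated(list(reversed(values)), keep="first")))

print(is_duplicated([1, 2, 3, 2, 1]))                # [True, True, False, True, True]
print(is_duplicated([1, 2, 3, 2, 1], keep="first"))  # [True, True, False, False, False]
print(is_duplicated([1, 2, 3, 2, 1], keep="last"))   # [False, False, False, True, True]
```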

5. has_duplicates on Expr.arr

df = pl.DataFrame({'a': [[1, 2, 1], [3, 4, 5], [6, 6]]})
df.filter(
    pl.col('a').arr.lengths()
    != pl.col('a').arr.unique().arr.lengths()
)
> [[1, 2, 1], [6, 6]]
df.filter(
    pl.col('a').arr.has_duplicates()
)
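The workaround compares the list length with the length after deduplication; in plain Python the proposed has_duplicates check reduces to a single length comparison against a set (hypothetical helper, not the actual Polars API):

```python
def has_duplicates(lst):
    # a list contains duplicates iff deduplicating it shrinks it
    return len(lst) != len(set(lst))

rows = [[1, 2, 1], [3, 4, 5], [6, 6]]
print([row for row in rows if has_duplicates(row)])  # [[1, 2, 1], [6, 6]]
```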

Summary: Feature Requests

  1. unique: add keep="none" (or None/False) to remove all duplicates (available in pandas drop_duplicates)
  2. is_unique: add keep argument to match unique and get consistent results
  3. duplicate: add this method to complement is_unique, unique and is_duplicated
  4. is_duplicated: add keep argument to match unique (available in pandas duplicated)
  5. Expr.arr.has_duplicates: add this to get fast/efficient boolean mask if list contains duplicates

ritchie46 commented 1 year ago

We can filter by columns that are unique: is_unique.

df = pl.DataFrame({"a": [1, 2, 3, 1]})
df.filter(pl.col("a").is_unique())

We changed the name from distinct to unique. This discussion has been had before, maybe distinct is a better name. I don't think is_unique and distinct should have the same results. Maybe we should name the method distinct to prevent this semantic confusion.

is_unique should only have one answer: whether a column value is unique, meaning there is only one of it in that set. If you want the first occurrence, you combine is_unique and is_first; that's the whole idea of expressions: reduce the API surface so that you can combine and cherry-pick the logic you need.
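The combination described here can be sketched in plain Python: OR-ing an is_unique mask with an is_first mask keeps the unique values plus the first occurrence of each duplicate (stand-alone helpers for illustration, not the Polars API):

```python
from collections import Counter

def is_unique_mask(values):
    # True only for values that occur exactly once
    counts = Counter(values)
    return [counts[v] == 1 for v in values]

def is_first_mask(values):
    # True for the first occurrence of each value
    seen, out = set(), []
    for v in values:
        out.append(v not in seen)
        seen.add(v)
    return out

values = [1, 2, 3, 1]
combined = [u or f for u, f in zip(is_unique_mask(values), is_first_mask(values))]
print(combined)  # [True, True, True, False] -- filtering by this keeps [1, 2, 3]
```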

The same logic applies to is_duplicated, it should only have a yes or no answer.

Could you make a separate issue for this one? I think we should add this.

I think we can add these options :+1:

DrMaphuse commented 1 year ago

is_unique should only have one answer, if a column value is unique

How can I replicate the subset= argument of unique() using is_first, i.e. evaluate uniqueness across multiple columns? As of now, when I pass multiple columns into is_first, it evaluates them independently of each other and returns multiple columns. I could shoehorn something together using concat_str, but that is obviously not as efficient as .unique().

I've been scratching my head about this one.

stinodego commented 1 year ago

So have I, to be honest. It's been on my to-do to figure this out. Maybe @ritchie46 can give some clarity about that specific comment?

ritchie46 commented 1 year ago

We should add is_first support to struct dtypes. Then you can simply wrap the columns in a struct and call is_first.

This is also consistent with what we tell everybody to do: if you want to evaluate logic on multiple columns -> wrap them in a struct.

I can pick this up.
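The struct idea can be sketched in plain Python by treating each row as a single composite key, with tuples standing in for struct values (the Polars-side call, is_first on a struct column, is the feature proposed here, so the helper below is purely illustrative):

```python
def is_first(rows):
    # True for the first occurrence of each composite key; tuples here
    # play the role of struct values spanning multiple columns
    seen, out = set(), []
    for row in rows:
        out.append(row not in seen)
        seen.add(row)
    return out

# rows built from two "columns", analogous to subset=["a", "b"]
rows = list(zip([1, 1, 2], ["x", "x", "y"]))
print(is_first(rows))  # [True, False, True]
```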

DrMaphuse commented 1 year ago

In my head, it would make sense to add pl.is_first(), analogous to pl.sum().

Implementing is_first for the pl.List dtype could also give us a potential solution.