Open Julian-J-S opened 1 year ago
We can filter by columns that are unique: `is_unique`.

```python
df = pl.DataFrame({"a": [1, 2, 3, 1]})
df.filter(pl.col("a").is_unique())
```
We changed the name from `distinct` to `unique`. This discussion has been had before; maybe `distinct` is a better name. I don't think `is_unique` and `distinct` should have the same results. Maybe we should name the method `distinct` to prevent this semantic confusion.
`is_unique` should only have one answer: whether a column value is unique, meaning there is only one of it in that set. If you want the first occurrence, you combine `is_unique` and `is_first`; that's the whole idea of expressions: reduce the API surface so that you can combine and cherry-pick the logic you need.
The same logic applies to `is_duplicated`: it should only have a yes or no answer.
Could you make a separate issue for this one? I think we should add this.
I think we can add these options :+1:

> `is_unique` should only have one answer, if a column value is unique
How can I replicate the `subset=` argument of `unique()` using `is_first`, i.e. evaluate uniqueness across multiple columns? As of now, when I pass multiple columns into `is_first`, it evaluates them independently from each other and returns multiple columns. I could shoehorn something together using `concat_str`, but that is obviously not as efficient as `.unique()`.
I've been scratching my head about this one.
So have I, to be honest. It's been on my to-do to figure this out. Maybe @ritchie46 can give some clarity about that specific comment?
We should add `is_first` support to `struct` dtypes. Then you can simply wrap the columns in a `struct` and call `is_first`.

This is also consistent with what we tell everybody to do: if you want to evaluate logic over multiple columns, wrap them in a struct.
I can pick this up.
In my head, it would make sense to add `pl.is_first()`, analogous to `pl.sum()`.

Implementing `is_first` for the `pl.List` dtype could also give us a potential solution.
Problem description
This is an extension to #5590 with some examples and explanations.
Problems / Issues
- `unique`: There is no way to remove ALL duplicates like in pandas (currently it is only possible to keep the `first`/`last` of duplicated rows)
- `unique`/`is_unique`: There is no way to get the same result using `unique` and filtering with `is_unique` if duplicates are present, which is really awkward
- `duplicate`: There is no concise/semantic way to get the duplicated rows. `duplicate` should be added for completeness (`is_unique` + `unique` are available, but only `is_duplicated`)
- `is_duplicated`: There is no `keep` argument like in pandas `duplicated` to specify which duplicates to mark
- `has_duplicates` would be nice on `Expr.arr` (currently using a workaround comparing the `unique` length with the original length)

1. `unique`: add another `keep` option

`unique` cannot remove ALL duplicates from the original data; a `keep=False` option (as in pandas) would enable this.
2. `unique` and `is_unique`: inconsistent results

In polars, `unique` and `is_unique` NEVER return the same result if duplicates are present, which feels awkward. `is_unique` should also have the `keep` argument (`first`/`last`/`none`):

- `keep="none"` would be the current behavior of `is_unique`
- `keep="first"/"last"` is like reading a book and marking the first/last time you see a word

3. `duplicate` method

With `is_unique` + `unique` + `is_duplicated` available, it feels like `duplicate` is missing. `duplicate` should have the same `subset` and `keep` arguments as `unique`.
4. `is_duplicated`: add `keep` argument

`is_duplicated` is missing the `keep` argument.

5. `has_duplicates` on `Expr.arr`

`has_duplicates` would be nice to have on `Expr.arr`.
Summary: Feature Requests
- `unique`: add `keep="none"` (or `None`/`False`) to remove all duplicates (available in pandas `drop_duplicates`)
- `is_unique`: add `keep` argument to match `unique` and get consistent results
- `duplicate`: add this method to complement `is_unique`, `unique` and `is_duplicated`
- `is_duplicated`: add `keep` argument to match `unique` (available in pandas `duplicated`)
- `Expr.arr.has_duplicates`: add this to get a fast/efficient boolean mask indicating whether a list contains duplicates
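For reference, the pandas behavior these requests point at:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 1])

# keep="first" marks later occurrences; keep=False marks ALL duplicates
mask_first = s.duplicated(keep="first").tolist()  # [False, False, False, True]
mask_none = s.duplicated(keep=False).tolist()     # [True, False, False, True]

# drop_duplicates(keep=False) removes every duplicated value outright
dropped = s.drop_duplicates(keep=False).tolist()  # [2, 3]
```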