pytoolz / toolz

A functional standard library for Python.
http://toolz.readthedocs.org/

Add nonunique. #506

Open groutr opened 3 years ago

groutr commented 3 years ago

itertoolz.unique yields the never-before-seen elements of a sequence. nonunique is the complement: it yields the elements of a sequence that have already been seen.

This is incredibly useful for finding duplicates in a sequence.

>>> tuple(nonunique([1, 2, 3, 4, 5, 1, 2, 3]))
(1, 2, 3)

This isn't really a new feature in itertoolz; it exposes logic that already exists. isdistinct already tracks the elements it has seen, but it only returns True/False. nonunique instead yields the already-seen elements as they are encountered. This PR simply moves that logic into its own function.
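To make the relationship concrete, here is a minimal sketch of the idea, not the PR's actual code. The `key` parameter is an assumption modelled on the signature of `unique`, and `isdistinct_sketch` is a hypothetical name used only to show how a True/False check could be built on top of the generator.

```python
def nonunique(seq, key=None):
    """Yield each element of seq that has already been seen earlier."""
    seen = set()
    for item in seq:
        # Compare on key(item) when a key function is given (assumed API).
        val = item if key is None else key(item)
        if val in seen:
            yield item
        else:
            seen.add(val)


def isdistinct_sketch(seq):
    """Return True if seq has no repeats, stopping at the first repeat."""
    for _ in nonunique(seq):
        return False
    return True


print(tuple(nonunique([1, 2, 3, 4, 5, 1, 2, 3])))  # (1, 2, 3)
print(isdistinct_sketch("abc"))                    # True
```

Note that in this sketch an element repeated n times is yielded n - 1 times, once for every occurrence after the first.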

ping: @eriknw

groutr commented 3 years ago

@eriknw Can I get your thoughts on this?

eriknw commented 2 years ago

Thanks @groutr! Everything here looks reasonable and good. I'm curious: do you have a use case for this?

And sorry for my delay. This year has been, uh, a little crazy.

groutr commented 2 years ago

I'm sure I had a better use case when I created the PR, but I can't recall it now.

One use case that currently comes to mind: when I ask "is this distinct?", I often really mean "why isn't this distinct?". If isdistinct returns False, it is natural to wonder what the duplicated elements are. Pandas has duplicated, and with this PR toolz could answer the same question.
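A rough illustration of that "why isn't this distinct?" use case, under stated assumptions: `nonunique` below is a local stand-in for the proposed function (it is not imported from toolz), and `check_distinct` is a hypothetical validation helper invented for this example.

```python
def nonunique(seq):
    """Local stand-in for the proposed function: yield already-seen items."""
    seen = set()
    for item in seq:
        if item in seen:
            yield item
        else:
            seen.add(item)


def check_distinct(ids):
    """Return ids unchanged, or raise an error that names the duplicates."""
    dupes = list(nonunique(ids))
    if dupes:
        raise ValueError(f"duplicate ids: {dupes}")
    return ids


check_distinct([1, 2, 3])  # passes silently
try:
    check_distinct(["a", "b", "a"])
except ValueError as exc:
    print(exc)  # duplicate ids: ['a']
```

A plain True/False check would only say that validation failed; collecting the repeats lets the error message say which values caused it.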

eriknw commented 2 years ago

Yeah, that sounds reasonable.

groutr commented 2 years ago

@eriknw Which name do you find easier to remember: toolz.duplicated (toolz.duplicates?) or toolz.nonunique?

groutr commented 2 years ago

I think I prefer the name nonunique, since we don't produce a boolean mask the way pandas' duplicated method does.
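The distinction can be sketched in plain Python. `duplicated_mask` is a hypothetical helper that mimics the kind of mask pandas' duplicated returns with its default keep='first' (one boolean per element); it is not part of toolz or pandas.

```python
def duplicated_mask(seq):
    """Return one boolean per element: True if it appeared earlier in seq."""
    seen = set()
    mask = []
    for item in seq:
        mask.append(item in seen)
        seen.add(item)
    return mask


data = [1, 2, 3, 4, 5, 1, 2, 3]
mask = duplicated_mask(data)
# Applying the mask recovers the values that a nonunique-style generator
# would yield directly.
values = [x for x, dup in zip(data, mask) if dup]

print(mask)    # [False, False, False, False, False, True, True, True]
print(values)  # [1, 2, 3]
```

A mask has the same length as its input and needs a second step to get at the values, whereas the proposed function yields the repeated values themselves.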

groutr commented 2 years ago

I think this is ready. What do you think @eriknw?