pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

df.duplicated and drop_duplicates raise TypeError with unhashable values. #12693

Open Abrosimov-a-a opened 8 years ago

Abrosimov-a-a commented 8 years ago

IN:

import pandas as pd
df = pd.DataFrame([[{'a', 'b'}], [{'b','c'}], [{'b', 'a'}]])
df

OUT:

    0
0   {a, b}
1   {c, b}
2   {a, b}

IN:

df.duplicated()

OUT:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-77-7cc63ba1ed41> in <module>()
----> 1 df.duplicated()

venv/lib/python3.5/site-packages/pandas/util/decorators.py in wrapper(*args, **kwargs)
     89                 else:
     90                     kwargs[new_arg_name] = new_arg_value
---> 91             return func(*args, **kwargs)
     92         return wrapper
     93     return _deprecate_kwarg

venv/lib/python3.5/site-packages/pandas/core/frame.py in duplicated(self, subset, keep)
   3100 
   3101         vals = (self[col].values for col in subset)
-> 3102         labels, shape = map(list, zip(*map(f, vals)))
   3103 
   3104         ids = get_group_index(labels, shape, sort=False, xnull=False)

TypeError: type object argument after * must be a sequence, not map

I expect:

0    False
1    False
2     True
dtype: bool
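On pandas versions where this raises, a common workaround (a sketch, not a pandas feature) is to convert the sets to frozensets, which are hashable and compare equal whenever the original sets do, before calling duplicated:

```python
import pandas as pd

df = pd.DataFrame([[{'a', 'b'}], [{'b', 'c'}], [{'b', 'a'}]])

# frozenset is hashable, and equal sets map to equal frozensets,
# so hash-based duplicate detection works on the converted column
hashable = df[0].map(frozenset)
print(hashable.duplicated())
# 0    False
# 1    False
# 2     True
# dtype: bool
```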

pd.show_versions() output:

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.3.0-1-amd64
machine: x86_64
processor: 
byteorder: little
LC_ALL: None
LANG: ru_RU.UTF-8

pandas: 0.18.0
nose: None
pip: 1.5.6
setuptools: 18.8
Cython: None
numpy: 1.10.4
scipy: None
statsmodels: None
xarray: None
IPython: 4.1.2
sphinx: None
patsy: None
dateutil: 2.5.1
pytz: 2016.1
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: None
jreback commented 8 years ago

I guess you are using a list-like value INSIDE a cell of a frame. This is quite inefficient and not generally supported. Pull requests to fix this are accepted in any event.

kokes commented 5 years ago

Current pandas gives a slightly different TypeError (TypeError: unhashable type: 'set'), which does get to the point - how would you deduplicate sets or lists? Unlike tuples and primitive types, these are not hashable (sets could be converted to frozensets, which are hashable), so you have to come up with a deduplication strategy.

In any case, since you're dealing with an object dtype, there is no guarantee that the next row won't contain a set or a list, so the deduplication problem can only get worse from row to row. So pandas treats each value as a separate one and processes them as long as they are hashable. Just try a column with three tuples - it will work; then change the last one to a set and it will fail on that very value.

So, I'm not sure there's a solid implementation that would work here given the lack of hashability in lists, there could potentially be a fix for sets, which would be converted to frozensets upon hash map insertion, but that does seem hacky and arbitrary.
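The tuple case described above can be checked directly; a column of tuples is hashable, so hash-based deduplication works as usual (on the pandas version current at the time of this comment, swapping the last tuple for a set made the same call raise TypeError, though later versions handle the one-column case, as noted below in the thread):

```python
import pandas as pd

# tuples are hashable, so duplicated() factorizes them normally
df = pd.DataFrame({0: [("a", "b"), ("b",), ("a", "b")]})
print(df.duplicated().tolist())  # [False, False, True]
```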

itamar-precog commented 4 years ago

How about ignoring unhashable columns for the purposes of dropping duplicates? Like adding a kwarg 'unhashable_type' whose default is 'raise' (which works as current), but can be set to 'ignore' (at the risk of dropping rows which aren't entirely duplicated).
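Pending such a kwarg (the 'unhashable_type' name above is a proposal, not a pandas API), the 'ignore' behaviour can be approximated today at the user level by restricting subset to the columns whose values all hash - a sketch:

```python
import pandas as pd

def duplicated_ignore_unhashable(df: pd.DataFrame) -> pd.Series:
    """Mark duplicate rows using only the columns whose values are all
    hashable. Rows that differ only in unhashable columns may be flagged
    as duplicates -- the trade-off the 'ignore' proposal accepts.
    Edge case not handled here: a frame with no hashable columns at all.
    """
    def col_is_hashable(col):
        try:
            set(df[col])  # hashing every value is the real test
            return True
        except TypeError:
            return False

    subset = [c for c in df.columns if col_is_hashable(c)]
    return df.duplicated(subset=subset)

df = pd.DataFrame({"a": [1, 1, 2], "b": [{"x"}, {"y"}, {"x"}]})
print(duplicated_ignore_unhashable(df))  # only column "a" is considered
```

Here rows 0 and 1 are reported as duplicates even though their sets differ, which is exactly the risk the proposal acknowledges.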

simonjayhawkins commented 2 years ago

The case in the OP is fixed on main

print(pd.__version__)
df = pd.DataFrame([[{"a", "b"}], [{"b", "c"}], [{"b", "a"}]])
print(df.duplicated())
print(df.drop_duplicates())
1.5.0.dev0+867.gdf8acf4201
0    False
1    False
2     True
dtype: bool
        0
0  {a, b}
1  {b, c}

and for lists too

df = pd.DataFrame([[["a", "b"]], [["b"]], [["a", "b"]]])
print(df.duplicated())
print(df.drop_duplicates())
0    False
1    False
2     True
dtype: bool
        0
0  [a, b]
1     [b]

fixed in commit: [235113e67065320b3ec0176421d5c397d30ad886] PERF: Improve performance for df.duplicated with one column subset (#45534)

but will still fail for multi-column DataFrame

print(pd.__version__)
df = pd.DataFrame([[{"a", "b"}], [{"b", "c"}], [{"b", "a"}]]).T
print(df.duplicated())
TypeError: unhashable type: 'set'
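Until the multi-column path handles this, one workaround (a sketch, not a pandas feature) is to substitute a hashable proxy for each unhashable cell before calling duplicated:

```python
import pandas as pd

def as_hashable(value):
    # frozensets compare equal whenever the original sets do, and
    # tuples whenever the original lists do, so duplicate detection
    # on the proxy frame matches the intended semantics
    if isinstance(value, set):
        return frozenset(value)
    if isinstance(value, list):
        return tuple(value)
    return value

df = pd.DataFrame({"a": [{"x"}, {"y"}, {"x"}], "b": [1, 2, 1]})
proxy = df.apply(lambda col: col.map(as_hashable))
print(proxy.duplicated().tolist())  # [False, False, True]
```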
MichaelTiemannOSC commented 1 year ago

I have a test case that also throws this error, when trying to use uncertainties in anything other than a Series (or one-column DataFrame):

import pandas as pd
import uncertainties as un
import pint
from pint import Quantity as Q_
import pint_pandas

x = pd.DataFrame({'a': [un.ufloat(1.0, 0.0), un.ufloat(1.0, 0.0)]})

if len(x) == len(x.drop_duplicates())+1:
    print("simple comparison of ufloats, works")
else:
    print("simple comparison of ufloats failed")
    assert False

x = pd.DataFrame({'a': [un.ufloat(1.0, 0.0)*2+1, un.ufloat(1.0, 0.0)*2+1]})

if len(x) == len(x.drop_duplicates())+1:
    print("comparison of Affine Scalar values (simple or with quantity meters) works")
else:
    print("comparison of Affine Scalar values (simple or with quantity meters) failed")
    assert False

x = pd.DataFrame({'a': [Q_(un.ufloat(1.0, 0.0), 'm'), Q_(un.ufloat(1.0, 0.0), 'm')]})

if not x.compare(x.drop_duplicates()).empty:
    print("simple comparison of ufloat meters works")
else:
    print("simple comparison of ufloat meters, failed")

x = pd.DataFrame({'a': [un.ufloat(1.0, 0.0)*2+1, un.ufloat(1.0, 0.0)],
                  'b': [un.ufloat(2.0, 0.0)*2+1, un.ufloat(2.0, 0.0)]})

if not x.compare(x.drop_duplicates()).empty:
    print("comparison of Affine Scalar values (multi-column) works")
else:
    print("comparison of Affine Scalar values (multi-column) failed")

Not only does the third case fail (using a combination of uncertainties and quantities), but the fourth case fails with the aforementioned TypeError:

Traceback (most recent call last):
  File "pandas-dropdups.py", line 33, in <module>
    if not x.compare(x.drop_duplicates()).empty:
  File "python3.9/site-packages/pandas/util/_decorators.py", line 331, in wrapper
    return func(*args, **kwargs)
  File "python3.9/site-packages/pandas/core/frame.py", line 6669, in drop_duplicates
    duplicated = self.duplicated(subset, keep=keep)
  File "python3.9/site-packages/pandas/core/frame.py", line 6811, in duplicated
    labels, shape = map(list, zip(*map(f, vals)))
  File "python3.9/site-packages/pandas/core/frame.py", line 6779, in f
    labels, shape = algorithms.factorize(vals, size_hint=len(self))
  File "python3.9/site-packages/pandas/core/algorithms.py", line 818, in factorize
    codes, uniques = factorize_array(
  File "python3.9/site-packages/pandas/core/algorithms.py", line 574, in factorize_array
    uniques, codes = table.factorize(
  File "pandas/_libs/hashtable_class_helper.pxi", line 5943, in pandas._libs.hashtable.PyObjectHashTable.factorize
  File "pandas/_libs/hashtable_class_helper.pxi", line 5857, in pandas._libs.hashtable.PyObjectHashTable._unique
TypeError: unhashable type: 'AffineScalarFunc'

AffineScalarFunc is a synonym for UFloat from the uncertainties package. It results from a ufloat(nominal_value, error_value) having math done to it, which makes it an AffineScalarFunc and no longer simply a ufloat.