Abrosimov-a-a opened this issue 8 years ago
I guess you are using a list-like value INSIDE a cell of a DataFrame. This is quite inefficient and not generally supported. Pull requests to fix this are accepted in any event.
Current pandas gives a slightly different error (`TypeError: unhashable type: 'set'`), which does get to the point: how would you deduplicate sets or lists? Unlike tuples and primitive types, these are not hashable (sets could be converted to frozensets, which are hashable), so you have to come up with a deduplication strategy.
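For sets specifically, one such strategy can be sketched outside of pandas: convert each cell to a frozenset before deduplicating, since frozensets are hashable and compare order-insensitively. This is an illustrative workaround, not anything pandas does internally.

```python
import pandas as pd

df = pd.DataFrame([[{"a", "b"}], [{"b", "c"}], [{"b", "a"}]])

# Map each set to a hashable frozenset; {"a", "b"} and {"b", "a"}
# collapse to the same key, so the third row is dropped.
deduped = df[0].map(frozenset).drop_duplicates()
print(len(deduped))  # 2
```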
In any case, since you're dealing with an object dtype, there is no guarantee that the next row won't contain a set or a list, so deduplication only gets harder from there. So pandas treats each value separately and processes them as long as they are hashable. Try a column with three tuples: it will work; then change the last one to a set and it will fail on that very value.
So I'm not sure there's a solid implementation that would work here, given that lists aren't hashable. There could potentially be a fix for sets, which would be converted to frozensets upon hash-map insertion, but that does seem hacky and arbitrary.
How about ignoring unhashable columns for the purposes of dropping duplicates? For example, adding a kwarg `unhashable_type` whose default is `'raise'` (the current behavior) but which can be set to `'ignore'` (at the risk of dropping rows that aren't entirely duplicated).
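That `'ignore'` behavior can be approximated today with a small helper. This is a sketch with a hypothetical name, not a pandas API: it restricts the dedup subset to columns whose values are all hashable.

```python
import pandas as pd

def drop_duplicates_ignoring_unhashable(df):
    """Hypothetical helper approximating unhashable_type='ignore':
    deduplicate using only the columns whose values are all hashable."""
    def all_hashable(col):
        try:
            for value in col:
                hash(value)
            return True
        except TypeError:
            return False

    subset = [c for c in df.columns if all_hashable(df[c])]
    if not subset:  # no usable columns; leave the frame unchanged
        return df
    return df.drop_duplicates(subset=subset)

df = pd.DataFrame({"key": [1, 1, 2], "tags": [{"a"}, {"a"}, {"b"}]})
# Deduplicates on "key" only; as warned above, this can drop rows
# that are not entirely duplicated.
result = drop_duplicates_ignoring_unhashable(df)
```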
The case in the OP is fixed on main
```python
print(pd.__version__)
df = pd.DataFrame([[{"a", "b"}], [{"b", "c"}], [{"b", "a"}]])
print(df.duplicated())
print(df.drop_duplicates())
```

```
1.5.0.dev0+867.gdf8acf4201
0    False
1    False
2     True
dtype: bool
        0
0  {a, b}
1  {b, c}
```
and for lists too:

```python
df = pd.DataFrame([[["a", "b"]], [["b"]], [["a", "b"]]])
print(df.duplicated())
print(df.drop_duplicates())
```

```
0    False
1    False
2     True
dtype: bool
        0
0  [a, b]
1     [b]
```
Fixed in commit 235113e67065320b3ec0176421d5c397d30ad886 (PERF: Improve performance for df.duplicated with one column subset, #45534), but it will still fail for a multi-column DataFrame:
```python
print(pd.__version__)
df = pd.DataFrame([[{"a", "b"}], [{"b", "c"}], [{"b", "a"}]]).T
print(df.duplicated())
```

```
TypeError: unhashable type: 'set'
```
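Until that's addressed, one possible workaround (a sketch, not an official fix) is to replace unhashable cells with hashable equivalents before calling `duplicated()` on the multi-column frame:

```python
import pandas as pd

def to_hashable(value):
    # Convert the common unhashable containers to hashable counterparts;
    # everything else passes through untouched.
    if isinstance(value, set):
        return frozenset(value)
    if isinstance(value, list):
        return tuple(value)
    return value

df = pd.DataFrame([[{"a", "b"}, 1], [{"b", "a"}, 1], [{"b", "c"}, 2]])
# applymap converts cell-by-cell, so the multi-column duplicated()
# path only ever sees hashable values.
mask = df.applymap(to_hashable).duplicated()
deduped = df[~mask]
```

Note that `tuple(value)` keeps list order significant, which matches how lists compare for equality anyway.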
I have a test case that also throws this error when trying to use uncertainties in anything other than a Series (or a one-column DataFrame):
```python
import pandas as pd
import uncertainties as un
import pint
from pint import Quantity as Q_
import pint_pandas

x = pd.DataFrame({'a': [un.ufloat(1.0, 0.0), un.ufloat(1.0, 0.0)]})
if len(x) == len(x.drop_duplicates()) + 1:
    print("simple comparison of ufloats works")
else:
    print("simple comparison of ufloats failed")
    assert False

x = pd.DataFrame({'a': [un.ufloat(1.0, 0.0)*2 + 1, un.ufloat(1.0, 0.0)*2 + 1]})
if len(x) == len(x.drop_duplicates()) + 1:
    print("comparison of AffineScalarFunc values (simple or with quantity meters) works")
else:
    print("comparison of AffineScalarFunc values (simple or with quantity meters) failed")
    assert False

x = pd.DataFrame({'a': [Q_(un.ufloat(1.0, 0.0), 'm'), Q_(un.ufloat(1.0, 0.0), 'm')]})
if not x.compare(x.drop_duplicates()).empty:
    print("simple comparison of ufloat meters works")
else:
    print("simple comparison of ufloat meters failed")

x = pd.DataFrame({'a': [un.ufloat(1.0, 0.0)*2 + 1, un.ufloat(1.0, 0.0)],
                  'b': [un.ufloat(2.0, 0.0)*2 + 1, un.ufloat(2.0, 0.0)]})
if not x.compare(x.drop_duplicates()).empty:
    print("comparison of AffineScalarFunc values (multi-column) works")
else:
    print("comparison of AffineScalarFunc values (multi-column) failed")
```
Not only does the third case fail (using a combination of uncertainties and quantities), but the fourth case fails with the aforementioned TypeError:
```
Traceback (most recent call last):
  File "pandas-dropdups.py", line 33, in <module>
    if not x.compare(x.drop_duplicates()).empty:
  File "python3.9/site-packages/pandas/util/_decorators.py", line 331, in wrapper
    return func(*args, **kwargs)
  File "python3.9/site-packages/pandas/core/frame.py", line 6669, in drop_duplicates
    duplicated = self.duplicated(subset, keep=keep)
  File "python3.9/site-packages/pandas/core/frame.py", line 6811, in duplicated
    labels, shape = map(list, zip(*map(f, vals)))
  File "python3.9/site-packages/pandas/core/frame.py", line 6779, in f
    labels, shape = algorithms.factorize(vals, size_hint=len(self))
  File "python3.9/site-packages/pandas/core/algorithms.py", line 818, in factorize
    codes, uniques = factorize_array(
  File "python3.9/site-packages/pandas/core/algorithms.py", line 574, in factorize_array
    uniques, codes = table.factorize(
  File "pandas/_libs/hashtable_class_helper.pxi", line 5943, in pandas._libs.hashtable.PyObjectHashTable.factorize
  File "pandas/_libs/hashtable_class_helper.pxi", line 5857, in pandas._libs.hashtable.PyObjectHashTable._unique
TypeError: unhashable type: 'AffineScalarFunc'
```
AffineScalarFunc is a synonym for UFloat from the uncertainties package. It results from doing math on a `ufloat(nominal_value, error_value)`, which makes the value affine and no longer simply a ufloat.
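A key-based workaround similar to the set/list one applies here too: deduplicate on a hashable key derived from each value, such as its nominal value and error. The sketch below uses a stand-in class instead of the real AffineScalarFunc so it runs without the uncertainties package installed; the class name, helper name, and attribute names are illustrative assumptions.

```python
import pandas as pd

class Measured:
    """Stand-in for AffineScalarFunc: defines __eq__ but sets __hash__
    to None, making instances unhashable just like the real class."""
    def __init__(self, nominal, err):
        self.nominal, self.err = nominal, err
    def __eq__(self, other):
        return (self.nominal, self.err) == (other.nominal, other.err)
    __hash__ = None

def drop_duplicates_by_key(df, key):
    # Hypothetical helper: hash a derived key per cell instead of the
    # unhashable value itself, then keep the first occurrence.
    return df[~df.applymap(key).duplicated()]

df = pd.DataFrame({"a": [Measured(1.0, 0.0), Measured(1.0, 0.0)]})
result = drop_duplicates_by_key(df, lambda m: (m.nominal, m.err))
```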