pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

Unstable hashtable / duplicated algo for object dtype #27035

Open jorisvandenbossche opened 5 years ago

jorisvandenbossche commented 5 years ago

From a flaky test in geopandas, I observed the following behaviour:

```
In [1]: pd.__version__
Out[1]: '0.25.0.dev0+791.gf0919f272'

In [2]: from shapely.geometry import Point

In [3]: a = np.array([Point(1, 1), Point(1, 1)], dtype=object)

In [4]: pd.Series(a).duplicated()
Out[4]:
0    False
1     True
dtype: bool

In [6]: print(pd.Series(a).duplicated())
   ...: print(pd.Series(a).duplicated())
0    False
1     True
dtype: bool
0    False
1    False
dtype: bool
```

So, as you can see, sometimes the duplicate is detected correctly and sometimes it is not.

I am also not fully sure how the object hashtable works (assuming duplicated uses the hashtable), as the shapely Point objects are not hashable:

```
In [9]: pd.Series(a).unique()
...
TypeError: unhashable type: 'Point'
```
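
Note (illustrative aside, not part of the original report): `Point` is presumably unhashable because it defines `__eq__` without `__hash__`, which makes Python set `__hash__` to `None`. A minimal stand-in class, not using shapely, reproduces the same `TypeError`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for shapely's Point: defining __eq__ without __hash__
# makes Python set __hash__ to None, so instances are unhashable.
class UnhashablePoint:
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __eq__(self, other):
        return (self.x, self.y) == (other.x, other.y)


a = np.array([UnhashablePoint(1, 1), UnhashablePoint(1, 1)], dtype=object)
pd.Series(a).unique()  # TypeError: unhashable type: 'UnhashablePoint'
```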
jorisvandenbossche commented 5 years ago

In the hashtable class methods, we check that the object is hashable (before doing a kh_put_pymap), e.g. in PyObjectHashTable.unique:

https://github.com/pandas-dev/pandas/blob/f0919f272d9614058b5ebb5e0664d1ac6f23540f/pandas/_libs/hashtable_class_helper.pxi.in#L1019-L1034

while in the function implementations (such as duplicated), we don't do that check, e.g.:

https://github.com/pandas-dev/pandas/blob/f0919f272d9614058b5ebb5e0664d1ac6f23540f/pandas/_libs/hashtable_func_helper.pxi.in#L153-L156

The "pymap" hashtable needs the values to be hashable, but doesn't actually check that? Should we add a hash(val) in those functions as well?

cc @jreback

jreback commented 5 years ago

@jorisvandenbossche yeah, you could try that here, but this may introduce a performance issue

jorisvandenbossche commented 5 years ago

From a quick test, adding a hash(val) call inside the for loop that fills the hash table gives a roughly 20% slowdown (45ms -> 55ms for a string Series of 3 million elements).

So that's a considerable slowdown.
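
(For reference, a measurement along these lines can be reproduced with something like the snippet below; the Series size matches the numbers above, but the benchmark code itself is an illustration rather than the one actually used.)

```python
import timeit

import numpy as np
import pandas as pd

# Time Series.duplicated() on a 3-million-element object-dtype string Series,
# roughly matching the setup described above.
s = pd.Series(
    np.random.choice(["aa", "bb", "cc", "dd"], size=3_000_000).astype(object)
)

n = 10
elapsed = timeit.timeit(s.duplicated, number=n)
print(f"duplicated(): {elapsed / n * 1000:.1f} ms per call")
```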

A "proper" fix might be to check in the actual khash C code for a return value of -1 of PyObject_Hash, but I would rather not start meddling with that implementation.

An alternative could be a "best effort" check that hashes only the first element. If you have a full object Series of unhashable objects, that would at least be caught. But it is not ideal that whether the error is raised depends on the order of the values (e.g. it would be missed if the first location holds a missing value).
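
(To illustrate the order dependence, using a hypothetical check_first_hashable helper, not actual pandas code:)

```python
import numpy as np


def check_first_hashable(values):
    # Hypothetical "best effort" check: only the first element is hashed.
    if len(values):
        hash(values[0])


unhashable = {"not": "hashable"}  # dicts are unhashable

try:
    check_first_hashable(np.array([unhashable, "x"], dtype=object))
except TypeError as err:
    print("caught:", err)  # caught: unhashable type: 'dict'

# With a missing value in the first location, the unhashable object slips
# through unchecked, so no error is raised.
check_first_hashable(np.array([np.nan, unhashable], dtype=object))
```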

simonjayhawkins commented 2 years ago

> as the shapely Point objects are not hashable:

xref #12693, where other unhashable objects raise (which is probably the correct behavior with the current implementation, or perhaps they should raise a NotImplementedError instead, and #12693 should maybe be treated as an enhancement request).

however, the code in the OP is no longer unstable AFAICT

simonjayhawkins commented 2 years ago

> xref #12693, where other unhashable objects raise

indeed it does raise with a DataFrame with more than one column:

```python
pd.Series(a).to_frame().assign(dup=lambda x: x[0]).duplicated()
```

```
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [18], in <cell line: 1>()
----> 1 pd.Series(a).to_frame().assign(dup=lambda x: x[0]).duplicated()

File ~/pandas/pandas/core/frame.py:6527, in DataFrame.duplicated(self, subset, keep)
   6525 else:
   6526     vals = (col.values for name, col in self.items() if name in subset)
-> 6527 labels, shape = map(list, zip(*map(f, vals)))
   6529 ids = get_group_index(
   6530     labels,
   6531     # error: Argument 1 to "tuple" has incompatible type "List[_T]";
   (...)
   6535     xnull=False,
   6536 )
   6537 result = self._constructor_sliced(duplicated(ids, keep), index=self.index)

File ~/pandas/pandas/core/frame.py:6495, in DataFrame.duplicated.<locals>.f(vals)
   6494 def f(vals) -> tuple[np.ndarray, int]:
-> 6495     labels, shape = algorithms.factorize(vals, size_hint=len(self))
   6496     return labels.astype("i8", copy=False), len(shape)

File ~/pandas/pandas/core/algorithms.py:738, in factorize(values, sort, na_sentinel, size_hint)
    736 else:
    737     values = np.asarray(values)  # convert DTA/TDA/MultiIndex
--> 738     codes, uniques = factorize_array(
    739         values, na_sentinel=na_sentinel, size_hint=size_hint
    740     )
    742 if sort and len(uniques) > 0:
    743     uniques, codes = safe_sort(
    744         uniques, codes, na_sentinel=na_sentinel, assume_unique=True, verify=False
    745     )

File ~/pandas/pandas/core/algorithms.py:547, in factorize_array(values, na_sentinel, size_hint, na_value, mask)
    544 hash_klass, values = _get_hashtable_algo(values)
    546 table = hash_klass(size_hint or len(values))
--> 547 uniques, codes = table.factorize(
    548     values, na_sentinel=na_sentinel, na_value=na_value, mask=mask
    549 )
    551 # re-cast e.g. i8->dt64/td64, uint8->bool
    552 uniques = _reconstruct_data(uniques, original.dtype, original)

File ~/pandas/pandas/_libs/hashtable_class_helper.pxi:5303, in pandas._libs.hashtable.PyObjectHashTable.factorize()

File ~/pandas/pandas/_libs/hashtable_class_helper.pxi:5219, in pandas._libs.hashtable.PyObjectHashTable._unique()

TypeError: unhashable type: 'Point'
```
simonjayhawkins commented 2 years ago

> however, the code in the OP is no longer unstable AFAICT

looks like this was fixed sometime after 1.2.5