pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.42k stars 17.84k forks source link

ENH: allow EAs to override MergeOperation._get_join_indexers #53696

Open jbrockmendel opened 1 year ago

jbrockmendel commented 1 year ago

AsOfMerge._get_join_indexers calls to_numpy() on EAs, which can be costly. _MergeOperation._get_join_indexers is a bit more forgiving, using _values_for_factorize in _factorize_keys. The latter still requires a cast to numpy, which makes this a non-starter for (hypothetical) distributed/GPU EAs.

Can we push this up into an EA method that can be potentially overridden? This might be a use case for @mroeschke 's "ExtensionManager" idea.

MichaelTiemannOSC commented 1 year ago

As an FYI, my EA (which extends PintArray to support uncertain magnitudes, https://github.com/hgrecco/pint-pandas/pull/140) is getting hung up on a mismatch between what I understood to be the interface to _values_from_factorize and what rizer._value_from_factorize is returning. My grief comes from here (pandas/tests/extension/base/reshaping.py):

    def test_merge_on_extension_array_duplicates(self, data):
    # GH 23020                                                                                                                                                                                                                                                                                                      
        a, b = data[:2]
    key = type(data)._from_sequence([a, b, a], dtype=data.dtype)
    df1 = pd.DataFrame({"key": key, "val": [1, 2, 3]})
        df2 = pd.DataFrame({"key": key, "val": [1, 2, 3]})

        result = pd.merge(df1, df2, on="key")
        expected = pd.DataFrame(
            {
                "key": key.take([0, 0, 0, 0, 1]),
                "val_x": [1, 1, 3, 3, 2],
        "val_y": [1, 3, 1, 3, 2],
            }
        )
        self.assert_frame_equal(result, expected)

data is [1.0, 2.0, 1.0] for both df1 and df2 we get

   key  val
0  1.0    1
1  2.0    2
2  1.0    3

Down in the merge we have:

> /Users/michael/Documents/GitHub/MichaelTiemannOSC/pandas-dev/pandas/core/reshape/merge.py(2396)_factorize_keys()
-> isinstance(lk.dtype, DatetimeTZDtype) and isinstance(rk.dtype, DatetimeTZDtype)
(Pdb) lk  # rk is the same
<PintArray>
[1.0, 2.0, 1.0]
Length: 3, dtype: pint[nanometer]
(Pdb) 

and the obviously NONUNIQUE values here:

(Pdb) lk._values_for_factorize()  # rk._values_for_factorize() is the same
(array([1.0, 2.0, 1.0], dtype=object), nan)

My understanding was that values returned by _values_for_factorize should be unique (and should not contain the na_value if use_na_sentinel is true). When I allow my _values_for_factorize to return duplicated values, other tests fail.

Clearly this code (from pandas/core/arrays/base.py) is doing nothing to unique-ify the values of the data, though the documentation snippet does refer to uniques (which are not present in the code):

    def _values_for_factorize(self) -> tuple[np.ndarray, Any]:
        """                                                                                                                                                                                                                                                                                                             
        Return an array and missing value suitable for factorization.                                                                                                                                                                                                                                                   

        Returns                                                                                                                                                                                                                                                                                                         
        -------                                                                                                                                                                                                                                                                                                         
        values : ndarray                                                                                                                                                                                                                                                                                                
            An array suitable for factorization. This should maintain order                                                                                                                                                                                                                                             
            and be a supported dtype (Float64, Int64, UInt64, String, Object).                                                                                                                                                                                                                                          
            By default, the extension array is cast to object dtype.                                                                                                                                                                                                                                                    
        na_value : object                                                                                                                                                                                                                                                                                               
            The value in `values` to consider missing. This will be treated                                                                                                                                                                                                                                             
            as NA in the factorization routines, so it will be coded as                                                                                                                                                                                                                                                 
            `-1` and not included in `uniques`. By default,                                                                                                                                                                                                                                                             
            ``np.nan`` is used.                                                                                                                                                                                                                                                                                         

        Notes                                                                                                                                                                                                                                                                                                           
        -----                                                                                                                                                                                                                                                                                                           
        The values returned by this method are also used in                                                                                                                                                                                                                                                             
        :func:`pandas.util.hash_pandas_object`. If needed, this can be                                                                                                                                                                                                                                                  
        overridden in the ``self._hash_pandas_object()`` method.                                                                                                                                                                                                                                                        
        """
        return self.astype(object), np.nan

Help?

jbrockmendel commented 1 year ago

My understanding was that values returned by _values_for_factorize should be unique (and should not contain the na_value if use_na_sentinel is true). When I allow my _values_for_factorize to return duplicated values, other tests fail.

An EA's _values_for_factorize does not need to be unique. The reference to "uniques" in the _values_for_factorize docstring is about what is returned by factorize. I agree this is confusing, and in fact am of the opinion that _values_for_factorize is a bad pattern that should go in general (xref #53501). As for things failing when you make values_for_factorize return something non-unique, let's find a dedicated place to discuss that. Is pint-pandas#140 appropriate for that?

MichaelTiemannOSC commented 1 year ago

I got it sorted by aligning the non-unique _values_for_factorize with my EA factorize code that does the unique-ification itself. All good. I'm now passing. Please close.