pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: Test failures on 32-bit x86 with pyarrow installed #57523

Open mgorny opened 8 months ago


Pandas version checks

Reproducible Example

python -m pytest # ;-)

Issue Description

When running the test suite on 32-bit x86 with pyarrow installed, I'm getting the following test failures (compared to a run without pyarrow):

FAILED pandas/tests/indexes/numeric/test_indexing.py::TestGetIndexer::test_get_indexer_arrow_dictionary_target - AssertionError: numpy array are different
FAILED pandas/tests/interchange/test_impl.py::test_large_string_pyarrow - OverflowError: Python int too large to convert to C ssize_t
FAILED pandas/tests/frame/methods/test_join.py::test_join_on_single_col_dup_on_right[string[pyarrow]] - ValueError: putmask: output array is read-only
FAILED pandas/tests/reshape/merge/test_multi.py::TestMergeMulti::test_left_join_multi_index[False-True] - ValueError: putmask: output array is read-only
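
All four failures look consistent with 32-bit pointer width rather than with pyarrow itself. A minimal illustration of the first two (my reading of the tracebacks, not a confirmed diagnosis):

```python
import numpy as np

# Failure 1: get_indexer returns np.intp, whose width follows the platform
# pointer size (int32 on 32-bit x86, int64 on x86-64), while the test
# hard-codes an int64 expectation.
print(np.dtype(np.intp))        # int32 on a 32-bit build, int64 on 64-bit

# Failure 2: the traceback ends converting a raw buffer address to a
# C ssize_t inside pa.foreign_buffer(). The address captured in the
# traceback exceeds the signed 32-bit range:
ptr = 3445199088                # data_buff.ptr from the traceback
print(ptr > 2**31 - 1)          # True -> "int too large to convert to C ssize_t"
```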
Tracebacks ```pytb _______________________________________ TestGetIndexer.test_get_indexer_arrow_dictionary_target _______________________________________ [gw8] linux -- Python 3.11.7 /var/tmp/portage/dev-python/pandas-2.2.0-r1/work/pandas-2.2.0-python3_11/install/usr/bin/python3.11 self = def test_get_indexer_arrow_dictionary_target(self): pa = pytest.importorskip("pyarrow") target = Index( ArrowExtensionArray( pa.array([1, 2], type=pa.dictionary(pa.int8(), pa.int8())) ) ) idx = Index([1]) result = idx.get_indexer(target) expected = np.array([0, -1], dtype=np.int64) > tm.assert_numpy_array_equal(result, expected) E AssertionError: numpy array are different E E Attribute "dtype" are different E [left]: int32 E [right]: int64 expected = array([ 0, -1], dtype=int64) idx = Index([1], dtype='int64') pa = result = array([ 0, -1], dtype=int32) self = target = Index([1, 2], dtype='dictionary[pyarrow]') pandas/tests/indexes/numeric/test_indexing.py:406: AssertionError ______________________________________________________ test_large_string_pyarrow ______________________________________________________ [gw8] linux -- Python 3.11.7 /var/tmp/portage/dev-python/pandas-2.2.0-r1/work/pandas-2.2.0-python3_11/install/usr/bin/python3.11 def test_large_string_pyarrow(): # GH 52795 pa = pytest.importorskip("pyarrow", "11.0.0") arr = ["Mon", "Tue"] table = pa.table({"weekday": pa.array(arr, "large_string")}) exchange_df = table.__dataframe__() result = from_dataframe(exchange_df) expected = pd.DataFrame({"weekday": ["Mon", "Tue"]}) tm.assert_frame_equal(result, expected) # check round-trip > assert pa.Table.equals(pa.interchange.from_dataframe(result), table) arr = ['Mon', 'Tue'] exchange_df = expected = weekday 0 Mon 1 Tue pa = result = weekday 0 Mon 1 Tue table = pyarrow.Table weekday: large_string ---- weekday: [["Mon","Tue"]] pandas/tests/interchange/test_impl.py:104: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ /usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:113: in from_dataframe return _from_dataframe(df.__dataframe__(allow_copy=allow_copy), allow_copy = True df = weekday 0 Mon 1 Tue /usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:136: in _from_dataframe batch = protocol_df_chunk_to_pyarrow(chunk, allow_copy) allow_copy = True batches = [] chunk = df = /usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:182: in protocol_df_chunk_to_pyarrow columns[name] = column_to_array(col, allow_copy) allow_copy = True col = columns = {} df = dtype = name = 'weekday' /usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:214: in column_to_array data = buffers_to_array(buffers, data_type, allow_copy = True buffers = {'data': (PandasBuffer({'bufsize': 6, 'ptr': 3445199088, 'device': 'CPU'}), (, 8, 'u', '=')), 'offsets': (PandasBuffer({'bufsize': 24, 'ptr': 1546049072, 'device': 'CPU'}), (, 64, 'l', '=')), 'validity': (PandasBuffer({'bufsize': 2, 'ptr': 1544334624, 'device': 'CPU'}), (, 8, 'b', '='))} col = data_type = (, 8, 'u', '=') /usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:396: in buffers_to_array data_pa_buffer = pa.foreign_buffer(data_buff.ptr, data_buff.bufsize, _ = (, 8, 'u', '=') allow_copy = True buffers = {'data': (PandasBuffer({'bufsize': 6, 'ptr': 3445199088, 'device': 'CPU'}), (, 8, 'u', '=')), 'offsets': (PandasBuffer({'bufsize': 24, 'ptr': 1546049072, 'device': 'CPU'}), (, 64, 'l', '=')), 'validity': (PandasBuffer({'bufsize': 2, 'ptr': 1544334624, 'device': 'CPU'}), (, 8, 'b', '='))} data_buff = PandasBuffer({'bufsize': 6, 'ptr': 3445199088, 'device': 'CPU'}) data_type = (, 8, 'u', '=') describe_null = (, 0) length = 2 offset = 0 offset_buff = PandasBuffer({'bufsize': 24, 'ptr': 1546049072, 'device': 'CPU'}) offset_dtype = (, 64, 'l', '=') validity_buff = PandasBuffer({'bufsize': 2, 'ptr': 1544334624, 'device': 'CPU'}) validity_dtype = 
(, 8, 'b', '=') _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ > ??? E OverflowError: Python int too large to convert to C ssize_t pyarrow/io.pxi:1990: OverflowError ________________________________________ test_join_on_single_col_dup_on_right[string[pyarrow]] ________________________________________ [gw5] linux -- Python 3.11.7 /var/tmp/portage/dev-python/pandas-2.2.0-r1/work/pandas-2.2.0-python3_11/install/usr/bin/python3.11 left_no_dup = a b 0 a cat 1 b dog 2 c weasel 3 d horse right_w_dups = c a meow bark um... weasel noise? nay chirp e moo dtype = 'string[pyarrow]' @pytest.mark.parametrize("dtype", ["object", "string[pyarrow]"]) def test_join_on_single_col_dup_on_right(left_no_dup, right_w_dups, dtype): # GH 46622 # Dups on right allowed by one_to_many constraint if dtype == "string[pyarrow]": pytest.importorskip("pyarrow") left_no_dup = left_no_dup.astype(dtype) right_w_dups.index = right_w_dups.index.astype(dtype) > left_no_dup.join( right_w_dups, on="a", validate="one_to_many", ) dtype = 'string[pyarrow]' left_no_dup = a b 0 a cat 1 b dog 2 c weasel 3 d horse right_w_dups = c a meow bark um... weasel noise? nay chirp e moo pandas/tests/frame/methods/test_join.py:169: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ pandas/core/frame.py:10730: in join return merge( concat = how = 'left' lsuffix = '' merge = on = 'a' other = c a meow bark um... weasel noise? nay chirp e moo rsuffix = '' self = a b 0 a cat 1 b dog 2 c weasel 3 d horse sort = False validate = 'one_to_many' pandas/core/reshape/merge.py:184: in merge return op.get_result(copy=copy) copy = None how = 'left' indicator = False left = a b 0 a cat 1 b dog 2 c weasel 3 d horse left_df = a b 0 a cat 1 b dog 2 c weasel 3 d horse left_index = False left_on = 'a' on = None op = right = c a meow bark um... weasel noise? 
nay chirp e moo right_df = c a meow bark um... weasel noise? nay chirp e moo right_index = True right_on = None sort = False suffixes = ('', '') validate = 'one_to_many' pandas/core/reshape/merge.py:886: in get_result join_index, left_indexer, right_indexer = self._get_join_info() copy = None self = pandas/core/reshape/merge.py:1142: in _get_join_info join_index, left_indexer, right_indexer = _left_join_on_index( left_ax = RangeIndex(start=0, stop=4, step=1) right_ax = Index([, , , , , 'e'], dtype='string', name='a') self = pandas/core/reshape/merge.py:2385: in _left_join_on_index left_key, right_key, count = _factorize_keys(lkey, rkey, sort=sort) join_keys = [ ['a', 'b', 'c', 'd'] Length: 4, dtype: string] left_ax = RangeIndex(start=0, stop=4, step=1) lkey = ['a', 'b', 'c', 'd'] Length: 4, dtype: string right_ax = Index([, , , , , 'e'], dtype='string', name='a') rkey = [, , , , , 'e'] Length: 6, dtype: string sort = False _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ lk = [ [ "a", "b", "c", "d" ] ] rk = [ [ null, null, null, null, null, "e" ] ] sort = False def _factorize_keys( lk: ArrayLike, rk: ArrayLike, sort: bool = True ) -> tuple[npt.NDArray[np.intp], npt.NDArray[np.intp], int]: """ Encode left and right keys as enumerated types. This is used to get the join indexers to be used when merging DataFrames. Parameters ---------- lk : ndarray, ExtensionArray Left key. rk : ndarray, ExtensionArray Right key. sort : bool, defaults to True If True, the encoding is done such that the unique elements in the keys are sorted. Returns ------- np.ndarray[np.intp] Left (resp. right if called with `key='right'`) labels, as enumerated type. np.ndarray[np.intp] Right (resp. left if called with `key='right'`) labels, as enumerated type. int Number of unique elements in union of left and right labels. 
See Also -------- merge : Merge DataFrame or named Series objects with a database-style join. algorithms.factorize : Encode the object as an enumerated type or categorical variable. Examples -------- >>> lk = np.array(["a", "c", "b"]) >>> rk = np.array(["a", "c"]) Here, the unique values are `'a', 'b', 'c'`. With the default `sort=True`, the encoding will be `{0: 'a', 1: 'b', 2: 'c'}`: >>> pd.core.reshape.merge._factorize_keys(lk, rk) (array([0, 2, 1]), array([0, 2]), 3) With the `sort=False`, the encoding will correspond to the order in which the unique elements first appear: `{0: 'a', 1: 'c', 2: 'b'}`: >>> pd.core.reshape.merge._factorize_keys(lk, rk, sort=False) (array([0, 1, 2]), array([0, 1]), 3) """ # TODO: if either is a RangeIndex, we can likely factorize more efficiently? if ( isinstance(lk.dtype, DatetimeTZDtype) and isinstance(rk.dtype, DatetimeTZDtype) ) or (lib.is_np_dtype(lk.dtype, "M") and lib.is_np_dtype(rk.dtype, "M")): # Extract the ndarray (UTC-localized) values # Note: we dont need the dtypes to match, as these can still be compared lk, rk = cast("DatetimeArray", lk)._ensure_matching_resos(rk) lk = cast("DatetimeArray", lk)._ndarray rk = cast("DatetimeArray", rk)._ndarray elif ( isinstance(lk.dtype, CategoricalDtype) and isinstance(rk.dtype, CategoricalDtype) and lk.dtype == rk.dtype ): assert isinstance(lk, Categorical) assert isinstance(rk, Categorical) # Cast rk to encoding so we can compare codes with lk rk = lk._encode_with_my_categories(rk) lk = ensure_int64(lk.codes) rk = ensure_int64(rk.codes) elif isinstance(lk, ExtensionArray) and lk.dtype == rk.dtype: if (isinstance(lk.dtype, ArrowDtype) and is_string_dtype(lk.dtype)) or ( isinstance(lk.dtype, StringDtype) and lk.dtype.storage in ["pyarrow", "pyarrow_numpy"] ): import pyarrow as pa import pyarrow.compute as pc len_lk = len(lk) lk = lk._pa_array # type: ignore[attr-defined] rk = rk._pa_array # type: ignore[union-attr] dc = ( pa.chunked_array(lk.chunks + rk.chunks) # type: 
ignore[union-attr] .combine_chunks() .dictionary_encode() ) llab, rlab, count = ( pc.fill_null(dc.indices[slice(len_lk)], -1) .to_numpy() .astype(np.intp, copy=False), pc.fill_null(dc.indices[slice(len_lk, None)], -1) .to_numpy() .astype(np.intp, copy=False), len(dc.dictionary), ) if sort: uniques = dc.dictionary.to_numpy(zero_copy_only=False) llab, rlab = _sort_labels(uniques, llab, rlab) if dc.null_count > 0: lmask = llab == -1 lany = lmask.any() rmask = rlab == -1 rany = rmask.any() if lany: np.putmask(llab, lmask, count) if rany: > np.putmask(rlab, rmask, count) E ValueError: putmask: output array is read-only count = 5 dc = -- dictionary: [ "a", "b", "c", "d", "e" ] -- indices: [ 0, 1, 2, 3, null, null, null, null, null, 4 ] lany = False len_lk = 4 lk = [ [ "a", "b", "c", "d" ] ] llab = array([0, 1, 2, 3]) lmask = array([False, False, False, False]) pa = pc = rany = True rk = [ [ null, null, null, null, null, "e" ] ] rlab = array([-1, -1, -1, -1, -1, 4]) rmask = array([ True, True, True, True, True, False]) sort = False pandas/core/reshape/merge.py:2514: ValueError ________________________________________ TestMergeMulti.test_left_join_multi_index[False-True] ________________________________________ [gw3] linux -- Python 3.11.7 /var/tmp/portage/dev-python/pandas-2.2.0-r1/work/pandas-2.2.0-python3_11/install/usr/bin/python3.11 self = , sort = False, infer_string = True @pytest.mark.parametrize( "infer_string", [False, pytest.param(True, marks=td.skip_if_no("pyarrow"))] ) @pytest.mark.parametrize("sort", [True, False]) def test_left_join_multi_index(self, sort, infer_string): with option_context("future.infer_string", infer_string): icols = ["1st", "2nd", "3rd"] def bind_cols(df): iord = lambda a: 0 if a != a else ord(a) f = lambda ts: ts.map(iord) - ord("a") return f(df["1st"]) + f(df["3rd"]) * 1e2 + df["2nd"].fillna(0) * 10 def run_asserts(left, right, sort): res = left.join(right, on=icols, how="left", sort=sort) assert len(left) < len(res) + 1 assert not 
res["4th"].isna().any() assert not res["5th"].isna().any() tm.assert_series_equal(res["4th"], -res["5th"], check_names=False) result = bind_cols(res.iloc[:, :-2]) tm.assert_series_equal(res["4th"], result, check_names=False) assert result.name is None if sort: tm.assert_frame_equal(res, res.sort_values(icols, kind="mergesort")) out = merge(left, right.reset_index(), on=icols, sort=sort, how="left") res.index = RangeIndex(len(res)) tm.assert_frame_equal(out, res) lc = list(map(chr, np.arange(ord("a"), ord("z") + 1))) left = DataFrame( np.random.default_rng(2).choice(lc, (50, 2)), columns=["1st", "3rd"] ) # Explicit cast to float to avoid implicit cast when setting nan left.insert( 1, "2nd", np.random.default_rng(2).integers(0, 10, len(left)).astype("float"), ) i = np.random.default_rng(2).permutation(len(left)) right = left.iloc[i].copy() left["4th"] = bind_cols(left) right["5th"] = -bind_cols(right) right.set_index(icols, inplace=True) run_asserts(left, right, sort) # inject some nulls left.loc[1::4, "1st"] = np.nan left.loc[2::5, "2nd"] = np.nan left.loc[3::6, "3rd"] = np.nan left["4th"] = bind_cols(left) i = np.random.default_rng(2).permutation(len(left)) right = left.iloc[i, :-1] right["5th"] = -bind_cols(right) right.set_index(icols, inplace=True) > run_asserts(left, right, sort) bind_cols = .bind_cols at 0xc9928b18> i = array([ 6, 40, 33, 38, 7, 46, 28, 45, 5, 34, 12, 18, 27, 3, 9, 39, 42, 23, 0, 26, 4, 10, 14, 41, 16, 43, 15, 48, 13, 24, 20, 25, 22, 49, 2, 11, 32, 44, 47, 17, 19, 37, 21, 29, 31, 30, 35, 36, 8, 1]) icols = ['1st', '2nd', '3rd'] infer_string = True lc = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z'] left = 1st 2nd 3rd 4th 0 v 8.0 g 701.0 1 NaN 2.0 h 623.0 2 k NaN v 2110.0 3 l 2.0 NaN -9669.0 4 i 4.0 p 1548.0 5 NaN 8.0 s 1783.0 6 z 4.0 e 465.0 7 w NaN b 122.0 8 o 3.0 h 744.0 9 NaN 6.0 NaN -9737.0 10 h 8.0 o 1487.0 11 g 7.0 d 376.0 12 t NaN l 1119.0 13 NaN 1.0 r 
1613.0 14 y 8.0 k 1104.0 15 f 0.0 NaN -9695.0 16 y 5.0 z 2574.0 17 NaN NaN r 1603.0 18 j 2.0 k 1029.0 19 b 6.0 e 461.0 20 i 3.0 i 838.0 21 NaN 5.0 NaN -9747.0 22 s NaN x 2318.0 23 w 1.0 u 2032.0 24 z 7.0 i 895.0 25 NaN 4.0 y 2343.0 26 f 6.0 m 1265.0 27 o NaN NaN -9686.0 28 s 9.0 c 308.0 29 NaN 4.0 c 143.0 30 y 2.0 f 544.0 31 l 6.0 w 2271.0 32 n NaN r 1713.0 33 NaN 9.0 NaN -9707.0 34 p 8.0 q 1695.0 35 l 6.0 k 1071.0 36 p 3.0 n 1345.0 37 NaN NaN p 1403.0 38 m 0.0 w 2212.0 39 f 1.0 NaN -9685.0 40 m 3.0 x 2342.0 41 NaN 3.0 p 1433.0 42 b NaN v 2101.0 43 l 5.0 m 1261.0 44 c 6.0 s 1862.0 45 NaN 8.0 NaN -9717.0 46 t 8.0 n 1399.0 47 b NaN f 501.0 48 g 9.0 c 296.0 49 NaN 3.0 b 33.0 right = 5th 1st 2nd 3rd z 4.0 e -465.0 m 3.0 x -2342.0 nan 9.0 nan 9707.0 m 0.0 w -2212.0 w NaN b -122.0 t 8.0 n -1399.0 s 9.0 c -308.0 nan 8.0 nan 9717.0 s -1783.0 p 8.0 q -1695.0 t NaN l -1119.0 j 2.0 k -1029.0 o NaN nan 9686.0 l 2.0 nan 9669.0 nan 6.0 nan 9737.0 f 1.0 nan 9685.0 b NaN v -2101.0 w 1.0 u -2032.0 v 8.0 g -701.0 f 6.0 m -1265.0 i 4.0 p -1548.0 h 8.0 o -1487.0 y 8.0 k -1104.0 nan 3.0 p -1433.0 y 5.0 z -2574.0 l 5.0 m -1261.0 f 0.0 nan 9695.0 g 9.0 c -296.0 nan 1.0 r -1613.0 z 7.0 i -895.0 i 3.0 i -838.0 nan 4.0 y -2343.0 s NaN x -2318.0 nan 3.0 b -33.0 k NaN v -2110.0 g 7.0 d -376.0 n NaN r -1713.0 c 6.0 s -1862.0 b NaN f -501.0 nan NaN r -1603.0 b 6.0 e -461.0 nan NaN p -1403.0 5.0 nan 9747.0 4.0 c -143.0 l 6.0 w -2271.0 y 2.0 f -544.0 l 6.0 k -1071.0 p 3.0 n -1345.0 o 3.0 h -744.0 nan 2.0 h -623.0 run_asserts = .run_asserts at 0xc9928bb8> self = sort = False pandas/tests/reshape/merge/test_multi.py:158: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ pandas/tests/reshape/merge/test_multi.py:108: in run_asserts res = left.join(right, on=icols, how="left", sort=sort) bind_cols = .bind_cols at 0xc9928b18> icols = ['1st', '2nd', '3rd'] left = 1st 2nd 3rd 4th 0 v 8.0 g 701.0 1 NaN 2.0 h 623.0 2 k 
NaN v 2110.0 3 l 2.0 NaN -9669.0 4 i 4.0 p 1548.0 5 NaN 8.0 s 1783.0 6 z 4.0 e 465.0 7 w NaN b 122.0 8 o 3.0 h 744.0 9 NaN 6.0 NaN -9737.0 10 h 8.0 o 1487.0 11 g 7.0 d 376.0 12 t NaN l 1119.0 13 NaN 1.0 r 1613.0 14 y 8.0 k 1104.0 15 f 0.0 NaN -9695.0 16 y 5.0 z 2574.0 17 NaN NaN r 1603.0 18 j 2.0 k 1029.0 19 b 6.0 e 461.0 20 i 3.0 i 838.0 21 NaN 5.0 NaN -9747.0 22 s NaN x 2318.0 23 w 1.0 u 2032.0 24 z 7.0 i 895.0 25 NaN 4.0 y 2343.0 26 f 6.0 m 1265.0 27 o NaN NaN -9686.0 28 s 9.0 c 308.0 29 NaN 4.0 c 143.0 30 y 2.0 f 544.0 31 l 6.0 w 2271.0 32 n NaN r 1713.0 33 NaN 9.0 NaN -9707.0 34 p 8.0 q 1695.0 35 l 6.0 k 1071.0 36 p 3.0 n 1345.0 37 NaN NaN p 1403.0 38 m 0.0 w 2212.0 39 f 1.0 NaN -9685.0 40 m 3.0 x 2342.0 41 NaN 3.0 p 1433.0 42 b NaN v 2101.0 43 l 5.0 m 1261.0 44 c 6.0 s 1862.0 45 NaN 8.0 NaN -9717.0 46 t 8.0 n 1399.0 47 b NaN f 501.0 48 g 9.0 c 296.0 49 NaN 3.0 b 33.0 right = 5th 1st 2nd 3rd z 4.0 e -465.0 m 3.0 x -2342.0 nan 9.0 nan 9707.0 m 0.0 w -2212.0 w NaN b -122.0 t 8.0 n -1399.0 s 9.0 c -308.0 nan 8.0 nan 9717.0 s -1783.0 p 8.0 q -1695.0 t NaN l -1119.0 j 2.0 k -1029.0 o NaN nan 9686.0 l 2.0 nan 9669.0 nan 6.0 nan 9737.0 f 1.0 nan 9685.0 b NaN v -2101.0 w 1.0 u -2032.0 v 8.0 g -701.0 f 6.0 m -1265.0 i 4.0 p -1548.0 h 8.0 o -1487.0 y 8.0 k -1104.0 nan 3.0 p -1433.0 y 5.0 z -2574.0 l 5.0 m -1261.0 f 0.0 nan 9695.0 g 9.0 c -296.0 nan 1.0 r -1613.0 z 7.0 i -895.0 i 3.0 i -838.0 nan 4.0 y -2343.0 s NaN x -2318.0 nan 3.0 b -33.0 k NaN v -2110.0 g 7.0 d -376.0 n NaN r -1713.0 c 6.0 s -1862.0 b NaN f -501.0 nan NaN r -1603.0 b 6.0 e -461.0 nan NaN p -1403.0 5.0 nan 9747.0 4.0 c -143.0 l 6.0 w -2271.0 y 2.0 f -544.0 l 6.0 k -1071.0 p 3.0 n -1345.0 o 3.0 h -744.0 nan 2.0 h -623.0 sort = False pandas/core/frame.py:10730: in join return merge( concat = how = 'left' lsuffix = '' merge = on = ['1st', '2nd', '3rd'] other = 5th 1st 2nd 3rd z 4.0 e -465.0 m 3.0 x -2342.0 nan 9.0 nan 9707.0 m 0.0 w -2212.0 w NaN b -122.0 t 8.0 n -1399.0 s 9.0 c -308.0 nan 8.0 nan 9717.0 
s -1783.0 p 8.0 q -1695.0 t NaN l -1119.0 j 2.0 k -1029.0 o NaN nan 9686.0 l 2.0 nan 9669.0 nan 6.0 nan 9737.0 f 1.0 nan 9685.0 b NaN v -2101.0 w 1.0 u -2032.0 v 8.0 g -701.0 f 6.0 m -1265.0 i 4.0 p -1548.0 h 8.0 o -1487.0 y 8.0 k -1104.0 nan 3.0 p -1433.0 y 5.0 z -2574.0 l 5.0 m -1261.0 f 0.0 nan 9695.0 g 9.0 c -296.0 nan 1.0 r -1613.0 z 7.0 i -895.0 i 3.0 i -838.0 nan 4.0 y -2343.0 s NaN x -2318.0 nan 3.0 b -33.0 k NaN v -2110.0 g 7.0 d -376.0 n NaN r -1713.0 c 6.0 s -1862.0 b NaN f -501.0 nan NaN r -1603.0 b 6.0 e -461.0 nan NaN p -1403.0 5.0 nan 9747.0 4.0 c -143.0 l 6.0 w -2271.0 y 2.0 f -544.0 l 6.0 k -1071.0 p 3.0 n -1345.0 o 3.0 h -744.0 nan 2.0 h -623.0 rsuffix = '' self = 1st 2nd 3rd 4th 0 v 8.0 g 701.0 1 NaN 2.0 h 623.0 2 k NaN v 2110.0 3 l 2.0 NaN -9669.0 4 i 4.0 p 1548.0 5 NaN 8.0 s 1783.0 6 z 4.0 e 465.0 7 w NaN b 122.0 8 o 3.0 h 744.0 9 NaN 6.0 NaN -9737.0 10 h 8.0 o 1487.0 11 g 7.0 d 376.0 12 t NaN l 1119.0 13 NaN 1.0 r 1613.0 14 y 8.0 k 1104.0 15 f 0.0 NaN -9695.0 16 y 5.0 z 2574.0 17 NaN NaN r 1603.0 18 j 2.0 k 1029.0 19 b 6.0 e 461.0 20 i 3.0 i 838.0 21 NaN 5.0 NaN -9747.0 22 s NaN x 2318.0 23 w 1.0 u 2032.0 24 z 7.0 i 895.0 25 NaN 4.0 y 2343.0 26 f 6.0 m 1265.0 27 o NaN NaN -9686.0 28 s 9.0 c 308.0 29 NaN 4.0 c 143.0 30 y 2.0 f 544.0 31 l 6.0 w 2271.0 32 n NaN r 1713.0 33 NaN 9.0 NaN -9707.0 34 p 8.0 q 1695.0 35 l 6.0 k 1071.0 36 p 3.0 n 1345.0 37 NaN NaN p 1403.0 38 m 0.0 w 2212.0 39 f 1.0 NaN -9685.0 40 m 3.0 x 2342.0 41 NaN 3.0 p 1433.0 42 b NaN v 2101.0 43 l 5.0 m 1261.0 44 c 6.0 s 1862.0 45 NaN 8.0 NaN -9717.0 46 t 8.0 n 1399.0 47 b NaN f 501.0 48 g 9.0 c 296.0 49 NaN 3.0 b 33.0 sort = False validate = None pandas/core/reshape/merge.py:184: in merge return op.get_result(copy=copy) copy = None how = 'left' indicator = False left = 1st 2nd 3rd 4th 0 v 8.0 g 701.0 1 NaN 2.0 h 623.0 2 k NaN v 2110.0 3 l 2.0 NaN -9669.0 4 i 4.0 p 1548.0 5 NaN 8.0 s 1783.0 6 z 4.0 e 465.0 7 w NaN b 122.0 8 o 3.0 h 744.0 9 NaN 6.0 NaN -9737.0 10 h 8.0 o 1487.0 11 
g 7.0 d 376.0 12 t NaN l 1119.0 13 NaN 1.0 r 1613.0 14 y 8.0 k 1104.0 15 f 0.0 NaN -9695.0 16 y 5.0 z 2574.0 17 NaN NaN r 1603.0 18 j 2.0 k 1029.0 19 b 6.0 e 461.0 20 i 3.0 i 838.0 21 NaN 5.0 NaN -9747.0 22 s NaN x 2318.0 23 w 1.0 u 2032.0 24 z 7.0 i 895.0 25 NaN 4.0 y 2343.0 26 f 6.0 m 1265.0 27 o NaN NaN -9686.0 28 s 9.0 c 308.0 29 NaN 4.0 c 143.0 30 y 2.0 f 544.0 31 l 6.0 w 2271.0 32 n NaN r 1713.0 33 NaN 9.0 NaN -9707.0 34 p 8.0 q 1695.0 35 l 6.0 k 1071.0 36 p 3.0 n 1345.0 37 NaN NaN p 1403.0 38 m 0.0 w 2212.0 39 f 1.0 NaN -9685.0 40 m 3.0 x 2342.0 41 NaN 3.0 p 1433.0 42 b NaN v 2101.0 43 l 5.0 m 1261.0 44 c 6.0 s 1862.0 45 NaN 8.0 NaN -9717.0 46 t 8.0 n 1399.0 47 b NaN f 501.0 48 g 9.0 c 296.0 49 NaN 3.0 b 33.0 left_df = 1st 2nd 3rd 4th 0 v 8.0 g 701.0 1 NaN 2.0 h 623.0 2 k NaN v 2110.0 3 l 2.0 NaN -9669.0 4 i 4.0 p 1548.0 5 NaN 8.0 s 1783.0 6 z 4.0 e 465.0 7 w NaN b 122.0 8 o 3.0 h 744.0 9 NaN 6.0 NaN -9737.0 10 h 8.0 o 1487.0 11 g 7.0 d 376.0 12 t NaN l 1119.0 13 NaN 1.0 r 1613.0 14 y 8.0 k 1104.0 15 f 0.0 NaN -9695.0 16 y 5.0 z 2574.0 17 NaN NaN r 1603.0 18 j 2.0 k 1029.0 19 b 6.0 e 461.0 20 i 3.0 i 838.0 21 NaN 5.0 NaN -9747.0 22 s NaN x 2318.0 23 w 1.0 u 2032.0 24 z 7.0 i 895.0 25 NaN 4.0 y 2343.0 26 f 6.0 m 1265.0 27 o NaN NaN -9686.0 28 s 9.0 c 308.0 29 NaN 4.0 c 143.0 30 y 2.0 f 544.0 31 l 6.0 w 2271.0 32 n NaN r 1713.0 33 NaN 9.0 NaN -9707.0 34 p 8.0 q 1695.0 35 l 6.0 k 1071.0 36 p 3.0 n 1345.0 37 NaN NaN p 1403.0 38 m 0.0 w 2212.0 39 f 1.0 NaN -9685.0 40 m 3.0 x 2342.0 41 NaN 3.0 p 1433.0 42 b NaN v 2101.0 43 l 5.0 m 1261.0 44 c 6.0 s 1862.0 45 NaN 8.0 NaN -9717.0 46 t 8.0 n 1399.0 47 b NaN f 501.0 48 g 9.0 c 296.0 49 NaN 3.0 b 33.0 left_index = False left_on = ['1st', '2nd', '3rd'] on = None op = right = 5th 1st 2nd 3rd z 4.0 e -465.0 m 3.0 x -2342.0 nan 9.0 nan 9707.0 m 0.0 w -2212.0 w NaN b -122.0 t 8.0 n -1399.0 s 9.0 c -308.0 nan 8.0 nan 9717.0 s -1783.0 p 8.0 q -1695.0 t NaN l -1119.0 j 2.0 k -1029.0 o NaN nan 9686.0 l 2.0 nan 9669.0 nan 6.0 
nan 9737.0 f 1.0 nan 9685.0 b NaN v -2101.0 w 1.0 u -2032.0 v 8.0 g -701.0 f 6.0 m -1265.0 i 4.0 p -1548.0 h 8.0 o -1487.0 y 8.0 k -1104.0 nan 3.0 p -1433.0 y 5.0 z -2574.0 l 5.0 m -1261.0 f 0.0 nan 9695.0 g 9.0 c -296.0 nan 1.0 r -1613.0 z 7.0 i -895.0 i 3.0 i -838.0 nan 4.0 y -2343.0 s NaN x -2318.0 nan 3.0 b -33.0 k NaN v -2110.0 g 7.0 d -376.0 n NaN r -1713.0 c 6.0 s -1862.0 b NaN f -501.0 nan NaN r -1603.0 b 6.0 e -461.0 nan NaN p -1403.0 5.0 nan 9747.0 4.0 c -143.0 l 6.0 w -2271.0 y 2.0 f -544.0 l 6.0 k -1071.0 p 3.0 n -1345.0 o 3.0 h -744.0 nan 2.0 h -623.0 right_df = 5th 1st 2nd 3rd z 4.0 e -465.0 m 3.0 x -2342.0 nan 9.0 nan 9707.0 m 0.0 w -2212.0 w NaN b -122.0 t 8.0 n -1399.0 s 9.0 c -308.0 nan 8.0 nan 9717.0 s -1783.0 p 8.0 q -1695.0 t NaN l -1119.0 j 2.0 k -1029.0 o NaN nan 9686.0 l 2.0 nan 9669.0 nan 6.0 nan 9737.0 f 1.0 nan 9685.0 b NaN v -2101.0 w 1.0 u -2032.0 v 8.0 g -701.0 f 6.0 m -1265.0 i 4.0 p -1548.0 h 8.0 o -1487.0 y 8.0 k -1104.0 nan 3.0 p -1433.0 y 5.0 z -2574.0 l 5.0 m -1261.0 f 0.0 nan 9695.0 g 9.0 c -296.0 nan 1.0 r -1613.0 z 7.0 i -895.0 i 3.0 i -838.0 nan 4.0 y -2343.0 s NaN x -2318.0 nan 3.0 b -33.0 k NaN v -2110.0 g 7.0 d -376.0 n NaN r -1713.0 c 6.0 s -1862.0 b NaN f -501.0 nan NaN r -1603.0 b 6.0 e -461.0 nan NaN p -1403.0 5.0 nan 9747.0 4.0 c -143.0 l 6.0 w -2271.0 y 2.0 f -544.0 l 6.0 k -1071.0 p 3.0 n -1345.0 o 3.0 h -744.0 nan 2.0 h -623.0 right_index = True right_on = None sort = False suffixes = ('', '') validate = None pandas/core/reshape/merge.py:886: in get_result join_index, left_indexer, right_indexer = self._get_join_info() copy = None self = pandas/core/reshape/merge.py:1142: in _get_join_info join_index, left_indexer, right_indexer = _left_join_on_index( left_ax = RangeIndex(start=0, stop=50, step=1) right_ax = MultiIndex([('z', 4.0, 'e'), ('m', 3.0, 'x'), (nan, 9.0, nan), ('m', 0.0, 'w'), ('w', nan, 'b'), ('t', 8.0, 'n'), ('s', 9.0, 'c'), (nan, 8.0, nan), (nan, 8.0, 's'), ('p', 8.0, 'q'), ('t', nan, 'l'), ('j', 2.0, 
'k'), ('o', nan, nan), ('l', 2.0, nan), (nan, 6.0, nan), ('f', 1.0, nan), ('b', nan, 'v'), ('w', 1.0, 'u'), ('v', 8.0, 'g'), ('f', 6.0, 'm'), ('i', 4.0, 'p'), ('h', 8.0, 'o'), ('y', 8.0, 'k'), (nan, 3.0, 'p'), ('y', 5.0, 'z'), ('l', 5.0, 'm'), ('f', 0.0, nan), ('g', 9.0, 'c'), (nan, 1.0, 'r'), ('z', 7.0, 'i'), ('i', 3.0, 'i'), (nan, 4.0, 'y'), ('s', nan, 'x'), (nan, 3.0, 'b'), ('k', nan, 'v'), ('g', 7.0, 'd'), ('n', nan, 'r'), ('c', 6.0, 's'), ('b', nan, 'f'), (nan, nan, 'r'), ('b', 6.0, 'e'), (nan, nan, 'p'), (nan, 5.0, nan), (nan, 4.0, 'c'), ('l', 6.0, 'w'), ('y', 2.0, 'f'), ('l', 6.0, 'k'), ('p', 3.0, 'n'), ('o', 3.0, 'h'), (nan, 2.0, 'h')], names=['1st', '2nd', '3rd']) self = pandas/core/reshape/merge.py:2375: in _left_join_on_index lkey, rkey = _get_multiindex_indexer(join_keys, right_ax, sort=sort) join_keys = [ ['v', nan, 'k', 'l', 'i', nan, 'z', 'w', 'o', nan, 'h', 'g', 't', nan, 'y', 'f', 'y', nan, 'j', 'b', 'i', nan, 's', 'w', 'z', nan, 'f', 'o', 's', nan, 'y', 'l', 'n', nan, 'p', 'l', 'p', nan, 'm', 'f', 'm', nan, 'b', 'l', 'c', nan, 't', 'b', 'g', nan] Length: 50, dtype: string, array([ 8., 2., nan, 2., 4., 8., 4., nan, 3., 6., 8., 7., nan, 1., 8., 0., 5., nan, 2., 6., 3., 5., nan, 1., 7., 4., 6., nan, 9., 4., 2., 6., nan, 9., 8., 6., 3., nan, 0., 1., 3., 3., nan, 5., 6., 8., 8., nan, 9., 3.]), ['g', 'h', 'v', nan, 'p', 's', 'e', 'b', 'h', nan, 'o', 'd', 'l', 'r', 'k', nan, 'z', 'r', 'k', 'e', 'i', nan, 'x', 'u', 'i', 'y', 'm', nan, 'c', 'c', 'f', 'w', 'r', nan, 'q', 'k', 'n', 'p', 'w', nan, 'x', 'p', 'v', 'm', 's', nan, 'n', 'f', 'c', 'b'] Length: 50, dtype: string] left_ax = RangeIndex(start=0, stop=50, step=1) right_ax = MultiIndex([('z', 4.0, 'e'), ('m', 3.0, 'x'), (nan, 9.0, nan), ('m', 0.0, 'w'), ('w', nan, 'b'), ('t', 8.0, 'n'), ('s', 9.0, 'c'), (nan, 8.0, nan), (nan, 8.0, 's'), ('p', 8.0, 'q'), ('t', nan, 'l'), ('j', 2.0, 'k'), ('o', nan, nan), ('l', 2.0, nan), (nan, 6.0, nan), ('f', 1.0, nan), ('b', nan, 'v'), ('w', 1.0, 'u'), ('v', 8.0, 'g'), 
('f', 6.0, 'm'), ('i', 4.0, 'p'), ('h', 8.0, 'o'), ('y', 8.0, 'k'), (nan, 3.0, 'p'), ('y', 5.0, 'z'), ('l', 5.0, 'm'), ('f', 0.0, nan), ('g', 9.0, 'c'), (nan, 1.0, 'r'), ('z', 7.0, 'i'), ('i', 3.0, 'i'), (nan, 4.0, 'y'), ('s', nan, 'x'), (nan, 3.0, 'b'), ('k', nan, 'v'), ('g', 7.0, 'd'), ('n', nan, 'r'), ('c', 6.0, 's'), ('b', nan, 'f'), (nan, nan, 'r'), ('b', 6.0, 'e'), (nan, nan, 'p'), (nan, 5.0, nan), (nan, 4.0, 'c'), ('l', 6.0, 'w'), ('y', 2.0, 'f'), ('l', 6.0, 'k'), ('p', 3.0, 'n'), ('o', 3.0, 'h'), (nan, 2.0, 'h')], names=['1st', '2nd', '3rd']) sort = False pandas/core/reshape/merge.py:2309: in _get_multiindex_indexer zipped = zip(*mapped) index = MultiIndex([('z', 4.0, 'e'), ('m', 3.0, 'x'), (nan, 9.0, nan), ('m', 0.0, 'w'), ('w', nan, 'b'), ('t', 8.0, 'n'), ('s', 9.0, 'c'), (nan, 8.0, nan), (nan, 8.0, 's'), ('p', 8.0, 'q'), ('t', nan, 'l'), ('j', 2.0, 'k'), ('o', nan, nan), ('l', 2.0, nan), (nan, 6.0, nan), ('f', 1.0, nan), ('b', nan, 'v'), ('w', 1.0, 'u'), ('v', 8.0, 'g'), ('f', 6.0, 'm'), ('i', 4.0, 'p'), ('h', 8.0, 'o'), ('y', 8.0, 'k'), (nan, 3.0, 'p'), ('y', 5.0, 'z'), ('l', 5.0, 'm'), ('f', 0.0, nan), ('g', 9.0, 'c'), (nan, 1.0, 'r'), ('z', 7.0, 'i'), ('i', 3.0, 'i'), (nan, 4.0, 'y'), ('s', nan, 'x'), (nan, 3.0, 'b'), ('k', nan, 'v'), ('g', 7.0, 'd'), ('n', nan, 'r'), ('c', 6.0, 's'), ('b', nan, 'f'), (nan, nan, 'r'), ('b', 6.0, 'e'), (nan, nan, 'p'), (nan, 5.0, nan), (nan, 4.0, 'c'), ('l', 6.0, 'w'), ('y', 2.0, 'f'), ('l', 6.0, 'k'), ('p', 3.0, 'n'), ('o', 3.0, 'h'), (nan, 2.0, 'h')], names=['1st', '2nd', '3rd']) join_keys = [ ['v', nan, 'k', 'l', 'i', nan, 'z', 'w', 'o', nan, 'h', 'g', 't', nan, 'y', 'f', 'y', nan, 'j', 'b', 'i', nan, 's', 'w', 'z', nan, 'f', 'o', 's', nan, 'y', 'l', 'n', nan, 'p', 'l', 'p', nan, 'm', 'f', 'm', nan, 'b', 'l', 'c', nan, 't', 'b', 'g', nan] Length: 50, dtype: string, array([ 8., 2., nan, 2., 4., 8., 4., nan, 3., 6., 8., 7., nan, 1., 8., 0., 5., nan, 2., 6., 3., 5., nan, 1., 7., 4., 6., nan, 9., 4., 2., 6., nan, 9., 
           8., 6., 3., nan, 0., 1., 3., 3., nan, 5., 6., 8., 8., nan, 9., 3.]),
 ['g', 'h', 'v', nan, 'p', 's', 'e', 'b', 'h', nan, 'o', 'd', 'l', 'r', 'k', nan, 'z', 'r', 'k', 'e', 'i', nan, 'x', 'u', 'i', 'y', 'm', nan, 'c', 'c', 'f', 'w', 'r', nan, 'q', 'k', 'n', 'p', 'w', nan, 'x', 'p', 'v', 'm', 's', nan, 'n', 'f', 'c', 'b']
 Length: 50, dtype: string]
mapped     = . at 0xc9b7a710>
sort       = False
pandas/core/reshape/merge.py:2306: in
    _factorize_keys(index.levels[n]._values, join_keys[n], sort=sort)
.0         =
index      = MultiIndex([('z', 4.0, 'e'), ('m', 3.0, 'x'), (nan, 9.0, nan), ('m', 0.0, 'w'), ('w', nan, 'b'), ('t', 8.0, 'n'), ('s', 9.0, 'c'), (nan, 8.0, nan), (nan, 8.0, 's'), ('p', 8.0, 'q'), ('t', nan, 'l'), ('j', 2.0, 'k'), ('o', nan, nan), ('l', 2.0, nan), (nan, 6.0, nan), ('f', 1.0, nan), ('b', nan, 'v'), ('w', 1.0, 'u'), ('v', 8.0, 'g'), ('f', 6.0, 'm'), ('i', 4.0, 'p'), ('h', 8.0, 'o'), ('y', 8.0, 'k'), (nan, 3.0, 'p'), ('y', 5.0, 'z'), ('l', 5.0, 'm'), ('f', 0.0, nan), ('g', 9.0, 'c'), (nan, 1.0, 'r'), ('z', 7.0, 'i'), ('i', 3.0, 'i'), (nan, 4.0, 'y'), ('s', nan, 'x'), (nan, 3.0, 'b'), ('k', nan, 'v'), ('g', 7.0, 'd'), ('n', nan, 'r'), ('c', 6.0, 's'), ('b', nan, 'f'), (nan, nan, 'r'), ('b', 6.0, 'e'), (nan, nan, 'p'), (nan, 5.0, nan), (nan, 4.0, 'c'), ('l', 6.0, 'w'), ('y', 2.0, 'f'), ('l', 6.0, 'k'), ('p', 3.0, 'n'), ('o', 3.0, 'h'), (nan, 2.0, 'h')], names=['1st', '2nd', '3rd'])
join_keys  = [ ['v', nan, 'k', 'l', 'i', nan, 'z', 'w', 'o', nan, 'h', 'g', 't', nan, 'y', 'f', 'y', nan, 'j', 'b', 'i', nan, 's', 'w', 'z', nan, 'f', 'o', 's', nan, 'y', 'l', 'n', nan, 'p', 'l', 'p', nan, 'm', 'f', 'm', nan, 'b', 'l', 'c', nan, 't', 'b', 'g', nan] Length: 50, dtype: string, array([ 8., 2., nan, 2., 4., 8., 4., nan, 3., 6., 8., 7., nan, 1., 8., 0., 5., nan, 2., 6., 3., 5., nan, 1., 7., 4., 6., nan, 9., 4., 2., 6., nan, 9., 8., 6., 3., nan, 0., 1., 3., 3., nan, 5., 6., 8., 8., nan, 9., 3.]), ['g', 'h', 'v', nan, 'p', 's', 'e', 'b', 'h', nan, 'o', 'd', 'l', 'r', 'k', nan, 'z', 'r', 'k', 'e', 'i', nan, 'x', 'u', 'i', 'y', 'm', nan, 'c', 'c', 'f', 'w', 'r', nan, 'q', 'k', 'n', 'p', 'w', nan, 'x', 'p', 'v', 'm', 's', nan, 'n', 'f', 'c', 'b'] Length: 50, dtype: string]
n          = 0
sort       = False
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

lk         = [ [ "b", "c", "f", "g", "h", ... "t", "v", "w", "y", "z" ] ]
rk         = [ [ "v", null, "k", "l", "i", ... null, "t", "b", "g", null ] ]
sort       = False

    def _factorize_keys(
        lk: ArrayLike, rk: ArrayLike, sort: bool = True
    ) -> tuple[npt.NDArray[np.intp], npt.NDArray[np.intp], int]:
        """
        Encode left and right keys as enumerated types.

        This is used to get the join indexers to be used when merging DataFrames.

        Parameters
        ----------
        lk : ndarray, ExtensionArray
            Left key.
        rk : ndarray, ExtensionArray
            Right key.
        sort : bool, defaults to True
            If True, the encoding is done such that the unique elements in the
            keys are sorted.

        Returns
        -------
        np.ndarray[np.intp]
            Left (resp. right if called with `key='right'`) labels, as enumerated type.
        np.ndarray[np.intp]
            Right (resp. left if called with `key='right'`) labels, as enumerated type.
        int
            Number of unique elements in union of left and right labels.

        See Also
        --------
        merge : Merge DataFrame or named Series objects with a database-style join.
        algorithms.factorize : Encode the object as an enumerated type or
            categorical variable.

        Examples
        --------
        >>> lk = np.array(["a", "c", "b"])
        >>> rk = np.array(["a", "c"])

        Here, the unique values are `'a', 'b', 'c'`.

        With the default `sort=True`, the encoding will be `{0: 'a', 1: 'b', 2: 'c'}`:

        >>> pd.core.reshape.merge._factorize_keys(lk, rk)
        (array([0, 2, 1]), array([0, 2]), 3)

        With the `sort=False`, the encoding will correspond to the order in which
        the unique elements first appear: `{0: 'a', 1: 'c', 2: 'b'}`:

        >>> pd.core.reshape.merge._factorize_keys(lk, rk, sort=False)
        (array([0, 1, 2]), array([0, 1]), 3)
        """
        # TODO: if either is a RangeIndex, we can likely factorize more efficiently?

        if (
            isinstance(lk.dtype, DatetimeTZDtype) and isinstance(rk.dtype, DatetimeTZDtype)
        ) or (lib.is_np_dtype(lk.dtype, "M") and lib.is_np_dtype(rk.dtype, "M")):
            # Extract the ndarray (UTC-localized) values
            # Note: we dont need the dtypes to match, as these can still be compared
            lk, rk = cast("DatetimeArray", lk)._ensure_matching_resos(rk)
            lk = cast("DatetimeArray", lk)._ndarray
            rk = cast("DatetimeArray", rk)._ndarray

        elif (
            isinstance(lk.dtype, CategoricalDtype)
            and isinstance(rk.dtype, CategoricalDtype)
            and lk.dtype == rk.dtype
        ):
            assert isinstance(lk, Categorical)
            assert isinstance(rk, Categorical)
            # Cast rk to encoding so we can compare codes with lk
            rk = lk._encode_with_my_categories(rk)

            lk = ensure_int64(lk.codes)
            rk = ensure_int64(rk.codes)

        elif isinstance(lk, ExtensionArray) and lk.dtype == rk.dtype:
            if (isinstance(lk.dtype, ArrowDtype) and is_string_dtype(lk.dtype)) or (
                isinstance(lk.dtype, StringDtype)
                and lk.dtype.storage in ["pyarrow", "pyarrow_numpy"]
            ):
                import pyarrow as pa
                import pyarrow.compute as pc

                len_lk = len(lk)
                lk = lk._pa_array  # type: ignore[attr-defined]
                rk = rk._pa_array  # type: ignore[union-attr]
                dc = (
                    pa.chunked_array(lk.chunks + rk.chunks)  # type: ignore[union-attr]
                    .combine_chunks()
                    .dictionary_encode()
                )

                llab, rlab, count = (
                    pc.fill_null(dc.indices[slice(len_lk)], -1)
                    .to_numpy()
                    .astype(np.intp, copy=False),
                    pc.fill_null(dc.indices[slice(len_lk, None)], -1)
                    .to_numpy()
                    .astype(np.intp, copy=False),
                    len(dc.dictionary),
                )
                if sort:
                    uniques = dc.dictionary.to_numpy(zero_copy_only=False)
                    llab, rlab = _sort_labels(uniques, llab, rlab)

                if dc.null_count > 0:
                    lmask = llab == -1
                    lany = lmask.any()
                    rmask = rlab == -1
                    rany = rmask.any()
                    if lany:
                        np.putmask(llab, lmask, count)
                    if rany:
>                       np.putmask(rlab, rmask, count)
E                       ValueError: putmask: output array is read-only

count      = 19
dc         = -- dictionary: [ "b", "c", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "s", "t", "v", "w", "y", "z" ] -- indices: [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, ... 9, null, 0, 8, 1, null, 14, 0, 3, null ]
lany       = False
len_lk     = 19
lk         = [ [ "b", "c", "f", "g", "h", ... "t", "v", "w", "y", "z" ] ]
llab       = array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18])
lmask      = array([False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False])
pa         =
pc         =
rany       = True
rk         = [ [ "v", null, "k", "l", "i", ... null, "t", "b", "g", null ] ]
rlab       = array([15, -1, 7, 8, 5, -1, 18, 16, 11, -1, 4, 3, 14, -1, 17, 2, 17, -1, 6, 0, 5, -1, 13, 16, 18, -1, 2, 11, 13, -1, 17, 8, 10, -1, 12, 8, 12, -1, 9, 2, 9, -1, 0, 8, 1, -1, 14, 0, 3, -1])
rmask      = array([False, True, False, False, False, True, False, False, False, True, False, False, False, True, False, False, False, True, False, False, False, True, False, False, False, True, False, False, False, True, False, False, False, True, False, False, False, True, False, False, False, True, False, False, False, True, False, False, False, True])
sort       = False
pandas/core/reshape/merge.py:2514: ValueError
```
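For context on the `putmask` failure above: zero-copy conversions from Arrow buffers yield read-only NumPy arrays, and `astype(np.intp, copy=False)` only makes a copy when the dtype actually changes. On 64-bit builds the `int32` dictionary indices are copied to `int64` and become writable; on 32-bit x86, `np.intp` is already `int32`, so no copy happens and the read-only flag survives into `np.putmask`. A minimal NumPy-only sketch of the mechanism (the read-only flag is set by hand here to stand in for a zero-copy Arrow view, and the explicit `copy()` is just one possible fix, not the one pandas ships):

```python
import numpy as np

# Mimic dictionary indices obtained zero-copy from an Arrow buffer:
# such views are read-only.
codes = np.array([0, -1, 2], dtype=np.int32)
codes.flags.writeable = False

# astype(..., copy=False) copies only if the dtype differs. On 64-bit,
# int32 -> int64 forces a writable copy; on 32-bit, np.intp is int32,
# so this is a no-op and the array stays read-only.
out = codes.astype(np.intp, copy=False)

if not out.flags.writeable:
    # The 32-bit case: take a writable copy before mutating in place.
    out = out.copy()

# This line is what raises "putmask: output array is read-only" when
# `out` is still the read-only zero-copy view.
np.putmask(out, out == -1, 99)
```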

Full build & test log (2.5M .gz, 52M uncompressed): pandas.txt.gz

This is on a Gentoo/x86 systemd-nspawn container. I'm building with `-O2 -march=pentium-m -mfpmath=sse -pipe` to rule out i387-specific precision issues.

I've also filed https://github.com/apache/arrow/issues/40153 for test failures in pyarrow itself. Some of them could possibly be bugs in pandas instead.

Expected Behavior

Tests passing ;-).

Installed Versions

INSTALLED VERSIONS
------------------
commit                : fd3f57170aa1af588ba877e8e28c158a20a4886d
python                : 3.11.7.final.0
python-bits           : 32
OS                    : Linux
OS-release            : 6.7.5-gentoo-dist
Version               : #1 SMP PREEMPT_DYNAMIC Sat Feb 17 07:30:27 -00 2024
machine               : x86_64
processor             : AMD Ryzen 5 3600 6-Core Processor
byteorder             : little
LC_ALL                : None
LANG                  : C.UTF8
LOCALE                : en_US.UTF-8
pandas                : 2.2.0
numpy                 : 1.26.4
pytz                  : 2024.1
dateutil              : 2.8.2
setuptools            : 69.0.3
pip                   : None
Cython                : 3.0.5
pytest                : 7.4.4
hypothesis            : 6.98.3
sphinx                : None
blosc                 : None
feather               : None
xlsxwriter            : 3.2.0
lxml.etree            : 4.9.4
html5lib              : 1.1
pymysql               : 1.4.6
psycopg2              : None
jinja2                : 3.1.3
IPython               : None
pandas_datareader     : None
adbc-driver-postgresql: None
adbc-driver-sqlite    : None
bs4                   : 4.12.3
bottleneck            : 1.3.7
dataframe-api-compat  : None
fastparquet           : None
fsspec                : None
gcsfs                 : None
matplotlib            : 3.8.3
numba                 : None
numexpr               : 2.9.0
odfpy                 : None
openpyxl              : 3.1.2
pandas_gbq            : None
pyarrow               : 15.0.0
pyreadstat            : None
python-calamine       : None
pyxlsb                : None
s3fs                  : None
scipy                 : None
sqlalchemy            : 2.0.27
tables                : 3.9.2
tabulate              : 0.9.0
xarray                : 2024.2.0
xlrd                  : 2.0.1
zstandard             : None
tzdata                : None
qtpy                  : None
pyqt5                 : None
lithomas1 commented 8 months ago

The first error looks like we need to use the platform-specific int type instead of `int64` (haven't looked too closely, though).
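A quick way to see why the hard-coded `np.int64` expectation breaks: pandas returns indexer arrays as `np.intp`, which is 32-bit on 32-bit x86 and 64-bit elsewhere. A sketch of the portable expectation (the arrays here just mirror the shape of the failing `get_indexer` test, they are not the pandas test code itself):

```python
import numpy as np

# np.intp tracks the platform pointer size: int32 on 32-bit x86, int64 on
# 64-bit builds. Indexers in pandas use np.intp, so an expected array
# hard-coded with dtype=np.int64 has the wrong dtype on 32-bit.
result = np.array([0, -1], dtype=np.intp)    # what get_indexer-style APIs return
expected = np.array([0, -1], dtype=np.intp)  # portable: matches on any platform

assert result.dtype == expected.dtype
assert np.dtype(np.intp).itemsize in (4, 8)  # 4 bytes on 32-bit, 8 on 64-bit
```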

The rest seem like they could potentially be related to bugs in Arrow.

mgorny commented 8 months ago

I already have a fix for one bug in Arrow, so I'm going to retry with it applied later today.

mgorny commented 8 months ago

Yeah, my patch fixed `pandas/tests/interchange/test_impl.py::test_large_string_pyarrow`. I'll check if I can figure out a fix for the other three when I find some time.