rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[BUG] Merge fails on 24.10 nightly with Failed to generate metadata for RenameAxis(frame=Merge(...), index=None). #16892

Closed praateekmahajan closed 1 month ago

praateekmahajan commented 1 month ago

Describe the bug

When performing a merge where left._meta.index_name != right._meta.index_name, the behavior in dask-expr has changed (see https://github.com/dask/dask-expr/pull/1121/files).

This raises a RuntimeError: Failed to generate metadata for RenameAxis(frame=Merge(75f6fd3), index=None). This operation may not be supported by the current backend. (The full stack trace and a debug checkpoint screenshot at dask_expr/_collection.py are below.)
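For reference, the expression named in the error is RenameAxis(..., index=None). The following is a minimal pandas-only sketch of what the equivalent call does on a frame with a named index. The frame contents and the index name "global_id" are made up for illustration; this is not the crossfit data and not dask-expr's own code.

```python
import pandas as pd

# Made-up frame with a named index, standing in for merged metadata.
# The index name "global_id" is hypothetical.
meta = pd.DataFrame(
    {"query-id": ["q1"], "score": [1]},
    index=pd.Index([0], name="global_id"),
)

# pandas implements rename_axis; passing index=None clears the index name
# and leaves the columns untouched.
renamed = meta.rename_axis(index=None)

print(meta.index.name)
print(renamed.index.name)
```

The traceback in this report shows the same method being looked up on a cudf DataFrame, where the lookup ends in an AttributeError.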

Steps/Code to reproduce bug

We ran into this in crossfit when running our pytests. Here is a repro that uses two methods from the crossfit library (namely sample_raw and reset_global_index). I imagine a simpler reproducer is possible, but this is what I was able to get in the time available.

import dask_cudf
from crossfit.dataset.beir.raw import sample_raw
from crossfit.dataset.beir.load import reset_global_index
import os

dataset_name = "nq"
out_dir = None
blocksize = 2**30
raw_path = sample_raw(dataset_name, out_dir=out_dir, overwrite=False)

qrels_files = [
    f for f in os.listdir(os.path.join(raw_path, "qrels")) if f.endswith(".tsv")
]
qrels_file = qrels_files[0]

qrels_dtypes = {"query-id": "str", "corpus-id": "str", "score": "int32"}

queries_ddf = dask_cudf.read_json(
    os.path.join(raw_path, "queries.jsonl"),
    lines=True,
    blocksize=blocksize,
    dtype={"_id": "string", "text": "string"},
)[["_id", "text"]]
# Without this reset_global_index call, the merge below works fine.
queries_ddf = reset_global_index(queries_ddf)

qrels_ddf = dask_cudf.read_csv(
    os.path.join(raw_path, "qrels", qrels_file),
    sep="\t",
    dtype=qrels_dtypes,
)[["query-id", "corpus-id", "score"]]

qrels_ddf.merge(
    queries_ddf,
    left_on="query-id",
    right_on="_id",
    how="left",
)

print("Success")

Expected behavior

Before the 24.10 nightly, the merge worked as expected.

Installed crossfit using pip (i.e., cudf etc. were pip-installed):

cudf-cu12==24.10.0a373
dask==2024.9.0
dask-cuda==24.10.0a22
dask-cudf-cu12==24.10.0a373
dask-expr==1.1.14
libcudf-cu12==24.10.0a373
pylibcudf-cu12==24.10.0a373
raft-dask-cu12==24.10.0a38
rapids-dask-dependency==24.10.0a8
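A list like the one above can be collected with the standard library alone. This is a small sketch, assuming the distribution names match the pip names shown; the helper name installed_versions is hypothetical.

```python
from importlib import metadata

# Distribution names copied from the package list in this report.
PACKAGES = [
    "cudf-cu12",
    "dask",
    "dask-cuda",
    "dask-cudf-cu12",
    "dask-expr",
    "libcudf-cu12",
    "pylibcudf-cu12",
    "raft-dask-cu12",
    "rapids-dask-dependency",
]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name}=={version}" if version else f"{name}: not installed")
```

Running it in the affected environment prints one line per package, which makes it easier to compare against a working setup.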


Additional context

Traceback (most recent call last):
  File "/datasets/praateekm/env_setup/micromamba/envs/crossfit_2410/lib/python3.10/site-packages/cudf/utils/utils.py", line 228, in __getattr__
    return self[key]
  File "/datasets/praateekm/env_setup/micromamba/envs/crossfit_2410/lib/python3.10/site-packages/cudf/utils/performance_tracking.py", line 51, in wrapper
    return func(*args, **kwargs)
  File "/datasets/praateekm/env_setup/micromamba/envs/crossfit_2410/lib/python3.10/site-packages/cudf/core/dataframe.py", line 1347, in __getitem__
    out = self._get_columns_by_label(arg)
  File "/datasets/praateekm/env_setup/micromamba/envs/crossfit_2410/lib/python3.10/site-packages/cudf/utils/performance_tracking.py", line 51, in wrapper
    return func(*args, **kwargs)
  File "/datasets/praateekm/env_setup/micromamba/envs/crossfit_2410/lib/python3.10/site-packages/cudf/core/frame.py", line 358, in _get_columns_by_label
    return self._from_data_like_self(self._data.select_by_label(labels))
  File "/datasets/praateekm/env_setup/micromamba/envs/crossfit_2410/lib/python3.10/site-packages/cudf/core/column_accessor.py", line 401, in select_by_label
    return self._select_by_label_grouped(key)
  File "/datasets/praateekm/env_setup/micromamba/envs/crossfit_2410/lib/python3.10/site-packages/cudf/core/column_accessor.py", line 563, in _select_by_label_grouped
    result = self._grouped_data[key]
KeyError: 'rename_axis'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/datasets/praateekm/env_setup/micromamba/envs/crossfit_2410/lib/python3.10/site-packages/dask_expr/_core.py", line 470, in __getattr__
    return object.__getattribute__(self, key)
  File "/datasets/praateekm/env_setup/micromamba/envs/crossfit_2410/lib/python3.10/functools.py", line 981, in __get__
    val = self.func(instance)
  File "/datasets/praateekm/env_setup/micromamba/envs/crossfit_2410/lib/python3.10/site-packages/dask_expr/_expr.py", line 496, in _meta
    return self.operation(*args, **self._kwargs)
  File "/datasets/praateekm/env_setup/micromamba/envs/crossfit_2410/lib/python3.10/site-packages/dask/utils.py", line 1241, in __call__
    return getattr(__obj, self.method)(*args, **kwargs)
  File "/datasets/praateekm/env_setup/micromamba/envs/crossfit_2410/lib/python3.10/site-packages/cudf/utils/utils.py", line 230, in __getattr__
    raise AttributeError(
AttributeError: DataFrame object has no attribute rename_axis

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/datasets/praateekm/env_setup/micromamba/envs/crossfit_2410/lib/python3.10/site-packages/dask_expr/_collection.py", line 4799, in new_collection
    meta = expr._meta
  File "/datasets/praateekm/env_setup/micromamba/envs/crossfit_2410/lib/python3.10/site-packages/dask_expr/_core.py", line 475, in __getattr__
    raise RuntimeError(
RuntimeError: Failed to generate metadata for RenameAxis(frame=Merge(75f6fd3), index=None). This operation may not be supported by the current backend.
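The three stacked exceptions above follow a common pattern: a column lookup fails with KeyError, the attribute fallback turns that into AttributeError, and the caller wraps it in RuntimeError. A stdlib-only sketch of that chain is below. The class FrameWithoutRenameAxis and the helper generate_meta are hypothetical stand-ins, not the cudf or dask-expr implementations.

```python
class FrameWithoutRenameAxis:
    """Hypothetical stand-in for a backend DataFrame lacking rename_axis."""

    def __init__(self, columns):
        self._columns = dict(columns)

    def __getattr__(self, key):
        # Only reached when normal attribute lookup fails; fall back to
        # treating the attribute name as a column label.
        try:
            return self._columns[key]
        except KeyError:
            raise AttributeError(f"DataFrame object has no attribute {key}")


def generate_meta(frame, method, **kwargs):
    """Hypothetical helper: call `method` on `frame`, wrapping AttributeError."""
    try:
        return getattr(frame, method)(**kwargs)
    except AttributeError:
        raise RuntimeError(
            f"Failed to generate metadata for {method}. "
            "This operation may not be supported by the current backend."
        )


frame = FrameWithoutRenameAxis({"query-id": ["q1"]})
try:
    generate_meta(frame, "rename_axis", index=None)
except RuntimeError as err:
    # Walk the implicit exception chain from outermost to innermost.
    chain = [type(err).__name__]
    while err.__context__ is not None:
        err = err.__context__
        chain.append(type(err).__name__)
    print(" <- ".join(chain))
```

Read this way, the innermost KeyError: 'rename_axis' is the most informative line: the method name was treated as a column label because the frame has no such method.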

[Screenshot: debug checkpoint at dask_expr/_collection.py]

rjzamora commented 1 month ago

Thanks for raising this issue @praateekmahajan - Hopefully this will be resolved by https://github.com/rapidsai/cudf/pull/16899