[BUG] dask_cudf pivot_table function is broken: TypeError: StringIndex object is not iterable.

pvnick commented 8 months ago

Describe the bug

pivot_table fails on a dask_cudf DataFrame because the cuDF StringIndex holding the categories does not support iteration:

Steps/Code to reproduce bug

import cudf
import dask_cudf

ddf = dask_cudf.from_cudf(cudf.DataFrame(
    data={
        "A": ["foo", "bar", "bar"],
        "B": ["one", "two", "one"],
        "C": [1, 2, 3]
    }
), npartitions=1)
ddf = ddf.categorize("B")
ddf.pivot_table(index="A", columns="B", values="C")

Error:

TypeError                                 Traceback (most recent call last)
Cell In[3], line 9
      1 ddf = dask_cudf.from_cudf(cudf.DataFrame(
      2     data={
      3         "A": ["foo", "bar", "bar"],
   (...)
      6     }
      7 ), npartitions=1)
      8 ddf = ddf.categorize("B")
----> 9 ddf.pivot_table(index="A", columns="B", values="C")

File lib/python3.10/site-packages/dask/dataframe/core.py:6373, in DataFrame.pivot_table(self, index, columns, values, aggfunc)
   6352 """
   6353 Create a spreadsheet-style pivot table as a DataFrame. Target ``columns``
   6354 must have category dtype to infer result's ``columns``.
   (...)
   6369 table : DataFrame
   6370 """
   6371 from dask.dataframe.reshape import pivot_table
-> 6373 return pivot_table(
   6374     self, index=index, columns=columns, values=values, aggfunc=aggfunc
   6375 )

File lib/python3.10/site-packages/dask/dataframe/reshape.py:233, in pivot_table(df, index, columns, values, aggfunc)
    226     raise ValueError(
    227         "aggfunc must be either " + ", ".join(f"'{x}'" for x in available_aggfuncs)
    228     )
    230 # _emulate can't work for empty data
    231 # the result must have CategoricalIndex columns
--> 233 columns_contents = pd.CategoricalIndex(df[columns].cat.categories, name=columns)
    234 if is_scalar(values):
    235     new_columns = columns_contents

File lib/python3.10/site-packages/pandas/core/indexes/category.py:234, in CategoricalIndex.__new__(cls, data, categories, ordered, dtype, copy, name)
    231 if is_scalar(data):
    232     raise cls._scalar_data_error(data)
--> 234 data = Categorical(
    235     data, categories=categories, ordered=ordered, dtype=dtype, copy=copy
    236 )
    238 return cls._simple_new(data, name=name)

File lib/python3.10/site-packages/pandas/core/arrays/categorical.py:410, in Categorical.__init__(self, values, categories, ordered, dtype, fastpath, copy)
    408         dtype = CategoricalDtype(values.categories, dtype.ordered)
    409 elif not isinstance(values, (ABCIndex, ABCSeries, ExtensionArray)):
--> 410     values = com.convert_to_list_like(values)
    411     if isinstance(values, list) and len(values) == 0:
    412         # By convention, empty lists result in object dtype:
    413         values = np.array([], dtype=object)

File lib/python3.10/site-packages/pandas/core/common.py:541, in convert_to_list_like(values)
    539     return values
    540 elif isinstance(values, abc.Iterable) and not isinstance(values, str):
--> 541     return list(values)
    543 return [values]

File lib/python3.10/site-packages/cudf/utils/utils.py:242, in NotIterable.__iter__(self)
    235 def __iter__(self):
    236     """
    237     Iteration is unsupported.
    238 
    239     See :ref:`iteration <pandas-comparison/iteration>` for more
    240     information.
    241     """
--> 242     raise TypeError(
    243         f"{self.__class__.__name__} object is not iterable. "
    244         f"Consider using `.to_arrow()`, `.to_pandas()` or `.values_host` "
    245         f"if you wish to iterate over the values."
    246     )

TypeError: StringIndex object is not iterable. Consider using `.to_arrow()`, `.to_pandas()` or `.values_host` if you wish to iterate over the values.
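The last three frames of the traceback explain the failure: pandas sees that the object defines `__iter__`, treats it as iterable, and calls `list(values)`, at which point the cuDF `__iter__` raises. The sketch below reproduces that interaction in plain Python with no GPU required. `FakeStringIndex` is a hypothetical stand-in, and both `NotIterable` and `convert_to_list_like` are simplified from the lines visible in the traceback rather than copied from the libraries.

```python
from collections import abc


class NotIterable:
    """Simplified stand-in for the cuDF mixin shown in the traceback."""

    def __iter__(self):
        raise TypeError(f"{self.__class__.__name__} object is not iterable.")


class FakeStringIndex(NotIterable):
    """Hypothetical host-only stand-in for a GPU-backed index."""

    def __init__(self, values):
        self._values = list(values)


def convert_to_list_like(values):
    """Simplified form of the pandas branch visible in the traceback."""
    if isinstance(values, abc.Iterable) and not isinstance(values, str):
        return list(values)
    return [values]


idx = FakeStringIndex(["one", "two"])

# Defining __iter__ is enough for the abc.Iterable check to pass,
# even though calling it always raises.
is_iterable = isinstance(idx, abc.Iterable)

try:
    convert_to_list_like(idx)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(is_iterable, error_message)
```

In other words, the `isinstance` check cannot tell "iterable" apart from "has an `__iter__` that refuses to run", so any pandas code path that relies on that check will hit the same error.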

Expected behavior

pivot_table succeeds as documented.

Environment overview (please complete the following information)

Installed cuDF using pip, using the stable release:

pip install \
    --extra-index-url=https://pypi.nvidia.com \
    cudf-cu12==23.12.* dask-cudf-cu12==23.12.* cuml-cu12==23.12.* \
    cugraph-cu12==23.12.* cuspatial-cu12==23.12.* cuproj-cu12==23.12.* \
    cuxfilter-cu12==23.12.* cucim-cu12==23.12.* pylibraft-cu12==23.12.* \
    raft-dask-cu12==23.12.*

Environment details

<details><summary>Click here to see environment details</summary><pre>

     **git***
fatal: your current branch 'master' does not have any commits yet
     **git submodules***

     ***OS Information***
     NAME="Red Hat Enterprise Linux"
     VERSION="8.8 (Ootpa)"
     ID="rhel"
     ID_LIKE="fedora"
     VERSION_ID="8.8"
     PLATFORM_ID="platform:el8"
     PRETTY_NAME="Red Hat Enterprise Linux 8.8 (Ootpa)"
     ANSI_COLOR="0;31"
     CPE_NAME="cpe:/o:redhat:enterprise_linux:8::baseos"
     HOME_URL="https://www.redhat.com/"
     DOCUMENTATION_URL="https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8"
     BUG_REPORT_URL="https://bugzilla.redhat.com/"

     REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
     REDHAT_BUGZILLA_PRODUCT_VERSION=8.8
     REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
     REDHAT_SUPPORT_PRODUCT_VERSION="8.8"
     Red Hat Enterprise Linux release 8.8 (Ootpa)
     Red Hat Enterprise Linux release 8.8 (Ootpa)
     Linux c1000a-s23.ufhpc 4.18.0-477.27.1.el8_8.x86_64 #1 SMP Thu Aug 31 10:29:22 EDT 2023 x86_64 x86_64 x86_64 GNU/Linux

     ***GPU Information***
     Tue Jan 30 11:09:21 2024
     +---------------------------------------------------------------------------------------+
     | NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
     |-----------------------------------------+----------------------+----------------------+
     | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
     | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
     |                                         |                      |               MIG M. |
     |=========================================+======================+======================|
     |   0  NVIDIA A100-SXM4-80GB          On  | 00000000:07:00.0 Off |                    0 |
     | N/A   25C    P0              56W / 400W |      4MiB / 81920MiB |      0%      Default |
     |                                         |                      |             Disabled |
     +-----------------------------------------+----------------------+----------------------+
     |   1  NVIDIA A100-SXM4-80GB          On  | 00000000:0F:00.0 Off |                    0 |
     | N/A   26C    P0              57W / 400W |      4MiB / 81920MiB |      0%      Default |
     |                                         |                      |             Disabled |
     +-----------------------------------------+----------------------+----------------------+
     |   2  NVIDIA A100-SXM4-80GB          On  | 00000000:47:00.0 Off |                    0 |
     | N/A   24C    P0              54W / 400W |      4MiB / 81920MiB |      0%      Default |
     |                                         |                      |             Disabled |
     +-----------------------------------------+----------------------+----------------------+
     |   3  NVIDIA A100-SXM4-80GB          On  | 00000000:4E:00.0 Off |                    0 |
     | N/A   24C    P0              56W / 400W |      4MiB / 81920MiB |      0%      Default |
     |                                         |                      |             Disabled |
     +-----------------------------------------+----------------------+----------------------+
     |   4  NVIDIA A100-SXM4-80GB          On  | 00000000:87:00.0 Off |                    0 |
     | N/A   29C    P0              67W / 400W |    583MiB / 81920MiB |     40%      Default |
     |                                         |                      |             Disabled |
     +-----------------------------------------+----------------------+----------------------+
     |   5  NVIDIA A100-SXM4-80GB          On  | 00000000:90:00.0 Off |                    0 |
     | N/A   45C    P0             177W / 400W |    775MiB / 81920MiB |     94%      Default |
     |                                         |                      |             Disabled |
     +-----------------------------------------+----------------------+----------------------+
     |   6  NVIDIA A100-SXM4-80GB          On  | 00000000:B7:00.0 Off |                    0 |
     | N/A   60C    P0             338W / 400W |  76523MiB / 81920MiB |    100%      Default |
     |                                         |                      |             Disabled |
     +-----------------------------------------+----------------------+----------------------+
     |   7  NVIDIA A100-SXM4-80GB          On  | 00000000:BD:00.0 Off |                    0 |
     | N/A   28C    P0              54W / 400W |      4MiB / 81920MiB |      0%      Default |
     |                                         |                      |             Disabled |
     +-----------------------------------------+----------------------+----------------------+

     +---------------------------------------------------------------------------------------+
     | Processes:                                                                            |
     |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
     |        ID   ID                                                             Usage      |
     |=======================================================================================|
     |    4   N/A  N/A   2669759      C   python3                                     570MiB |
     |    5   N/A  N/A   1903237      C   pmemd.cuda_SPFP                             762MiB |
     |    6   N/A  N/A   1446394      C   python                                    76510MiB |
     +---------------------------------------------------------------------------------------+

     ***CPU***
     Architecture:        x86_64
     CPU op-mode(s):      32-bit, 64-bit
     Byte Order:          Little Endian
     CPU(s):              128
     On-line CPU(s) list: 0-127
     Thread(s) per core:  1
     Core(s) per socket:  64
     Socket(s):           2
     NUMA node(s):        8
     Vendor ID:           AuthenticAMD
     CPU family:          23
     Model:               49
     Model name:          AMD EPYC 7742 64-Core Processor
     Stepping:            0
     CPU MHz:             3386.055
     CPU max MHz:         2250.0000
     CPU min MHz:         1500.0000
     BogoMIPS:            4491.84
     Virtualization:      AMD-V
     L1d cache:           32K
     L1i cache:           32K
     L2 cache:            512K
     L3 cache:            16384K
     NUMA node0 CPU(s):   0-15
     NUMA node1 CPU(s):   16-31
     NUMA node2 CPU(s):   32-47
     NUMA node3 CPU(s):   48-63
     NUMA node4 CPU(s):   64-79
     NUMA node5 CPU(s):   80-95
     NUMA node6 CPU(s):   96-111
     NUMA node7 CPU(s):   112-127
     Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sme sev sev_es

     ***CMake***
     /apps/jupyter/6.5.4/bin/cmake
./print_env.sh: /apps/jupyter/6.5.4/bin/cmake: /apps/jupyter/6.5.4/bin/python3.11: bad interpreter: No such file or directory

     ***g++***
     /usr/bin/g++
     g++ (GCC) 8.5.0 20210514 (Red Hat 8.5.0-18)
     Copyright (C) 2018 Free Software Foundation, Inc.
     This is free software; see the source for copying conditions.  There is NO
     warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

     ***nvcc***
     /apps/compilers/cuda/12.2.2/bin/nvcc
     nvcc: NVIDIA (R) Cuda compiler driver
     Copyright (c) 2005-2023 NVIDIA Corporation
     Built on Tue_Aug_15_22:02:13_PDT_2023
     Cuda compilation tools, release 12.2, V12.2.140
     Build cuda_12.2.r12.2/compiler.33191640_0

     ***Python***
     /blue/ptighe-rapidsai/pvnick/rapids-test/rapids-test/bin/python
     Python 3.10.12

     ***Environment Variables***
     PATH                            : /apps/compilers/cuda/12.2.2/bin:/blue/ptighe-rapidsai/pvnick/rapids-test/rapids-test/bin:/opt/slurm/bin:/usr/local/cuda/bin:/opt/bin:/apps/jupyter/6.5.4/bin:/apps/ufrc/ufhpc/bin:/apps/git/2.30.1/bin:/home/pvnick/.local/bin:/home/pvnick/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/puppetlabs/bin:/bin
     LD_LIBRARY_PATH                 : /apps/compilers/cuda/12.2.2/lib64:/opt/slurm/lib64::
     NUMBAPRO_NVVM                   :
     NUMBAPRO_LIBDEVICE              :
     CONDA_PREFIX                    :
     PYTHON_PATH                     :

     conda not found
     ***pip packages***
     /blue/ptighe-rapidsai/pvnick/rapids-test/rapids-test/bin/pip
     Package                   Version
     ------------------------- ---------------
     aiohttp                   3.9.3
     aiosignal                 1.3.1
     anyio                     4.2.0
     argon2-cffi               23.1.0
     argon2-cffi-bindings      21.2.0
     arrow                     1.3.0
     asttokens                 2.4.1
     async-lru                 2.0.4
     async-timeout             4.0.3
     attrs                     23.2.0
     Babel                     2.14.0
     beautifulsoup4            4.12.3
     bleach                    6.1.0
     bokeh                     3.3.4
     cachetools                5.3.2
     certifi                   2023.11.17
     cffi                      1.16.0
     charset-normalizer        3.3.2
     click                     8.1.7
     click-plugins             1.1.1
     cligj                     0.7.2
     cloudpickle               3.0.0
     colorcet                  3.0.1
     comm                      0.2.1
     contourpy                 1.2.0
     cucim-cu12                23.12.1
     cuda-python               12.3.0
     cudf-cu12                 23.12.1
     cugraph-cu12              23.12.0
     cuml-cu12                 23.12.0
     cuproj-cu12               23.12.1
     cupy-cuda12x              13.0.0
     cuspatial-cu12            23.12.1
     cuxfilter-cu12            23.12.0
     dask                      2023.11.0
     dask-cuda                 23.12.0
     dask-cudf-cu12            23.12.0
     datashader                0.16.0
     debugpy                   1.8.0
     decorator                 5.1.1
     defusedxml                0.7.1
     distributed               2023.11.0
     exceptiongroup            1.2.0
     executing                 2.0.1
     fastjsonschema            2.19.1
     fastrlock                 0.8.2
     fiona                     1.9.5
     fqdn                      1.5.1
     frozenlist                1.4.1
     fsspec                    2023.12.2
     geopandas                 0.14.2
     holoviews                 1.18.1
     idna                      3.6
     imageio                   2.33.1
     importlib-metadata        7.0.1
     ipykernel                 6.29.0
     ipython                   8.20.0
     ipywidgets                8.1.1
     isoduration               20.11.0
     jedi                      0.19.1
     Jinja2                    3.1.3
     joblib                    1.3.2
     json5                     0.9.14
     jsonpointer               2.4
     jsonschema                4.21.1
     jsonschema-specifications 2023.12.1
     jupyter                   1.0.0
     jupyter_client            8.6.0
     jupyter-console           6.6.3
     jupyter_core              5.7.1
     jupyter-events            0.9.0
     jupyter-lsp               2.2.2
     jupyter_server            2.12.5
     jupyter_server_proxy      4.1.0
     jupyter_server_terminals  0.5.2
     jupyterlab                4.0.11
     jupyterlab_pygments       0.3.0
     jupyterlab_server         2.25.2
     jupyterlab-widgets        3.0.9
     lazy_loader               0.3
     linkify-it-py             2.0.2
     llvmlite                  0.40.1
     locket                    1.0.0
     Markdown                  3.5.2
     markdown-it-py            3.0.0
     MarkupSafe                2.1.4
     matplotlib-inline         0.1.6
     mdit-py-plugins           0.4.0
     mdurl                     0.1.2
     mistune                   3.0.2
     msgpack                   1.0.7
     multidict                 6.0.4
     multipledispatch          1.0.0
     nbclient                  0.9.0
     nbconvert                 7.14.2
     nbformat                  5.9.2
     nest-asyncio              1.6.0
     networkx                  3.2.1
     notebook                  7.0.7
     notebook_shim             0.2.3
     numba                     0.57.1
     numpy                     1.24.4
     nvtx                      0.2.8
     overrides                 7.7.0
     packaging                 23.2
     pandas                    1.5.3
     pandocfilters             1.5.1
     panel                     1.3.8
     param                     2.0.2
     parso                     0.8.3
     partd                     1.4.1
     pexpect                   4.9.0
     pillow                    10.2.0
     pip                       23.0.1
     platformdirs              4.1.0
     prometheus-client         0.19.0
     prompt-toolkit            3.0.43
     protobuf                  4.25.2
     psutil                    5.9.8
     ptyprocess                0.7.0
     pure-eval                 0.2.2
     pyarrow                   14.0.2
     pycparser                 2.21
     pyct                      0.5.0
     Pygments                  2.17.2
     pylibcugraph-cu12         23.12.0
     pylibraft-cu12            23.12.0
     pynvml                    11.4.1
     pyproj                    3.6.1
     python-dateutil           2.8.2
     python-json-logger        2.0.7
     pytz                      2023.4
     pyviz_comms               3.0.1
     PyWavelets                1.5.0
     PyYAML                    6.0.1
     pyzmq                     25.1.2
     qtconsole                 5.5.1
     QtPy                      2.4.1
     raft-dask-cu12            23.12.0
     rapids-dask-dependency    23.12.1
     referencing               0.33.0
     requests                  2.31.0
     rfc3339-validator         0.1.4
     rfc3986-validator         0.1.1
     rich                      13.7.0
     rmm-cu12                  23.12.0
     rpds-py                   0.17.1
     scikit-image              0.21.0
     scipy                     1.12.0
     Send2Trash                1.8.2
     setuptools                65.5.0
     shapely                   2.0.2
     simpervisor               1.0.0
     six                       1.16.0
     sniffio                   1.3.0
     sortedcontainers          2.4.0
     soupsieve                 2.5
     stack-data                0.6.3
     tblib                     3.0.0
     terminado                 0.18.0
     tifffile                  2024.1.30
     tinycss2                  1.2.1
     tomli                     2.0.1
     toolz                     0.12.1
     tornado                   6.4
     tqdm                      4.66.1
     traitlets                 5.14.1
     treelite                  3.9.1
     treelite-runtime          3.9.1
     types-python-dateutil     2.8.19.20240106
     typing_extensions         4.9.0
     uc-micro-py               1.0.2
     ucx-py-cu12               0.35.0
     uri-template              1.3.0
     urllib3                   2.1.0
     wcwidth                   0.2.13
     webcolors                 1.13
     webencodings              0.5.1
     websocket-client          1.7.0
     widgetsnbextension        4.0.9
     xarray                    2024.1.1
     xyzservices               2023.10.1
     yarl                      1.9.4
     zict                      3.0.0
     zipp                      3.17.0

[notice] A new release of pip is available: 23.0.1 -> 23.3.2
[notice] To update, run: pip install --upgrade pip

</pre></details>

beckernick commented 7 months ago

Hi @pvnick, thanks for the report. We'll investigate and follow up on this issue.

bdice commented 6 months ago

For context, the decision to disallow iteration over GPU objects is intentional -- it keeps users from accidentally triggering many host-device transfers (e.g. in a for loop) that are highly inefficient. This is problematic in some cases when column names are part of an object on the GPU that needs to be iterated over. The solution to this will likely require some code change in dask-cudf to convert the StringIndex into a type that is supported on the host.

wence- commented 6 months ago

While it is inefficient to iterate row-wise over the dataframe, it's pretty difficult to adapt all of dask-dataframe to do something different based on cudf/pandas. Note we can't really do this in dask-cudf without monkey-patching and/or reimplementing dask.dataframe.pivot_table.

I'm not sure the iteration is that inefficient if we implemented it as follows (for a StringIndex):

def __iter__(self):
    return iter(self.to_pandas())

That way there's only one device-to-host copy.
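To make the copy-count argument concrete, here is a minimal sketch that needs no GPU. `FakeDeviceIndex` is a hypothetical class whose `copies` counter stands in for device-to-host transfers; its `to_pandas` and `element_to_host` methods only simulate a transfer and do not call cuDF. It compares the proposed `__iter__` (one bulk copy) with fetching each element separately.

```python
class FakeDeviceIndex:
    """Hypothetical stand-in for a GPU-backed index that counts simulated copies."""

    def __init__(self, values):
        self._device_values = list(values)
        self.copies = 0  # number of simulated device-to-host transfers

    def __len__(self):
        return len(self._device_values)

    def to_pandas(self):
        # Simulates one bulk device-to-host transfer of the whole index.
        self.copies += 1
        return list(self._device_values)

    def element_to_host(self, i):
        # Simulates transferring a single element, as a naive row-wise loop would.
        self.copies += 1
        return self._device_values[i]

    def __iter__(self):
        # The proposal from the comment above: copy once, then iterate on the host.
        return iter(self.to_pandas())


bulk = FakeDeviceIndex(["one", "two", "three"])
bulk_values = list(bulk)  # goes through __iter__, so a single bulk copy

naive = FakeDeviceIndex(["one", "two", "three"])
naive_values = [naive.element_to_host(i) for i in range(len(naive))]

print(bulk.copies, naive.copies)
```

Both approaches yield the same values, but the bulk version's transfer count stays at one regardless of length, while the per-element version grows with the number of entries.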

vyasr commented 4 months ago

I am leaning towards the same view as Lawrence here. We've had these disabled code paths for a long time, and while I understand the rationale, I think at this point I'm OK with relaxing this behavior. Especially in light of cudf.pandas or dask integration, disabling a code path in a way that breaks those integrations seems less favorable than it may once have.

bdice commented 4 months ago

I’m okay with that proposal. My comments above were primarily to establish historical context — I am alright with changing the behavior to solve compatibility issues.

rapidsai / cudf

[BUG] dask_cudf pivot_table function is broken: TypeError: StringIndex object is not iterable. #14935