rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[BUG] StringMethods - Jaccard-index fails with long strings #16157

Open ayushdg opened 1 week ago

ayushdg commented 1 week ago

Describe the bug

Calling jaccard_index on long strings raises OverflowError: CUDF failure at: /opt/conda/conda-bld/work/cpp/include/cudf/detail/sizes_to_offsets_iterator.cuh:323: Size of output exceeds the column size limit

Steps/Code to reproduce bug

import cudf
import numpy as np

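# Each string is about 1/10 of the int32 maximum in length, so 11 copies
# push the combined size past the int32 (size_type) limit.
test_string = "a" * (np.iinfo(np.int32).max // 10)
df = cudf.Series([test_string] * 11)
res = df.str.jaccard_index(input=df, width=5)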

Results in:

File /opt/conda/envs/rapids/lib/python3.10/site-packages/cudf/core/column/string.py:5378, in StringMethods.jaccard_index(self, input, width)
   5353 def jaccard_index(self, input: cudf.Series, width: int) -> SeriesOrIndex:
   5354     """
   5355     Compute the Jaccard index between this column and the given
   5356     input strings column.
   (...)
   5374     dtype: float32
   5375     """
   5377     return self._return_or_inplace(
-> 5378         libstrings.jaccard_index(self._column, input._column, width),
   5379     )

File /opt/conda/envs/rapids/lib/python3.10/contextlib.py:79, in ContextDecorator.__call__.<locals>.inner(*args, **kwds)
     76 @wraps(func)
     77 def inner(*args, **kwds):
     78     with self._recreate_cm():
---> 79         return func(*args, **kwds)

File jaccard.pyx:26, in cudf._lib.nvtext.jaccard.jaccard_index()

OverflowError: CUDF failure at: /opt/conda/conda-bld/work/cpp/include/cudf/detail/sizes_to_offsets_iterator.cuh:323: Size of output exceeds the column size limit

Expected behavior

Perhaps long strings are not expected to work with this method, since I don't see it listed in #13048, but it would be good to get confirmation.

davidwendt commented 6 days ago

The jaccard API uses hash_character_ngrams internally, which produces a list column of integer values. The total number of integers in that list column is the number of ngrams in the strings column. Here that number exceeds the max size_type, so the function is unable to build the output list column.
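To make the overflow concrete, here is a rough back-of-the-envelope check for the reproducer. It assumes a string of n characters yields n - width + 1 character ngrams; that formula and the helper name `total_ngrams` are illustrative assumptions, not taken from the libcudf source.

```python
INT32_MAX = 2**31 - 1  # maximum value of an int32 size_type


def total_ngrams(rows, length, width):
    """Estimate the ngram count for `rows` strings of `length` characters.

    Assumes each string contributes length - width + 1 ngrams
    (zero if the string is shorter than `width`).
    """
    per_string = max(length - width + 1, 0)
    return rows * per_string


# Values from the reproducer: 11 strings, each INT32_MAX // 10 characters, width 5.
length = INT32_MAX // 10
total = total_ngrams(rows=11, length=length, width=5)
print(total, total > INT32_MAX)
```

Under that assumption, 11 rows exceed the limit while 10 rows of the same length would still fit.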

So you would need to limit the strings column size so that the total number of ngrams generated across the individual strings does not exceed the max size_type (int32). Meanwhile, I can work on modifying jaccard to avoid this limit, since it is an internal detail of that API.
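One possible way to apply that limit on the caller's side is to split the rows into batches whose estimated ngram totals each stay under int32 max. The sketch below is plain Python over a list of string lengths. The helper names `estimated_ngrams` and `batch_rows` are hypothetical, and the per-string estimate (length - width + 1) is an assumption about how ngrams are counted.

```python
INT32_MAX = 2**31 - 1  # maximum value of an int32 size_type


def estimated_ngrams(length, width):
    """Assumed ngram count for one string: length - width + 1, floored at 0."""
    return max(length - width + 1, 0)


def batch_rows(lengths, width, limit=INT32_MAX):
    """Group consecutive row indices so each batch's estimated ngram total <= limit."""
    batches, current, total = [], [], 0
    for row, length in enumerate(lengths):
        n = estimated_ngrams(length, width)
        if n > limit:
            # A single string this long cannot fit in any batch.
            raise ValueError(f"row {row} alone exceeds the limit")
        if current and total + n > limit:
            # Close the current batch before it would overflow.
            batches.append(current)
            current, total = [], 0
        current.append(row)
        total += n
    if current:
        batches.append(current)
    return batches


# Lengths from the reproducer: 11 strings of INT32_MAX // 10 characters each.
batches = batch_rows([INT32_MAX // 10] * 11, width=5)
print([len(b) for b in batches])
```

With cuDF, the lengths could presumably come from `Series.str.len()`, and each batch of rows could be passed to `jaccard_index` separately with the results concatenated afterwards. That workflow is untested here and is only meant as a stopgap until the internal limit is removed.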