Open ayushdg opened 1 week ago
The jaccard API uses hash_character_ngrams internally, which produces a list column of integer values. The total number of integers in that list column is the number of ngrams for the whole strings column. Here that total exceeds the max size_type, so the function is unable to build the output list column.
So you would need to limit the strings column size so that the total number of generated ngrams, summed over the individual strings, does not exceed the max size_type (int32).
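To make the limit concrete, here is a rough pure-Python sketch for estimating whether a set of strings would cross it. It assumes a string of n characters yields n - width + 1 character ngrams (and none if it is shorter than width); that formula and the helper names are my own illustration, not something taken from the libcudf source.

```python
INT32_MAX = 2**31 - 1  # largest value a 32-bit signed size_type can hold


def ngram_count(s: str, width: int) -> int:
    # Assumption: n characters give n - width + 1 character ngrams,
    # and a string shorter than `width` gives none.
    return max(len(s) - width + 1, 0)


def total_ngrams(strings, width: int) -> int:
    # The limit applies to the sum over the whole column, not to each string.
    return sum(ngram_count(s, width) for s in strings)


def exceeds_limit(strings, width: int, limit: int = INT32_MAX) -> bool:
    return total_ngrams(strings, width) > limit
```

For example, `exceeds_limit(my_strings, 5)` on a host-side list of the strings would give an early warning before calling jaccard_index with the same width.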
Meanwhile I can work on modifying jaccard to avoid this limit, since it is an internal detail of that API.
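Until then, one possible workaround is to process the column in batches of rows. The sketch below is only an illustration under the same assumption as above (n - width + 1 ngrams per string of n characters); `batch_rows` is a hypothetical helper, not a cudf function. It takes per-row character lengths and greedily groups consecutive row indices so that each batch's estimated ngram total stays within the limit.

```python
def batch_rows(lengths, width, limit=2**31 - 1):
    """Group consecutive row indices so each batch's estimated
    ngram total stays at or below `limit`."""
    batches, current, running = [], [], 0
    for i, n in enumerate(lengths):
        # Assumed per-row ngram count: n - width + 1, floored at zero.
        count = max(n - width + 1, 0)
        if count > limit:
            # A single row over the limit cannot be fixed by batching.
            raise ValueError(f"row {i} alone would produce {count} ngrams")
        if current and running + count > limit:
            batches.append(current)
            current, running = [], 0
        current.append(i)
        running += count
    if current:
        batches.append(current)
    return batches
```

Each batch of indices could then be used to slice both input columns before calling jaccard_index on the slices. The sketch looks at one column only, so with two inputs the lengths of both would need to be checked.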
Describe the bug
Calling jaccard_index on long strings leads to
OverflowError: CUDF failure at: /opt/conda/conda-bld/work/cpp/include/cudf/detail/sizes_to_offsets_iterator.cuh:323: Size of output exceeds the column size limit
Steps/Code to reproduce bug
Results in:
Expected behavior
Perhaps long strings are not expected to work with this method, since I don't see it listed in #13048, but it would be good to get confirmation.
Environment overview (please complete the following information)
docker pull & docker run commands used
Environment details
Please run and paste the output of the cudf/print_env.sh script here, to gather any other relevant environment details
Additional context
Add any other context about the problem here.