rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[BUG] Overflow potentially corrupting hashes in hash_vocab implementation #12403

Open vyasr opened 1 year ago

vyasr commented 1 year ago

**Describe the bug**
The hash vocab test in cudf currently warns about an overflow occurring. This is easy to observe by running the test with warnings promoted to errors.

**Steps/Code to reproduce bug**
Execute `pytest -W error python/cudf/cudf/tests/test_hash_vocab.py::test_correct_bert_base_vocab_hash` from the root of the repository.

The output should include a traceback like this:

```
test_correct_bert_base_vocab_hash ____________________________________________________________________________________

datadir = '/home/vyasr/local/rapids/cudf/python/cudf/cudf/tests/data/subword_tokenizer_data/bert_base_cased_sampled', tmpdir = local('/tmp/pytest-of-rapids/pytest-2/test_correct_bert_base_vocab_h0')

    def test_correct_bert_base_vocab_hash(datadir, tmpdir):
        # The vocabulary is drawn from bert-base-cased
        vocab_path = os.path.join(datadir, "vocab.txt")

        groundtruth_path = os.path.join(datadir, "vocab-hash.txt")
        output_path = tmpdir.join("cudf-vocab-hash.txt")
>       hash_vocab(vocab_path, output_path)

python/cudf/cudf/tests/test_hash_vocab.py:23:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
python/cudf/cudf/utils/hash_vocab_utils.py:269: in hash_vocab
    ) = _perfect_hash(keys, 10)
python/cudf/cudf/utils/hash_vocab_utils.py:129: in _perfect_hash
    internal_table, coeff_a, coeff_b = _find_hash_for_internal(b)
python/cudf/cudf/utils/hash_vocab_utils.py:102: in _find_hash_for_internal
    bins = _make_bins(hash_bin, new_length, a, b)
python/cudf/cudf/utils/hash_vocab_utils.py:60: in _make_bins
    bins[_hash_func(item, a, b, num_bins)].append(item)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

k = 233297689050786, a = 21608458564245, b = 116, size = 6

    def _hash_func(k, a, b, size):
        k = np.uint64(k)
        a = np.uint64(a)
        b = np.uint64(b)
        size = np.uint64(size)
>       return ((a * k + b) % PRIME) % size
E       RuntimeWarning: overflow encountered in ulong_scalars

python/cudf/cudf/utils/hash_vocab_utils.py:49: RuntimeWarning
------------------------------------------ Captured stdout call ------------------------------------------
Attempting to build table using 1.500000n space
Longest bin was 11
Processing bin 0 / 875 of size = 6
```

**Expected behavior**
We should not have overflows occurring. The overflow happens because all of the inputs to `_hash_func` are converted to `np.uint64` (limited to 64 bits) rather than left as primitive Python ints (which have unlimited precision). I attempted the naive modification of simply removing the `np.uint64` conversions (which also requires rewriting some call sites to do conversions, since they involve indexing into numpy arrays or adding numpy ints to Python ints), but my quick conversion led to the test failing outright. I didn't check my work all that thoroughly, so it's possible I made an error. Either way, we should determine whether the numpy integer overflow here is a property we are implicitly depending on, a bug that users could actually hit and that we need to fix, or simply expected behavior whose warning can be silenced.
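For reference, here is a minimal sketch of the wraparound using the values from the traceback above. The `PRIME` value below is a stand-in (the real constant is defined in `hash_vocab_utils.py`), so only the mechanism, not the exact numbers, is representative:

```python
import warnings

import numpy as np

# Values copied from the traceback above.
k, a, b, size = 233297689050786, 21608458564245, 116, 6
# Stand-in modulus; the actual PRIME constant lives in
# python/cudf/cudf/utils/hash_vocab_utils.py.
PRIME = np.uint64(281474976710677)

# Python ints have unlimited precision, so this is exact.
exact = ((a * k + b) % int(PRIME)) % size

# np.uint64 arithmetic silently wraps modulo 2**64, which is what
# triggers the RuntimeWarning (numpy scalar ops warn on overflow).
with warnings.catch_warnings():
    warnings.simplefilter("ignore", RuntimeWarning)
    wrapped = int(
        ((np.uint64(a) * np.uint64(k) + np.uint64(b)) % PRIME)
        % np.uint64(size)
    )

# a * k is roughly 5e27, far beyond 2**64 (~1.8e19), so the two
# results will generally disagree.
print(exact, wrapped)
```

If the wrapped arithmetic turns out to be intentional, one way to silence the warning would be to wrap the computation in `np.errstate(over="ignore")`.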

VibhuJawa commented 6 months ago

I think that, long term, we want to move away from the `hash_vocab` functionality and make the subword tokenizer work with vocab files directly, similar to what we do for BPE.

vyasr commented 5 months ago

Hmm, OK, so you think we'll end up removing this functionality altogether at some point, then?