vyasr opened this issue 1 year ago
I think long term we want to move away from the hash_vocab functionality and make the subword tokenizer work with vocab files directly, similar to what we do in BPE.

Hmm, OK, so you think we'll end up removing this functionality altogether at some point, then?
Describe the bug
The hash vocab test in cudf currently warns about an overflow occurring. This can easily be observed by running the pytest with warnings escalated to errors.
Steps/Code to reproduce bug
Execute

pytest -W error python/cudf/cudf/tests/test_hash_vocab.py::test_correct_bert_base_vocab_hash

from the root of the repository. The output should include a traceback like this:
Expected behavior
We should not have overflows occurring. The reason for the overflow is that all the inputs to _hash_func are being converted to np.uint64 (limited to 64 bits) rather than left as native Python ints (which have unlimited precision). I attempted the naive modification of simply removing the conversions to np.uint64 here (which also requires rewriting some of the call sites to perform conversions, since they involve indexing into numpy arrays or adding numpy ints to Python ints), but my quick conversion led to the test failing outright. I didn't check my work all that thoroughly, so it's possible I made an error. In any case, we should make sure we understand whether the numpy integer overflow here is a property we are depending on implicitly, a bug that users could actually hit and we need to fix, or just expected behavior whose warning can be silenced.
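For anyone reproducing the distinction outside of cudf: the sketch below is a hypothetical multiply-and-add hash step (it is not the actual arithmetic in _hash_func) that shows why the np.uint64 version triggers a RuntimeWarning under `pytest -W error` while the plain Python int version does not, even though both produce the same 64-bit wrapped value.

```python
import warnings

import numpy as np


def hash_step_numpy(h, c):
    # np.uint64 arithmetic wraps modulo 2**64; NumPy emits a
    # RuntimeWarning on scalar overflow, which -W error escalates.
    return np.uint64(h) * np.uint64(31) + np.uint64(c)


def hash_step_python(h, c):
    # Python ints have unlimited precision, so no warning is raised;
    # the 64-bit wrap must be applied explicitly to match.
    return (h * 31 + c) % (1 << 64)


# The wrapped results agree; only the NumPy version warns on overflow.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", RuntimeWarning)
    assert int(hash_step_numpy(2**63 + 5, 7)) == hash_step_python(2**63 + 5, 7)
```

This suggests the warning is benign if the hash was designed around modulo-2**64 wrapping, but that is exactly the property the issue asks us to confirm before silencing it.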