rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[BUG] Subword tokenizer on rows divisible by 64 gives incorrect results #6501

Closed VibhuJawa closed 4 years ago

VibhuJawa commented 4 years ago

Describe the bug Subword tokenizer on ~~1_000_000~~ any row count divisible by 64 gives incorrect results.

Steps/Code to reproduce bug

Set up Vocab

# !rm -rf *.txt
# !wget https://raw.githubusercontent.com/rapidsai/clx/267c6d30805c9dcbf80840f222bf31c5c4b7068a/python/clx/analytics/perfect_hash.py
# !wget https://cdn.huggingface.co/bert-base-uncased-vocab.txt
# !python3 perfect_hash.py  --vocab 'bert-base-uncased-vocab.txt' --output 'vocab-hash.txt' --compact

Actual reproducer:


import cudf
n_rows=999_999+1
text = "I" 
cudf_ser = cudf.Series([text]*n_rows)
cudf_tokens, masks, metadata = cudf_ser.str.subword_tokenize("vocab-hash.txt",
                                                             do_lower=True,
                                                             do_truncate=True,
                                                             max_rows_tensor=n_rows
                                                            )
print(cudf_tokens[0:10])
[100   0   0   0   0   0   0   0   0   0]

Expected behavior

I would expect the same result as with 999_999 rows (in the bert-base-uncased vocabulary, id 1045 is "i", while id 100 is the [UNK] token).


import cudf
n_rows=999_999
text = "I" 
cudf_ser = cudf.Series([text]*n_rows)
cudf_tokens, masks, metadata = cudf_ser.str.subword_tokenize("vocab-hash.txt",
                                                             do_lower=True,
                                                             do_truncate=True,
                                                             max_rows_tensor=n_rows
                                                            )
print(cudf_tokens[0:10])
[1045    0    0    0    0    0    0    0    0    0]
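For reference, the token ids above can be checked against the plain vocabulary file from the setup step (bert-base-uncased-vocab.txt stores one token per line, and the token id is the zero-based line index). A minimal sketch, assuming that file is present locally:

```python
# Sketch: map token ids back to strings using the plain BERT vocab file.
# The vocab file has one token per line; the id is the line index.

def load_vocab(path):
    """Load a one-token-per-line vocab file into a list indexed by id."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

def id_to_token(vocab, token_id):
    """Look up the token string for a given id."""
    return vocab[token_id]

# For bert-base-uncased one would expect id 1045 -> "i" and
# id 100 -> "[UNK]", which is why the divisible-by-64 output is wrong.
```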

Environment overview (please complete the following information)

# packages in environment at /raid/vjawa/conda/envs/cudf_oct_12:
cudf                      0.16.0a201012   cuda_10.2_py37_g421fddb8b5_1970    rapidsai-nightly
libcudf                   0.16.0a201012   cuda10.2_g421fddb8b5_1970    rapidsai-nightly

CC: @davidwendt ,

CC: @BartleyR / @raykallen (FYI: In case you guys run into the same issue)

VibhuJawa commented 4 years ago

Some further triaging revealed that it's actually divisibility by 64 which is the problem.

Incorrect results for row counts divisible by 64:

(The correct result is [1045].)

import cudf
n_rows_ls=[32,64,128,256,512]
for n_rows in n_rows_ls:
    text = "I" 
    cudf_ser = cudf.Series([text]*n_rows)
    cudf_tokens, masks, metadata = cudf_ser.str.subword_tokenize("vocab-hash.txt",
                                                                 do_lower=True,
                                                                 max_rows_tensor=n_rows
                                                                )
    print(cudf_tokens[0:10])
[1045    0    0    0    0    0    0    0    0    0]
[ 3523 28954 28954 28954 28954 28954 28954 28954 28954 28954]
[ 3523 28954 28954 28954 28954 28954 28954 28954 28954 28954]
[100   0   0   0   0   0   0   0   0   0]
[100   0   0   0   0   0   0   0   0   0]

Another, bigger test:

import cudf
n_rows_ls=[64*i for i in range(1,20)]
for n_rows in n_rows_ls:
    text = "I" 
    cudf_ser = cudf.Series([text]*n_rows)
    cudf_tokens, masks, metadata = cudf_ser.str.subword_tokenize("vocab-hash.txt",
                                                                 do_lower=True,
                                                                 max_rows_tensor=n_rows
                                                                )
    print(cudf_tokens[0:10])

[ 3523 28954 28954 28954 28954 28954 28954 28954 28954 28954]
[ 3523 28954 28954 28954 28954 28954 28954 28954 28954 28954]
[ 3523 28954 28954 28954 28954 28954 28954 28954 28954 28954]
[100   0   0   0   0   0   0   0   0   0]
[100   0   0   0   0   0   0   0   0   0]
[100   0   0   0   0   0   0   0   0   0]
[100   0   0   0   0   0   0   0   0   0]
[100   0   0   0   0   0   0   0   0   0]
[100   0   0   0   0   0   0   0   0   0]
[100   0   0   0   0   0   0   0   0   0]
[100   0   0   0   0   0   0   0   0   0]
[100   0   0   0   0   0   0   0   0   0]
[100   0   0   0   0   0   0   0   0   0]
[100   0   0   0   0   0   0   0   0   0]
[100   0   0   0   0   0   0   0   0   0]
[100   0   0   0   0   0   0   0   0   0]
[100   0   0   0   0   0   0   0   0   0]
[100   0   0   0   0   0   0   0   0   0]
[100   0   0   0   0   0   0   0   0   0]

Results are correct if n_rows is not divisible by 64:

import cudf
n_rows_ls=[33,65,129,257,513]
for n_rows in n_rows_ls:
    text = "I" 
    cudf_ser = cudf.Series([text]*n_rows)
    cudf_tokens, masks, metadata = cudf_ser.str.subword_tokenize("vocab-hash.txt",
                                                                 do_lower=True,
                                                                 max_rows_tensor=n_rows
                                                                )
    print(cudf_tokens[0:10])
[1045    0    0    0    0    0    0    0    0    0]
[1045    0    0    0    0    0    0    0    0    0]
[1045    0    0    0    0    0    0    0    0    0]
[1045    0    0    0    0    0    0    0    0    0]
[1045    0    0    0    0    0    0    0    0    0]
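Given that only multiples of 64 are affected, one possible interim workaround (a sketch only; `pad_row_count` is a hypothetical helper, not a cudf API) is to append a single dummy row whenever the row count is a multiple of 64, tokenize, and then drop the extra row's output:

```python
def pad_row_count(n_rows, block=64):
    """Return a row count that sidesteps the divisible-by-64 bug.

    If n_rows is an exact multiple of `block`, the caller should append
    one dummy row (e.g. an empty string) before tokenizing, then slice
    the resulting tokens/masks/metadata back down to n_rows rows.
    """
    return n_rows + 1 if n_rows % block == 0 else n_rows
```

For example, with the reproducer above one would tokenize `cudf.Series([text] * n_rows + [""])` when `n_rows % 64 == 0` and keep only the first `n_rows` rows of each output tensor.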
davidwendt commented 4 years ago

Fixed in #6519