Some further triaging revealed that it's actually divisibility by 64 that is the problem. (The correct result is `[1045]`.)
```python
import cudf

# Tokenize a single-character column at row counts around powers of two.
n_rows_ls = [32, 64, 128, 256, 512]
for n_rows in n_rows_ls:
    text = "I"
    cudf_ser = cudf.Series([text] * n_rows)
    cudf_tokens, masks, metadata = cudf_ser.str.subword_tokenize(
        "vocab-hash.txt",
        do_lower=True,
        max_rows_tensor=n_rows,
    )
    print(cudf_tokens[0:10])
```
```
[1045 0 0 0 0 0 0 0 0 0]
[ 3523 28954 28954 28954 28954 28954 28954 28954 28954 28954]
[ 3523 28954 28954 28954 28954 28954 28954 28954 28954 28954]
[100 0 0 0 0 0 0 0 0 0]
[100 0 0 0 0 0 0 0 0 0]
```
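For reference, these ids line up with the bert-base-uncased vocabulary, where `1045` is `"i"` and `100` is `[UNK]`; assuming `vocab-hash.txt` was built from that vocabulary, the expected ids can be cross-checked with Hugging Face's tokenizer:

```python
# Cross-check (assumes the hash file came from bert-base-uncased):
# "i" should map to 1045 and the unknown token to 100.
from transformers import BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")
print(tok.convert_tokens_to_ids(["i", "[UNK]"]))  # [1045, 100]
```

So the `[100 0 0 ...]` rows mean the tokenizer mapped every input to `[UNK]` instead of to `"i"`.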
**Another bigger test**
```python
import cudf

# Every row count here is an exact multiple of 64.
n_rows_ls = [64 * i for i in range(1, 20)]
for n_rows in n_rows_ls:
    text = "I"
    cudf_ser = cudf.Series([text] * n_rows)
    cudf_tokens, masks, metadata = cudf_ser.str.subword_tokenize(
        "vocab-hash.txt",
        do_lower=True,
        max_rows_tensor=n_rows,
    )
    print(cudf_tokens[0:10])
```
```
[ 3523 28954 28954 28954 28954 28954 28954 28954 28954 28954]
[ 3523 28954 28954 28954 28954 28954 28954 28954 28954 28954]
[ 3523 28954 28954 28954 28954 28954 28954 28954 28954 28954]
[100 0 0 0 0 0 0 0 0 0]
[100 0 0 0 0 0 0 0 0 0]
[100 0 0 0 0 0 0 0 0 0]
[100 0 0 0 0 0 0 0 0 0]
[100 0 0 0 0 0 0 0 0 0]
[100 0 0 0 0 0 0 0 0 0]
[100 0 0 0 0 0 0 0 0 0]
[100 0 0 0 0 0 0 0 0 0]
[100 0 0 0 0 0 0 0 0 0]
[100 0 0 0 0 0 0 0 0 0]
[100 0 0 0 0 0 0 0 0 0]
[100 0 0 0 0 0 0 0 0 0]
[100 0 0 0 0 0 0 0 0 0]
[100 0 0 0 0 0 0 0 0 0]
[100 0 0 0 0 0 0 0 0 0]
[100 0 0 0 0 0 0 0 0 0]
```
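A compact way to confirm the pattern is to scan row counts programmatically. This helper is hypothetical (not part of the original report) and assumes the same `vocab-hash.txt` and vocabulary as above:

```python
# Hypothetical scan: flag every row count whose first token id deviates
# from 1045, the id of "i" in the bert-base-uncased vocabulary.
import cudf

def first_token_id(n_rows):
    ser = cudf.Series(["I"] * n_rows)
    tokens, _, _ = ser.str.subword_tokenize(
        "vocab-hash.txt", do_lower=True, max_rows_tensor=n_rows
    )
    return int(tokens[0])

bad = [n for n in range(1, 257) if first_token_id(n) != 1045]
print(bad)  # on the affected build this should print the multiples of 64
```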
With the same row counts offset by one (so none is a multiple of 64), the results are correct:

```python
import cudf

# Row counts offset by one from the first test; none is a multiple of 64.
n_rows_ls = [33, 65, 129, 257, 513]
for n_rows in n_rows_ls:
    text = "I"
    cudf_ser = cudf.Series([text] * n_rows)
    cudf_tokens, masks, metadata = cudf_ser.str.subword_tokenize(
        "vocab-hash.txt",
        do_lower=True,
        max_rows_tensor=n_rows,
    )
    print(cudf_tokens[0:10])
```
```
[1045 0 0 0 0 0 0 0 0 0]
[1045 0 0 0 0 0 0 0 0 0]
[1045 0 0 0 0 0 0 0 0 0]
[1045 0 0 0 0 0 0 0 0 0]
[1045 0 0 0 0 0 0 0 0 0]
```
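Until a fix lands, a possible workaround, untested and offered only as a sketch, is to nudge the row count off a multiple of 64 with a dummy row and then ignore that row's outputs:

```python
# Hypothetical workaround sketch (not from the issue): pad the input with
# one dummy row so the row count is no longer a multiple of 64.
import cudf

ser = cudf.Series(["I"] * 64)
if len(ser) % 64 == 0:
    ser = cudf.concat([ser, cudf.Series([""])], ignore_index=True)
tokens, masks, metadata = ser.str.subword_tokenize(
    "vocab-hash.txt", do_lower=True, max_rows_tensor=len(ser)
)
# The trailing slices of tokens/masks/metadata belong to the dummy row and
# can be dropped (each row contributes max_length entries to tokens and
# masks, and 3 entries to metadata).
```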
Fixed in #6519
**Describe the bug**
Subword tokenizer on ~~1_000_000~~ any number of rows divisible by 64 gives incorrect results.

**Steps/Code to reproduce bug**
Set up vocab (see the sketch below).
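A minimal sketch of that setup, assuming `vocab-hash.txt` is produced from a plain-text BERT vocabulary (e.g. bert-base-uncased's `vocab.txt`, a placeholder path here) with cuDF's `hash_vocab` utility:

```python
# Sketch (assumption, not from the issue): build the perfect-hash vocab
# file that subword_tokenize expects from a one-token-per-line vocab file.
from cudf.utils.hash_vocab_utils import hash_vocab

hash_vocab("vocab.txt", "vocab-hash.txt")
```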
Actual reproducer: see the code snippets above.
**Expected behavior**
I would expect results similar to what I get with 999_999 rows.

**Environment overview (please complete the following information)**
CC: @davidwendt
CC: @BartleyR / @raykallen (FYI, in case you run into the same issue)