I was trying to use the repo for building a vocab, and I realized that the encode(text) function is used as a tokenizer. I am not sure if I am right, but I am not able to get the last token in the returned result. This is the function I am looking at:
def encode(text):
  """Encode a unicode string as a list of tokens.

  Args:
    text: a unicode string
  Returns:
    a list of tokens as Unicode strings
  """
  if not text:
    return []
  ret = []
  token_start = 0
  # Classify each character in the input string
  is_alnum = [c in _ALPHANUMERIC_CHAR_SET for c in text]
  add_remaining = False
  for pos in range(1, len(text)):
    add_remaining = False
    if is_alnum[pos] != is_alnum[pos - 1]:
      if not is_alnum[pos]:
        token = text[token_start:pos]
        if token != u" " or token_start == 0:
          add_remaining = False
          ret.append(token)
      else:
        add_remaining = True
      token_start = pos
  final_token = text[token_start:] if text[-1] in _ALPHANUMERIC_CHAR_SET else text[token_start:-1]
  if add_remaining:
    ret.append(final_token)
  return ret
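To reproduce what I am seeing, here is a minimal, self-contained sketch of how I call it. I am assuming _ALPHANUMERIC_CHAR_SET is the set of Unicode letters and digits (which is how I understand tokenizer.py builds it), and the test string is just my own example:

import sys
import unicodedata

# Assumption: _ALPHANUMERIC_CHAR_SET contains every Unicode letter and digit.
_ALPHANUMERIC_CHAR_SET = set(
    chr(i) for i in range(sys.maxunicode)
    if unicodedata.category(chr(i)).startswith("L")
    or unicodedata.category(chr(i)).startswith("N"))

print(encode(u"Hello world."))
# Prints ['Hello', 'world'] -- the trailing "." never shows up,
# although I expected it to come back as the last token.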
The following is a sample result: