In the file master/tensorflow_text/tools/wordpiece_vocab/wordpiece_tokenizer_learner_lib.py, the algorithm for choosing the vocabulary differs from the one described in the tutorial.
The counts of all prefixes of a word should be decremented only if the word is selected (its count meets the threshold), but the current implementation decrements the counts of all prefixes whether or not the word is selected for the next step.
I think the current implementation is wrong. For example, suppose I have (hell, 200), (hello, 50), (helle, 50), (hella, 50) and the threshold is 100. The word hell should clearly be selected for the next stage, but with the current implementation (the following code), after processing (hello, 50), (helle, 50), (hella, 50), hell is left with a count of only 50 and is discarded.
# Get all tokens that have a count above the threshold.
for length in range(params.max_token_length, 0, -1):
  for token, count in subtokens[length].items():
    if count >= thresh:
      next_tokens[token] = count
    # Decrement the count of all prefixes.
    if len(token) > length:  # This token includes the joiner.
      joiner_len = len(params.joiner)
      for i in range(1 + joiner_len, length + joiner_len):
        prefix = token[0:i]
        if prefix in subtokens[i - joiner_len]:
          subtokens[i - joiner_len][prefix] -= count
    else:
      for i in range(1, length):
        prefix = token[0:i]
        if prefix in subtokens[i]:
          subtokens[i][prefix] -= count
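It seems that the if len(token) > length: ... else: block should be nested inside if count >= thresh:, so that prefix counts are decremented only for tokens that are actually selected.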
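To make the example concrete, here is a minimal sketch that reproduces it. It is not the library code: the function `select_tokens` and the flag `decrement_only_if_selected` are hypothetical names introduced only for this illustration, joiner handling is omitted, and the input counts are taken directly from the example above.

```python
def select_tokens(counts, thresh, max_token_length, decrement_only_if_selected):
    """Simplified version of the selection loop (hypothetical helper, no joiner)."""
    # subtokens[length] maps token -> count, mirroring the layout used in the issue.
    subtokens = {length: {} for length in range(1, max_token_length + 1)}
    for token, count in counts.items():
        if len(token) <= max_token_length:
            subtokens[len(token)][token] = count

    next_tokens = {}
    # Process the longest tokens first, as the quoted loop does.
    for length in range(max_token_length, 0, -1):
        for token, count in subtokens[length].items():
            selected = count >= thresh
            if selected:
                next_tokens[token] = count
            # Current behavior: always decrement prefixes.
            # Proposed behavior: decrement prefixes only for selected tokens.
            if selected or not decrement_only_if_selected:
                for i in range(1, length):
                    prefix = token[:i]
                    if prefix in subtokens[i]:
                        subtokens[i][prefix] -= count
    return next_tokens


counts = {"hell": 200, "hello": 50, "helle": 50, "hella": 50}

current = select_tokens(counts, thresh=100, max_token_length=5,
                        decrement_only_if_selected=False)
proposed = select_tokens(counts, thresh=100, max_token_length=5,
                         decrement_only_if_selected=True)

print(current)   # {}
print(proposed)  # {'hell': 200}
```

With unconditional decrementing, hell drops from 200 to 50 and is discarded. With the decrement nested under the threshold check, hell keeps its count of 200 and is selected.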