abdullaholuk opened this issue 4 years ago
```python
class WordpieceTokenizer(object):

  def tokenize(self, text):
    text = convert_to_unicode(text)

    output_tokens = []
    for token in whitespace_tokenize(text):
      chars = list(token)
      if len(chars) > self.max_input_chars_per_word:
        output_tokens.append(self.unk_token)
        continue

      is_bad = False
      start = 0
      sub_tokens = []
      while start < len(chars):
        end = len(chars)
        cur_substr = None
        while start < end:
          substr = "".join(chars[start:end])
          if start == 0:
            substr = "▁" + six.ensure_str(substr)
          if substr in self.vocab:
            cur_substr = substr
            break
          end -= 1
        if cur_substr is None:
          is_bad = True
          break
        sub_tokens.append(cur_substr)
        start = end

      if is_bad:
        output_tokens.append(self.unk_token)
      else:
        output_tokens.extend(sub_tokens)
    return output_tokens
```
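As a sanity check, the greedy longest-match-first loop above can be exercised standalone. This is a minimal sketch, not the repo's actual class: `convert_to_unicode` and `whitespace_tokenize` are replaced with plain `str.split`, and the toy vocab entries are hypothetical, stored in SentencePiece style with "▁" on word-initial pieces:

```python
def wordpiece_tokenize(text, vocab, unk_token="[UNK]", max_chars=200):
    # Greedy longest-match-first subword tokenization, with the
    # SentencePiece-style "▁" prefix marking the start of each word
    # (mirroring the manual change in the snippet above).
    output_tokens = []
    for token in text.split():
        chars = list(token)
        if len(chars) > max_chars:
            output_tokens.append(unk_token)
            continue
        is_bad = False
        start = 0
        sub_tokens = []
        while start < len(chars):
            end = len(chars)
            cur_substr = None
            while start < end:
                substr = "".join(chars[start:end])
                if start == 0:
                    # Word-initial pieces are looked up with the "▁" marker.
                    substr = "▁" + substr
                if substr in vocab:
                    cur_substr = substr
                    break
                end -= 1
            if cur_substr is None:
                is_bad = True
                break
            sub_tokens.append(cur_substr)
            start = end
        if is_bad:
            output_tokens.append(unk_token)
        else:
            output_tokens.extend(sub_tokens)
    return output_tokens

# Hypothetical toy vocab: word-initial pieces carry "▁", continuations do not.
vocab = {"▁bir", "▁tok", "en"}
print(wordpiece_tokenize("bir token", vocab))  # ['▁bir', '▁tok', 'en']
```

Any word with no matching word-initial piece in the vocab collapses to `[UNK]`, which is exactly the failure mode described below.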
Hello, thanks for the great contribution!
I generated a vocab via the Hugging Face SentencePiece BPE tokenizer, but it does not produce an spm model file. According to these lines:
```python
def tokenize(self, text):
  if self.sp_model:
    split_tokens = encode_pieces(self.sp_model, text, return_unicode=False)
  else:
    split_tokens = []
    for token in self.basic_tokenizer.tokenize(text):
      for sub_token in self.wordpiece_tokenizer.tokenize(token):
        split_tokens.append(sub_token)
```
the tokenizer falls back to the WordpieceTokenizer when there is no spm model. But WordpieceTokenizer uses the "##" marker convention, while SentencePiece marks word starts with "▁". Could you add an option for this? I changed "##" to "▁" manually, but many users may not notice that this is needed.
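To illustrate why the marker mismatch produces so many `[UNK]`s: a SentencePiece-built vocab stores word-initial pieces as `"▁piece"` and continuations bare, so the stock "##"-style lookups never hit. A small sketch with a hypothetical two-piece vocab for the Turkish word "geldi":

```python
# Hypothetical vocab as a SentencePiece BPE trainer would emit it:
# word-initial pieces prefixed with "▁", continuations stored bare.
sp_vocab = {"▁gel", "di"}

def lookup(piece, is_word_start, convention):
    # "wordpiece" prefixes continuation pieces with "##";
    # "sentencepiece" prefixes word-initial pieces with "▁".
    if convention == "wordpiece":
        key = piece if is_word_start else "##" + piece
    else:
        key = "▁" + piece if is_word_start else piece
    return key in sp_vocab

# Splitting "geldi" into ("gel", "di"):
print(lookup("gel", True, "wordpiece"))       # False: vocab has "▁gel", not "gel"
print(lookup("di", False, "wordpiece"))       # False: vocab has "di", not "##di"
print(lookup("gel", True, "sentencepiece"))   # True
print(lookup("di", False, "sentencepiece"))   # True
```

With the "##" convention, neither lookup ever matches a "▁"-style vocab, so every word degrades to `[UNK]`, as in the output below.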
Otherwise, the tokenizer generates something like " [CLS] [UNK] [UNK] [UNK] [UNK] bir [UNK] [UNK] [UNK] [UNK] [UNK] , [UNK] [UNK] [UNK] bir [UNK] [MASK] [UNK] [UNK] . mo [UNK] ' de [MASK] [UNK] [UNK] , [UNK] ve [UNK] [UNK] ne [UNK] [UNK] , [UNK] [UNK] [UNK] [UNK] [MASK] [MASK] [UNK] . bu [UNK] , [UNK] [UNK] [UNK] [UNK] [MASK] [MASK] [MASK] [UNK] [UNK] bir [UNK] [UNK] ve [UNK] [UNK] . [SEP] uc [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] olan [UNK] , tek [MASK] [MASK] [UNK] - [MASK] [MASK] [UNK] . [MASK] [UNK] - [UNK] [UNK] [UNK] [UNK] , [UNK] [MASK] [UNK] [MASK] [MASK] [UNK] [UNK] [UNK] ▁ilki . [UNK] daha [UNK] [UNK] [UNK] [MASK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] . [SEP]" and clearly ALBERT gets no usable information from such input.