google-research / albert

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

tokenization.py - WordpieceTokenizer uses "##", but Sentencepiece vocab generates "▁" #196

Open abdullaholuk opened 4 years ago

abdullaholuk commented 4 years ago

Hello, thanks for the great contribution!

I generated a vocab with the Hugging Face SentencePiece BPE tokenizer, but it does not generate an spm model file.

According to these lines:

```python
def tokenize(self, text):
  if self.sp_model:
    split_tokens = encode_pieces(self.sp_model, text, return_unicode=False)
  else:
    split_tokens = []
    for token in self.basic_tokenizer.tokenize(text):
      for sub_token in self.wordpiece_tokenizer.tokenize(token):
        split_tokens.append(sub_token)
```

The tokenizer falls back to the WordpieceTokenizer when there is no spm model, but WordpieceTokenizer expects "##"-prefixed pieces in the vocab, while a SentencePiece vocab marks pieces with "▁" instead. Could you add an option for this? I changed "##" to "▁" manually, but many users may not notice the mismatch.

Otherwise, the tokenizer generates something like " [CLS] [UNK] [UNK] [UNK] [UNK] bir [UNK] [UNK] [UNK] [UNK] [UNK] , [UNK] [UNK] [UNK] bir [UNK] [MASK] [UNK] [UNK] . mo [UNK] ' de [MASK] [UNK] [UNK] , [UNK] ve [UNK] [UNK] ne [UNK] [UNK] , [UNK] [UNK] [UNK] [UNK] [MASK] [MASK] [UNK] . bu [UNK] , [UNK] [UNK] [UNK] [UNK] [MASK] [MASK] [MASK] [UNK] [UNK] bir [UNK] [UNK] ve [UNK] [UNK] . [SEP] uc [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] olan [UNK] , tek [MASK] [MASK] [UNK] - [MASK] [MASK] [UNK] . [MASK] [UNK] - [UNK] [UNK] [UNK] [UNK] , [UNK] [MASK] [UNK] [MASK] [MASK] [UNK] [UNK] [UNK] ▁ilki . [UNK] daha [UNK] [UNK] [UNK] [MASK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] . [SEP]" and clearly there is no usable information left for ALBERT.
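
As a minimal sketch (not from this issue), one way to obtain the spm model file that `FullTokenizer`'s `spm_model_file` argument expects is to train it directly with the `sentencepiece` package; the corpus path and hyperparameters below are placeholders:

```python
import sentencepiece as spm

# Placeholder corpus path and hyperparameters; adjust to your data.
spm.SentencePieceTrainer.train(
    input="corpus.txt",        # one sentence per line
    model_prefix="spm_model",  # writes spm_model.model and spm_model.vocab
    vocab_size=30000,
    model_type="unigram",      # SentencePiece's default; "bpe" also works
    character_coverage=0.9995,
)
```

With the resulting `spm_model.model` passed as `spm_model_file`, `tokenize` takes the `encode_pieces` branch above, so the "##" vs "▁" mismatch never arises.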

abdullaholuk commented 4 years ago

Here is the modified `WordpieceTokenizer.tokenize`, with the "##" prefix replaced by "▁":

```python
class WordpieceTokenizer(object):

  def tokenize(self, text):
    text = convert_to_unicode(text)

    output_tokens = []
    for token in whitespace_tokenize(text):
      chars = list(token)
      if len(chars) > self.max_input_chars_per_word:
        output_tokens.append(self.unk_token)
        continue

      is_bad = False
      start = 0
      sub_tokens = []
      while start < len(chars):
        end = len(chars)
        cur_substr = None
        # Greedy longest-match-first lookup against the vocab.
        while start < end:
          substr = "".join(chars[start:end])
          if start == 0:
            # Changed: the stock code prefixes non-initial pieces with "##";
            # here the word-initial piece gets the SentencePiece "▁" marker.
            substr = "▁" + six.ensure_str(substr)
          if substr in self.vocab:
            cur_substr = substr
            break
          end -= 1
        if cur_substr is None:
          is_bad = True
          break
        sub_tokens.append(cur_substr)
        start = end

      if is_bad:
        output_tokens.append(self.unk_token)
      else:
        output_tokens.extend(sub_tokens)
    return output_tokens
```
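
For the requested option, a hypothetical helper (not part of the repo) could inspect a plain-text vocab file, one token per line as `load_vocab` expects, and report which marker convention it uses, so the right prefix logic can be selected automatically:

```python
# Hypothetical helper: detect whether a vocab file uses the SentencePiece
# "▁" word-start marker or the WordPiece "##" continuation marker.
def detect_subword_marker(vocab_path):
  with open(vocab_path, encoding="utf-8") as f:
    tokens = [line.rstrip("\n") for line in f]
  has_sp_marker = any(tok.startswith(u"\u2581") for tok in tokens)  # "▁"
  has_wp_marker = any(tok.startswith("##") for tok in tokens)
  if has_sp_marker and not has_wp_marker:
    return "sentencepiece"
  if has_wp_marker and not has_sp_marker:
    return "wordpiece"
  return "unknown"

# Example: print(detect_subword_marker("vocab.txt"))
```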