kwonmha / bert-vocab-builder

Builds a wordpiece (subword) vocabulary compatible with Google Research's BERT

splitting strategy in tokenize.py #14

Closed · mandalbiswadip closed this issue 4 years ago

mandalbiswadip commented 4 years ago

I was trying to use the repo to build a vocab and realized that the encode(text) function is used as the tokenizer. I am not sure if I am right, but the last token is missing from the returned result.

def encode(text):
  """Encode a unicode string as a list of tokens.

  Args:
    text: a unicode string
  Returns:
    a list of tokens as Unicode strings
  """
  if not text:
    return []
  ret = []
  token_start = 0
  # Classify each character in the input string
  is_alnum = [c in _ALPHANUMERIC_CHAR_SET for c in text]
  add_remaining = False
  for pos in range(1, len(text)):
    add_remaining = False
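    # This reset runs on every iteration, so after the loop the flag only
    # reflects the very last character of the text.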
    if is_alnum[pos] != is_alnum[pos - 1]:
      if not is_alnum[pos]:
        token = text[token_start:pos]
        if token != u" " or token_start == 0:
          add_remaining = False
          ret.append(token)
      else:
        add_remaining = True
        token_start = pos

  final_token = text[token_start:] if text[-1] in _ALPHANUMERIC_CHAR_SET else text[token_start:-1]
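  # add_remaining is False whenever the trailing token is longer than one
  # character, so that token ends up being dropped.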
  if add_remaining:
    ret.append(final_token)
  return ret

The following is a sample result:

print(encode("knee injury present"))
>>['knee', 'injury']
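
For what it's worth, below is a minimal sketch of one way to keep the trailing token. It is only an illustration of the idea, not the fix that was applied to the repo: encode_fixed is a hypothetical name, _ALPHANUMERIC_CHAR_SET here is just a stand-in approximating the Unicode letter/number set defined in the repo, and I dropped the add_remaining flag and the single-space check since they don't seem to be needed once separator runs are skipped.

import sys
import unicodedata

# Stand-in for the repo's _ALPHANUMERIC_CHAR_SET: Unicode letters and numbers.
_ALPHANUMERIC_CHAR_SET = set(
    chr(i) for i in range(sys.maxunicode)
    if unicodedata.category(chr(i))[0] in ("L", "N"))


def encode_fixed(text):
  """Like encode(), but never drops a trailing alphanumeric token."""
  if not text:
    return []
  ret = []
  token_start = 0
  is_alnum = [c in _ALPHANUMERIC_CHAR_SET for c in text]
  for pos in range(1, len(text)):
    if is_alnum[pos] != is_alnum[pos - 1]:
      if not is_alnum[pos]:
        # An alphanumeric run just ended: keep it as a token.
        ret.append(text[token_start:pos])
      else:
        # A new alphanumeric run starts here.
        token_start = pos
  # Keep the trailing run only if it is alphanumeric, mirroring how the loop
  # above skips separator/punctuation runs.
  if is_alnum[-1]:
    ret.append(text[token_start:])
  return ret


print(encode_fixed("knee injury present"))
# >> ['knee', 'injury', 'present']

The idea is simply to append the final run whenever the text ends with an alphanumeric character; trailing punctuation or whitespace is still dropped, matching what the loop does for earlier tokens.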
kwonmha commented 4 years ago

Thank you for reporting. I'll fix it asap.

kwonmha commented 4 years ago

I fixed the issue. Check if it works.

mandalbiswadip commented 4 years ago

works