Wrong BPE algorithm implementation?

zhangguanqun commented 3 years ago

In master/tensorflow_text/tools/wordpiece_vocab/wordpiece_tokenizer_learner_lib.py, the algorithm for choosing the vocabulary differs from the tutorial here. The counts of a word's prefixes should be decremented only if the word is selected (that is, its count meets or exceeds the threshold), but the current implementation decrements the counts of all prefixes whether or not the word is selected for the next step. I think the current implementation is wrong. For example, suppose I have (hell, 200), (hello, 50), (helle, 50), (hella, 50) and a threshold of 100. The word hell should clearly be selected for the next stage, but with the current implementation (the following code), after (hello, 50), (helle, 50), and (hella, 50) are processed, hell has a count of only 50 (200 - 3 * 50) and is discarded.

    # Get all tokens that have a count above the threshold.
    for length in range(params.max_token_length, 0, -1):
      for token, count in subtokens[length].items():
        if count >= thresh:
          next_tokens[token] = count
        # Decrement the count of all prefixes.
        if len(token) > length:  # This token includes the joiner.
          joiner_len = len(params.joiner)
          for i in range(1 + joiner_len, length + joiner_len):
            prefix = token[0:i]
            if prefix in subtokens[i - joiner_len]:
              subtokens[i - joiner_len][prefix] -= count
        else:
          for i in range(1, length):
            prefix = token[0:i]
            if prefix in subtokens[i]:
              subtokens[i][prefix] -= count

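It seems the whole prefix-decrement block (if len(token) > length: ... else: ...) should be nested inside if count >= thresh:, so that prefixes are decremented only for tokens that are actually selected.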
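
To make the example concrete, here is a minimal sketch that runs both variants on the counts above. It is a simplified stand-in, not the library code: the function select_tokens and its only_if_selected flag are hypothetical names, the joiner handling is left out, and the subtokens layout (length -> {token: count}) is assumed from the snippet above.

```python
import copy


def select_tokens(subtokens, thresh, max_token_length, only_if_selected):
    """Simplified selection loop (hypothetical helper, joiner case omitted).

    subtokens maps token length -> {token: count}.
    only_if_selected=False mimics the quoted code: prefixes are always decremented.
    only_if_selected=True mimics the proposed change: prefixes are decremented
    only for tokens whose count meets the threshold.
    """
    subtokens = copy.deepcopy(subtokens)  # do not mutate the caller's counts
    next_tokens = {}
    for length in range(max_token_length, 0, -1):
        for token, count in subtokens.get(length, {}).items():
            selected = count >= thresh
            if selected:
                next_tokens[token] = count
            if only_if_selected and not selected:
                continue
            # Decrement the count of every proper prefix that is being tracked.
            for i in range(1, length):
                prefix = token[:i]
                if prefix in subtokens.get(i, {}):
                    subtokens[i][prefix] -= count
    return next_tokens


counts = {
    4: {"hell": 200},
    5: {"hello": 50, "helle": 50, "hella": 50},
}

current = select_tokens(counts, thresh=100, max_token_length=5, only_if_selected=False)
proposed = select_tokens(counts, thresh=100, max_token_length=5, only_if_selected=True)

print("current behaviour: ", current)
print("proposed behaviour:", proposed)
```

With these inputs the always-decrement variant selects nothing, while the decrement-only-if-selected variant keeps hell with its count of 200. This only shows how the two variants differ on this example; it does not by itself establish which behaviour the authors intended.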

zhangguanqun commented 3 years ago

@MarkDaoust

zhangguanqun commented 3 years ago

Any author here?

broken commented 3 years ago

Thanks for pointing this out. I'll get one of the authors to take a look, or look more deeply myself if I find time before they get to it.

tensorflow / text

Wrong BPE algorithm implementation? #583