google / sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.

With the unigram algorithm, a constant piece at the end of each sentence does not become a token #1047

Open jogardi opened 3 months ago

jogardi commented 3 months ago

Hi, thanks for your great work on this. I noticed a subtle issue while playing with synthetic examples.

I generate synthetic data where each sentence is a random string followed by a constant piece. The BPE algorithm turns this constant piece into a token as expected, but the unigram algorithm does not add it to the vocabulary.

import io

import numpy as np
import sentencepiece as spm

constant_piece = 'helloWorld'

def rand_str(n=10):
    # Random string drawn from a fixed lowercase alphabet.
    return ''.join(
        np.random.choice(list('bcegijklmnoqruvwxyz'), n)
    )

# Each sentence is a random string followed by the constant piece.
data = [rand_str() + constant_piece for _ in range(1000)]
model = io.BytesIO()
spm.SentencePieceTrainer.train(
    sentence_iterator=iter(data), model_writer=model,
    vocab_size=1000,
    minloglevel=5,
)
sp = spm.SentencePieceProcessor(model_proto=model.getvalue())

ex = data[20]
print([
    sp.IdToPiece(x)
    for x in sp.encode(ex, emit_unk_piece=True)
])

outputs: ['▁uy', 'vx', 'yf', 'p', 'gmn', 'he', 'llo', 'W', 'or', 'ld']

It mostly just gets random tokens. I think it learns 'he', 'llo', 'or' and 'ld' not because it noticed the repeating pattern but just from coincidentally seeing them in the random strings. If I change constant_piece to '123456' then I get no tokens for the repeating pattern and only tokens for the random string: ['▁', 'gll', 'imq', 'xc', 'df', '1', '2', '3', '4', '5', '6']
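
In case it helps reproduce, here is a quick check of whether the constant piece ever made it into the learned vocabulary at all, independent of how a single example gets segmented (a sketch reusing sp and constant_piece from the snippet above; piece_size() and IdToPiece() are the standard SentencePieceProcessor accessors):

vocab = [sp.IdToPiece(i) for i in range(sp.piece_size())]
# Was the whole constant piece learned as a single token?
print(constant_piece in vocab)
# Any learned pieces that contain the constant piece as a substring?
print([p for p in vocab if constant_piece in p])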

This happens specifically because constant_piece is at the end. If I change the data so that constant_piece is at the beginning of each sentence, data = [constant_piece + rand_str() for _ in range(1000)], then I get the expected result ['▁123456', 'uzb', 'ek', 'hoe', 'wr'].
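
For reference, a sketch of the two contrasting runs mentioned above (reusing the imports, rand_str(), constant_piece, and data from the first snippet; model_type='bpe' is the standard trainer option for switching algorithms):

# Same setup, but with the constant piece at the start of each sentence.
prefix_data = [constant_piece + rand_str() for _ in range(1000)]
prefix_model = io.BytesIO()
spm.SentencePieceTrainer.train(
    sentence_iterator=iter(prefix_data), model_writer=prefix_model,
    vocab_size=1000, minloglevel=5,
)

# Original suffix data, trained with BPE instead of unigram.
bpe_model = io.BytesIO()
spm.SentencePieceTrainer.train(
    sentence_iterator=iter(data), model_writer=bpe_model,
    vocab_size=1000, minloglevel=5, model_type='bpe',
)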

TL;DR: unexpected result under the following conditions: