Hi, thanks for your great work on this. I noticed a subtle issue when playing with synthetic examples.
I generate synthetic data where each sentence is a random string followed by a constant piece.
The BPE algorithm works as expected, but the unigram algorithm does not make this constant piece a token in the vocabulary.
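import io

import numpy as np
import sentencepiece as spm

constant_piece = 'helloWorld'

def rand_str(n=10):
    # n random letters drawn from a fixed alphabet
    return ''.join(
        np.random.choice(list('bcegijklmnoqruvwxyz'), n)
    )

# every sentence is a random string followed by the constant piece
data = [rand_str() + constant_piece for _ in range(1000)]

# train in memory; model_type is left at its default (unigram)
model = io.BytesIO()
spm.SentencePieceTrainer.train(
    sentence_iterator=iter(data), model_writer=model,
    vocab_size=1000,
    minloglevel=5,
)
sp = spm.SentencePieceProcessor(model_proto=model.getvalue())

ex = data[20]
print([
    sp.IdToPiece(x)
    for x in sp.encode(ex, emit_unk_piece=True)
])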
outputs: ['▁uy', 'vx', 'yf', 'p', 'gmn', 'he', 'llo', 'W', 'or', 'ld']
It mostly just gets random tokens. I think it gets 'he', 'llo', 'or' and 'ld' not because it noticed the repeating pattern but just by coincidentally seeing them in the random strings. If I change constant_piece to '123456' then the repeating pattern is split into single characters and only the random string gets multi-character tokens:
['▁', 'gll', 'imq', 'xc', 'df', '1', '2', '3', '4', '5', '6']
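Since a single encoding only shows which pieces were chosen for one sentence, it may be clearer to look at the vocabulary itself. Below is a minimal sketch of such a check. The helper names vocab_pieces and covering_pieces are my own, not part of SentencePiece; the only library calls assumed are get_piece_size() and id_to_piece() on the trained processor.

```python
def vocab_pieces(sp):
    """Return every piece in the model's vocabulary as a list of strings.

    `sp` is assumed to be a trained SentencePieceProcessor (anything exposing
    get_piece_size() and id_to_piece() works).
    """
    return [sp.id_to_piece(i) for i in range(sp.get_piece_size())]


def covering_pieces(vocab, target):
    """Return the vocabulary pieces that contain `target` in full.

    The '▁' whitespace marker is ignored, so '▁123456' counts as
    covering '123456'.
    """
    return [p for p in vocab if target in p.replace("▁", "")]
```

With the model trained above, covering_pieces(vocab_pieces(sp), constant_piece) should come back empty if the constant piece never made it into the vocabulary.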
The problem occurs specifically because constant_piece is at the end of each sentence. If I change data so that constant_piece is at the beginning of each sentence:
data = [constant_piece + rand_str() for _ in range(1000)]
then I get the expected result: ['▁123456', 'uzb', 'ek', 'hoe', 'wr']
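To make the comparison easier to reproduce, here is a rough sketch that runs all four combinations (unigram or BPE, constant piece at the end or at the start). It is only a sketch under a few assumptions: make_data, train_and_encode and compare are hypothetical helper names, Python's random module with a fixed seed replaces np.random.choice so runs are repeatable, and model_type is passed explicitly as 'unigram' or 'bpe'.

```python
import io
import random

ALPHABET = "bcegijklmnoqruvwxyz"  # same letters as in the snippet above


def make_data(constant_piece, position="end", n_sentences=1000, n=10, seed=0):
    """Build sentences made of n random letters plus a constant piece.

    position="end" appends the constant piece, anything else prepends it.
    """
    rng = random.Random(seed)  # seeded so repeated runs see the same data
    data = []
    for _ in range(n_sentences):
        noise = "".join(rng.choice(ALPHABET) for _ in range(n))
        if position == "end":
            data.append(noise + constant_piece)
        else:
            data.append(constant_piece + noise)
    return data


def train_and_encode(data, model_type="unigram", vocab_size=1000):
    """Train an in-memory model on `data` and return the pieces for data[20]."""
    import sentencepiece as spm  # imported lazily; make_data does not need it

    model = io.BytesIO()
    spm.SentencePieceTrainer.train(
        sentence_iterator=iter(data), model_writer=model,
        vocab_size=vocab_size, model_type=model_type, minloglevel=5,
    )
    sp = spm.SentencePieceProcessor(model_proto=model.getvalue())
    return sp.encode(data[20], out_type=str)


def compare(constant_piece="123456"):
    """Print the tokenization for every model_type / position combination."""
    for model_type in ("unigram", "bpe"):
        for position in ("end", "start"):
            data = make_data(constant_piece, position)
            print(model_type, position, train_and_encode(data, model_type))
```

Calling compare('123456') prints one tokenization per combination, so the unigram run with the piece at the end can be read side by side with the other three.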
TL;DR
Unexpected result under the following conditions:
same string at end of each sentence in the training data