Seems this is expected behavior. When '@' is specified as a user-defined symbol, '@' is always treated as one piece, meaning that no piece containing '@' inside it will be extracted. Probably the name 'symbol' is misleading. The intention is user-defined-piece.
It would be tricky to perform the expected encoding, but one solution would be:
--user_defined_symbols=@218@,@242@
Then, @218@ and @242@ are each encoded as one piece.
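With the Python bindings, that would look something like this minimal sketch (assumes the `sentencepiece` pip package and that input.txt is the training file from the report below):

```python
import sentencepiece as spm

# Train with the multi-character user-defined pieces suggested above,
# so @218@ and @242@ are each kept as a single piece.
spm.SentencePieceTrainer.train(
    "--input=input.txt --model_prefix=m --vocab_size=4000 "
    "--user_defined_symbols=@218@,@242@"
)
```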
Thanks for the quick response @taku910.
That doesn't quite answer the issue I'm having - I expect the '@' to be treated as a single piece, and I don't expect any other piece to contain an '@' symbol.
What is confusing me is the fact that ▁218 and ▁242 are extracted as pieces (note that these pieces have a space symbol at their beginning), despite the numbers 218 and 242 never occurring in the data with a space directly before them. These numbers only ever occur in the data with an @ symbol on either side of them. As such, I would expect the numbers 218 and 242 to be extracted as pieces, but without a leading ▁ character.
Do you know why the extracted pieces 218 and 242 would include a space character before them?
OK, I got it and reproduced this bug in my environment.
During training, user-defined symbols are simply replaced with ' ', so
This is a sample @218@sentence.
is treated as
This is a sample 218 sentence.
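In effect, the current preprocessing does something like this simplified Python illustration (not the actual C++ code):

```python
# Simplified illustration of the current (buggy) behaviour:
# user-defined symbols are replaced with a space before training.
sentence = "This is a sample @218@sentence."
for sym in ["@"]:                # the user-defined symbols
    sentence = sentence.replace(sym, " ")
print(sentence)
# -> "This is a sample  218 sentence."
# '218' now follows whitespace, so training learns '▁218' instead of '218'.
```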
This causes the bug. I would like to fix it, but I am not sure how easy it will be. Anyway, thank you for the report.
Thanks for tracking it down @taku910, that makes sense.
Instead of replacing the symbols with ' ', would it perhaps be better to use the user-defined symbols to split the sentence into multiple sentences for training?
This would mean that This is a sample @218@sentence
is treated as 3 sentences:
This is a sample
218
sentence
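In illustrative Python (just a sketch of the idea; the real implementation would be C++):

```python
import re

# Sketch of the proposal: split each training sentence on the
# user-defined symbols instead of replacing them with spaces.
def split_on_symbols(sentence, symbols):
    pattern = "|".join(re.escape(s) for s in symbols)
    return [part for part in re.split(pattern, sentence) if part]

print(split_on_symbols("This is a sample @218@sentence", ["@"]))
# -> ['This is a sample ', '218', 'sentence']
```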
Would that work? Unfortunately I am not well-versed in C, so I may not be too helpful in contributing a patch for that modification.
Thank you for the suggestion. Actually, '\t' is reserved as the piece-boundary marker in sentencepiece, so just replacing user-defined symbols with '\t' works.
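In other words, the fix amounts to something like this simplified sketch:

```python
# Sketch of the fix: replace user-defined symbols with '\t', the
# reserved piece-boundary marker, rather than with a space.
sentence = "This is a sample @218@sentence."
print(sentence.replace("@", "\t"))
# -> "This is a sample \t218\tsentence."
# '218' is now delimited by boundary markers, not preceded by a space,
# so the trainer can learn the piece '218' rather than '▁218'.
```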
Let me close this bug. Please reopen it if you find any issues.
I'm not sure if this is a bug or by design, but I am experiencing some weird segmentation behaviour when using --user_defined_symbols to train sentencepiece.
It seems that sentencepiece does not take these symbols into account when training, and instead actually converts them to spaces for the segmentation process. This results in some subword tokens being included in the generated vocabulary, despite never actually appearing in the training text.
My input data is annotated with @ characters throughout to represent preprocessing information. I want to define @ as a user-defined symbol. For example, many of my training sentences are in a format similar to the following:
I train with the following command:
spm_train --input=input.txt --model_prefix=m --vocab_size=4000 --user_defined_symbols=@
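(Equivalently, via the Python bindings — a minimal sketch assuming the `sentencepiece` pip package; m.vocab is the piece/score TSV that spm_train writes alongside the model:)

```python
import sentencepiece as spm

# Same training run as the spm_train command above.
spm.SentencePieceTrainer.train(
    "--input=input.txt --model_prefix=m --vocab_size=4000 "
    "--user_defined_symbols=@"
)

# m.vocab lists piece<TAB>score, one per line; peek at the top entries.
with open("m.vocab", encoding="utf-8") as f:
    for line in list(f)[:10]:
        print(line.rstrip("\n"))
```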
However, the vocabulary generated is not what I would expect.
Expected Vocab
Actual Vocab
As you can see, two of the top-ranking tokens in the actual vocab are ▁218 and ▁242, despite the numbers 218 and 242 never being preceded by a space in the training data. Intuitively, I would expect the tokens 218 and 242 (without a preceding space) to be high in the vocabulary instead.
Encoding still separates the user_defined_symbols as expected, but unfortunately the vocabulary contains useless tokens (▁218 and ▁242) and is missing desirable tokens (218 and 242), meaning that they are suboptimally encoded as 21 8 and 2 42 respectively.
Expected Encoding
Actual Encoding
Is this expected behaviour? And if not, is there an easy fix for this?
Thanks in advance!