Seems this is expected behavior. When '@' is specified as a user-defined symbol, '@' is always treated as one piece, meaning that no piece containing '@' inside it will be extracted. Probably the name 'symbol' is misleading. The intention is user-defined-piece.
It would be tricky to perform the expected encoding, but one solution would be:
--user_defined_symbols=@218@,@242@
Then, @218@ and @242@ are each encoded as one piece.
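With the Python bindings, that would look something like this minimal sketch (assumes the `sentencepiece` pip package and that input.txt is the training file from the report below):

```python
import sentencepiece as spm

# Train with the multi-character user-defined pieces suggested above,
# so @218@ and @242@ are each kept as a single piece.
spm.SentencePieceTrainer.train(
    "--input=input.txt --model_prefix=m --vocab_size=4000 "
    "--user_defined_symbols=@218@,@242@"
)
```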
Thanks for the quick response @taku910.
That doesn't quite answer the issue I'm having - I expect the '@' to be treated as a single piece, and I don't expect any other piece to contain an '@' symbol.
What is confusing me is the fact that ▁218 and ▁242 are extracted as pieces (note that these pieces have a space symbol at their beginning), despite the numbers 218 and 242 never occurring in the data with a space directly before them. These numbers only ever occur in the data with an @ symbol on either side of them. As such, I would expect the numbers 218 and 242 to be extracted as pieces, but without a leading ▁ character.
Do you know why the extracted pieces 218 and 242 would include a space character before them?
OK, I got it and reproduced this bug in my environment.
During training, user-defined symbols are simply replaced with ' ', so
This is a sample @218@sentence.
is treated as
This is a sample 218 sentence.
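In effect, the current preprocessing does something like this simplified Python illustration (not the actual C++ code):

```python
# Simplified illustration of the current (buggy) behaviour:
# user-defined symbols are replaced with a space before training.
sentence = "This is a sample @218@sentence."
for sym in ["@"]:                # the user-defined symbols
    sentence = sentence.replace(sym, " ")
print(sentence)
# -> "This is a sample  218 sentence."
# '218' now follows whitespace, so training learns '▁218' instead of '218'.
```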
This causes the bug. I would like to fix it, but I am not sure how easy it will be. Anyway, thank you for the report.
Thanks for tracking it down @taku910, that makes sense.
Instead of replacing the symbols with ' ', would it perhaps be better to use the user-defined symbols to split the sentence into multiple sentences for training?
This would mean that This is a sample @218@sentence
is treated as 3 sentences:
This is a sample
218
sentence
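In illustrative Python (just a sketch of the idea; the real implementation would be C++):

```python
import re

# Sketch of the proposal: split each training sentence on the
# user-defined symbols instead of replacing them with spaces.
def split_on_symbols(sentence, symbols):
    pattern = "|".join(re.escape(s) for s in symbols)
    return [part for part in re.split(pattern, sentence) if part]

print(split_on_symbols("This is a sample @218@sentence", ["@"]))
# -> ['This is a sample ', '218', 'sentence']
```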
Would that work? Unfortunately I am not well-versed in C, so I may not be too helpful in contributing a patch for that modification.
Thank you for the suggestion. Actually, '\t' is reserved as the piece-boundary marker in sentencepiece, so just replacing user-defined symbols with '\t' works.
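In other words, the fix amounts to something like this simplified sketch:

```python
# Sketch of the fix: replace user-defined symbols with '\t', the
# reserved piece-boundary marker, rather than with a space.
sentence = "This is a sample @218@sentence."
print(sentence.replace("@", "\t"))
# -> "This is a sample \t218\tsentence."
# '218' is now delimited by boundary markers, not preceded by a space,
# so the trainer can learn the piece '218' rather than '▁218'.
```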
Let me close this bug. Please reopen it if you find any issues.
I'm not sure if this is a bug or by design, but I am experiencing some weird segmentation behaviour when using --user_defined_symbols to train sentencepiece.
It seems that sentencepiece does not take these symbols into account when training, and instead actually converts them to spaces for the segmentation process. This results in some subword tokens being included in the generated vocabulary, despite never actually appearing in the training text.
My input data is annotated with @ characters throughout to represent preprocessing information. I want to define @ as a user-defined symbol. For example, many of my training sentences are in a format similar to the following:
I train with the following command:
spm_train --input=input.txt --model_prefix=m --vocab_size=4000 --user_defined_symbols=@
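(Equivalently, via the Python bindings — a minimal sketch assuming the `sentencepiece` pip package; m.vocab is the piece/score TSV that spm_train writes alongside the model:)

```python
import sentencepiece as spm

# Same training run as the spm_train command above.
spm.SentencePieceTrainer.train(
    "--input=input.txt --model_prefix=m --vocab_size=4000 "
    "--user_defined_symbols=@"
)

# m.vocab lists piece<TAB>score, one per line; peek at the top entries.
with open("m.vocab", encoding="utf-8") as f:
    for line in list(f)[:10]:
        print(line.rstrip("\n"))
```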
However, the vocabulary generated is not what I would expect.
Expected Vocab
Actual Vocab
As you can see, two of the top-ranking tokens in the actual vocab are ▁218 and ▁242, despite the numbers 218 and 242 never being preceded by a space in the training data. Intuitively, I would expect the tokens 218 and 242 (without a preceding space) to be high in the vocabulary instead.
Encoding still separates the user_defined_symbols as expected, but unfortunately the vocabulary contains useless tokens (▁218 and ▁242) and is missing desirable tokens (218 and 242), meaning that they are suboptimally encoded as 21 8 and 2 42 respectively.
Expected Encoding
Actual Encoding
Is this expected behaviour? And if not, is there an easy fix for this?
Thanks in advance!