google / sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.
Apache License 2.0

Unable to tokenize <s>, <pad>, </s>, and <unk> correctly in Python #667

Closed ryonakamura closed 3 years ago

ryonakamura commented 3 years ago

Piece-to-id conversion works fine, but encoding text that contains these tokens into ids fails.

print(sp.id_to_piece(2))
print(sp.piece_to_id("</s>"))
print(sp["</s>"])
print(sp.encode("</s>"))
print(sp.decode(sp.encode("</s>")))

output:

</s>
2
2
[13, 3, 403, 488, 3]
 ⁇ /s ⁇ 

Other special tokens fail to tokenize as well.

for x in ["<s>", "<pad>", "</s>", "<unk>"]:
    print(" ".join(sp.EncodeAsPieces(x)))

output:

< s >
< pad >
< / s >
< un k >

Maybe we need to add extra_options? We already call sp.SetEncodeExtraOptions("bos:eos"), which, as far as we understand, only adds <s> and </s> automatically around the tokenized output, so <s> and </s> appearing inside the input text still fail to tokenize. <pad> and <unk> are not covered either.
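For reference, this is roughly how we use the extra options (a sketch; the model path is a placeholder):

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("spm.model")
sp.SetEncodeExtraOptions("bos:eos")
# Every encode() result is now wrapped with bos_id and eos_id (0 and 2 in our setup),
# but a literal "</s>" inside the text is still split into ordinary pieces, not id 2.
print(sp.encode("hello </s> world"))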

We cannot add </s> etc. to user_defined_symbols now since we have already trained a large model.

For now we are working around this with ad-hoc post-processing of the tokenized text:

text = text.replace("< s >", "<s>")
text = text.replace("< pad >", "<pad>")
text = text.replace("< / s >", "</s>")
text = text.replace("< un k >", "<unk>")

Our Settings

training:

spm.SentencePieceTrainer.train(
    normalization_rule_name="identity",
    input=path,
    model_prefix=os.path.join(data_dir, "spm"),
    vocab_size=vocab_size,
    # 0.9995 for languages with a rich character set such as Japanese or Chinese.
    character_coverage=0.9995,
    # "<unk>" must not be defined with "control_symbols" and "user_defined_symbols".
    user_defined_symbols=user_defined_symbols,
    bos_id=0,
    pad_id=1,
    eos_id=2,
    unk_id=3,
)

loading:

sp = spm.SentencePieceProcessor()
sp.Load(path)
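As a quick sanity check after loading (the ids below follow our training settings):

print(sp.bos_id(), sp.pad_id(), sp.eos_id(), sp.unk_id())  # 0 1 2 3
print(sp.id_to_piece(0), sp.id_to_piece(1), sp.id_to_piece(2), sp.id_to_piece(3))  # <s> <pad> </s> <unk>
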
taku910 commented 3 years ago

This is expected behavior. Please see https://github.com/google/sentencepiece/issues/215

<s>, <pad>, </s>, and <unk> are defined as 'control symbols' that should not appear in the input.
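You can confirm the piece type from Python, e.g. (sketch, using the processor loaded above):

print(sp.is_control(sp.eos_id()))          # True: "</s>" is a control symbol
print(sp.is_control(sp.piece_to_id("s")))  # False: ordinary (or unknown) pieces are not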

We can work around this restriction by defining them as 'user defined symbols'. However, we don't recommend it, especially in a user-facing product, as users can tweak the model's behavior just by injecting these tokens into the input.

--user_defined_symbols='<s>,</s>,<pad>,<unk>'
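If retraining is an option, a minimal sketch from Python (paths and vocab size are placeholders; note that <unk> cannot be user-defined, as the comment in your training script already says):

import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="spm_user_defined",
    vocab_size=8000,
    user_defined_symbols=["<s>", "</s>", "<pad>"],
    bos_id=0,
    pad_id=1,
    eos_id=2,
    unk_id=3,
)

sp = spm.SentencePieceProcessor()
sp.Load("spm_user_defined.model")
print(sp.EncodeAsPieces("<s> hello </s>"))  # the special tokens now come back as single pieces
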
LorrinWWW commented 2 years ago

Hi, do you have any idea if I could add user_defined_symbols to pretrained sp tokenizers? Thank you in advance! @taku910