google-research / albert

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

Wrong pieces for control symbols after loading SentencePieceProcessor from official model #247

Closed · JanKaul closed this issue 2 years ago

JanKaul commented 2 years ago

I'm trying to use ALBERT for a question answering task, so I want to encode my input text with SentencePiece and feed it to the ALBERT model. I initialize the SentencePiece processor by loading the model file from one of the official tar files. The encoding seems to work fine except for the control symbols: the input [CLS] gets encoded into three pieces, while the expected behavior would be a single piece. The same happens for [SEP].

Here is an example:

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='albert/albert_base_v2/albert_base/30k-clean.model')

print(sp.encode('[CLS]', out_type=str))

Output: ['▁[', 'CLS', ']']

Am I doing something wrong? Is there a way to specify the control symbols without retraining the model? I would like to avoid training a new model every time I load it. I would really appreciate your help. Thank you
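For reference, a minimal check, assuming the same model file as in the snippet above: the [CLS] piece does exist in the vocabulary with a reserved id, but encode() never produces it from raw text.

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='albert/albert_base_v2/albert_base/30k-clean.model')

# '[CLS]' is present in the vocabulary and has a reserved id...
cls_id = sp.piece_to_id('[CLS]')
print(cls_id, sp.id_to_piece(cls_id))

# ...but encode() still splits the raw string into ordinary pieces:
print(sp.encode('[CLS]', out_type=str))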

Sid911 commented 2 years ago

@JanKaul I see the same behavior in my experiments, and I don't know why that is the case. Maybe in the near future I can train the spm model myself from source.
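For anyone who does retrain: whether a symbol is matched in raw text is decided at training time by which trainer flag it is passed to. A minimal sketch using the control_symbols and user_defined_symbols trainer options, where corpus.txt, the vocab size, and <my_marker> are placeholder choices, not ALBERT's actual settings:

import sentencepiece as spm

# Placeholder corpus and vocab size, for illustration only.
spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='toy',
    vocab_size=1000,
    control_symbols=['[CLS]', '[SEP]'],    # ids are reserved, but never matched in raw text
    user_defined_symbols=['<my_marker>'],  # always extracted as a single piece
)

sp = spm.SentencePieceProcessor(model_file='toy.model')
print(sp.encode('[CLS]', out_type=str))        # still split into several pieces
print(sp.encode('<my_marker>', out_type=str))  # kept as one piece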

JanKaul commented 2 years ago

I might have an idea why that's the case. You have to distinguish between user-defined symbols and control symbols. According to the SentencePiece documentation, user-defined symbols are always handled as a single piece wherever they appear in the input text, while control symbols only have reserved ids and are never extracted from raw input text; the user is expected to insert their ids explicitly.

I think [CLS] and [SEP] are added as control symbols, so they have to be added manually after encoding.
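A minimal sketch of that manual step, assuming the same model file as above; the question and context strings are placeholders:

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='albert/albert_base_v2/albert_base/30k-clean.model')

# Look up the reserved ids instead of hard-coding them.
cls_id = sp.piece_to_id('[CLS]')
sep_id = sp.piece_to_id('[SEP]')

question = 'What is ALBERT?'
context = 'ALBERT is a lite BERT variant.'

# encode() never emits control symbols, so build the
# [CLS] question [SEP] context [SEP] sequence by hand:
ids = [cls_id] + sp.encode(question) + [sep_id] + sp.encode(context) + [sep_id]
print(ids)
print([sp.id_to_piece(i) for i in ids])

Looking the ids up with piece_to_id avoids hard-coding them and works for any of the released models.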