neuralmind-ai / portuguese-bert

Portuguese pre-trained BERT models

Pretrained tokenizer does not recognize special characters `º` `ª` #5

Closed: josepsmartinez closed this issue 4 years ago

josepsmartinez commented 4 years ago

If a word contains the special character `º` or `ª`, the whole word is tokenized and decoded as `[UNK]` (unknown).

Code that demonstrates the issue:

from transformers import AutoTokenizer

model_dir = "models/portuguese-bert-lener_br"
tokenizer = AutoTokenizer.from_pretrained(model_dir)

def encode_decode(t):
    # Round-trip a string through the tokenizer and back to text.
    return tokenizer.decode(tokenizer.encode(t))

print(encode_decode("nº 396"))     # the word containing 'º' comes back as [UNK]
print(encode_decode("n 396"))      # without the symbol, the word is preserved

print(encode_decode("Srª Maria"))  # likewise, 'Srª' is decoded as [UNK]
print(encode_decode("Sr Maria"))

Is there anything I am doing wrong, or is the model incapable of embedding words that contain these characters?

fabiocapsouza commented 4 years ago

Hi jotapem,

Sorry for the delayed response. Yes, unfortunately these symbols were not included in the generated WordPiece vocabulary.
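
If you need these symbols, one workaround is to register them as added tokens and resize the model's embedding matrix. A minimal sketch using the standard transformers add_tokens / resize_token_embeddings API, assuming the same model_dir as above (the new embedding rows are randomly initialized, so they only become useful after fine-tuning):

from transformers import AutoModel, AutoTokenizer

model_dir = "models/portuguese-bert-lener_br"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModel.from_pretrained(model_dir)

# Register the missing symbols and grow the embedding matrix to match
# the enlarged vocabulary.
num_added = tokenizer.add_tokens(["º", "ª"])
model.resize_token_embeddings(len(tokenizer))

print(num_added)                      # 2 if both symbols were missing
print(tokenizer.tokenize("nº 396"))   # 'º' should now be split off as its own token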

josepsmartinez commented 4 years ago

Thank you for the answer.