Closed MatheusNtg closed 3 years ago
Hi @MatheusNtg. I understand your problem. I think this is due to the tokenizer doing some "cleaning" operations which strip accents. I will definitely fix that, but I am currently working on more pressing matters. In the meantime, I think you can bypass the `tokenizer.tokenize()` step altogether and simply split by spaces + lowercase the input manually 😊
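As a rough illustration of that workaround (a minimal sketch; `manual_pre_tokenize` is a hypothetical helper name, and the exact pre-tokenization your model expects may differ):

```python
def manual_pre_tokenize(text: str) -> list[str]:
    # Bypass tokenizer.tokenize(): just lowercase and split on whitespace.
    # This leaves accented characters such as "ç", "ê" and "ã" untouched.
    return text.lower().split()

print(manual_pre_tokenize("Função de tokenização"))
# -> ['função', 'de', 'tokenização']
```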
@helboukkouri, actually this isn't a bug. If we just pass the parameter `strip_accents=False` to the `CharacterBertTokenizer`, it processes those characters correctly.
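For context, BERT-style basic tokenizers strip accents by NFD-normalizing the text and dropping combining marks; a minimal sketch of that cleaning step (using only Python's `unicodedata`, and assuming `CharacterBertTokenizer` follows the same logic when accent stripping is left enabled):

```python
import unicodedata

def strip_accents(text: str) -> str:
    # NFD decomposes "ç" into "c" + a combining cedilla; dropping all
    # characters in category "Mn" (nonspacing marks) removes the accent.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(strip_accents("coração"))  # -> "coracao"
```

Passing `strip_accents=False` to the tokenizer simply skips this step, which is why the Portuguese characters then come through intact.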
Yup, that's one way to do it I guess 😊 Enjoy the model!
Hi, I was using the code from https://github.com/helboukkouri/transformers/tree/add-character-bert to generate the `input_ids` for the model that I trained for the Portuguese language when I noticed the following situation:
I was using this code to generate the input for the model:
and the value of the `tokens` variable made me wonder whether the `CharacterBertTokenizer` supports Latin characters such as `ç`, `ê` and `ã` (since they were replaced by `c`, `e` and `a` respectively), or whether I'm using the class in the wrong way. Could you please clarify this for me?