helboukkouri / character-bert

Main repository for "CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters"
Apache License 2.0

CharacterBertTokenizer support for Latin characters #14

Closed MatheusNtg closed 3 years ago

MatheusNtg commented 3 years ago

Hi, I was using the code from https://github.com/helboukkouri/transformers/tree/add-character-bert to generate the input_ids for the model that I trained for the Portuguese language when I noticed the following situation:

I was using this code to generate the input for the model:

text = "Você e maçã"
tokenizer = CharacterBertTokenizer()
tokens = tokenizer.tokenize(text)  # this returns ['voce', 'e', 'maca']
input_ids = tokenizer.encode(text, return_tensors='pt')

The value of the tokens variable made me wonder whether CharacterBertTokenizer supports Latin characters such as ç, ê and ã (since they were replaced by c, e and a respectively), or whether I'm using the class incorrectly. Could you please clarify this for me?

helboukkouri commented 3 years ago

Hi @MatheusNtg. I understand your problem. I think that this is due to the tokenizer doing some "cleaning" operations which strips accents. I will definitely fix that but I am currently working on more pressing matters. In the meantime, I think you can bypass the tokenizer.tokenize() step altogether and simply split by spaces + lowercase the input manually 😊
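The manual bypass suggested above can be sketched in plain Python (assuming words are whitespace-separated, which is what tokenize() would otherwise split on):

```python
# Skip tokenizer.tokenize() entirely: lowercase and split on whitespace,
# which keeps accented characters such as ç, ê and ã intact.
text = "Você e maçã"
tokens = text.lower().split()
print(tokens)  # → ['você', 'e', 'maçã']
```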

MatheusNtg commented 3 years ago

@helboukkouri, actually this isn't a bug. If we just pass strip_accents=False to the CharacterBertTokenizer, those characters are processed correctly.
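For reference, the accent stripping that strip_accents=False disables works roughly like the cleaning step in BERT's basic tokenizer: NFD-normalize the text, then drop combining marks. A minimal sketch of that behavior (stdlib only; strip_accents here is an illustrative helper, not the library's function):

```python
import unicodedata

def strip_accents(text):
    # NFD decomposition splits 'ç' into 'c' + a combining cedilla;
    # dropping category-Mn characters removes those accent marks.
    text = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

print(strip_accents("Você e maçã"))  # → 'Voce e maca'
```

With strip_accents=False the tokenizer skips this step, so 'você' and 'maçã' survive tokenization unchanged.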

helboukkouri commented 3 years ago

Yup, that's one way to do it I guess 😊 Enjoy the model!