neuralmind-ai / portuguese-bert

Portuguese pre-trained BERT models

Encoding issue with HAREM dataset #26

Open jonatasgrosman opened 3 years ago

jonatasgrosman commented 3 years ago

Hi @fabiocapsouza, I think there are some encoding issues in the HAREM dataset. Take a look at the first sample of FirstHAREM-total-train.json: words like "ASSOCIAÇÃO" appear as "ASSOCIA\u00c7\u00c3O", no matter which encoding is used to open the file.

Looking at the pre-processing scripts you used, it seems that you didn't force the encoding when opening the HAREM XML files (which are originally encoded in Windows-1252, I think). That's probably the root of this encoding issue.
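
A minimal sketch of what forcing the encoding could look like, assuming the source files really are Windows-1252 (unverified) and are read as plain text. The function names `read_legacy_text` and `write_json_utf8` are hypothetical and are not taken from the repository's scripts. Note that Python's `json.dump` escapes non-ASCII characters as `\uXXXX` by default (`ensure_ascii=True`), so the sketch also passes `ensure_ascii=False` to keep accented characters readable in the output file.

```python
import json


def read_legacy_text(path, encoding="cp1252"):
    """Read a text file with an explicit encoding instead of the platform default.

    "cp1252" is Python's codec name for Windows-1252; using it here is an
    assumption based on the guess in this issue.
    """
    with open(path, encoding=encoding) as f:
        return f.read()


def write_json_utf8(obj, path):
    """Write JSON as UTF-8 and keep non-ASCII characters unescaped."""
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False writes "Ç" literally rather than as "\u00c7".
        json.dump(obj, f, ensure_ascii=False)
```

For example, `write_json_utf8({"text": read_legacy_text("some_file.xml")}, "out.json")` would read a (hypothetical) `some_file.xml` as Windows-1252 and store its text in `out.json` with the accented characters written directly.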