google-research / bert

TensorFlow code and pre-trained models for BERT
https://arxiv.org/abs/1810.04805
Apache License 2.0

accent character #886

Open lytum opened 4 years ago

lytum commented 4 years ago

Hello,

In BERT's tokenization.py, why are accents stripped away? The vocab file of the multi_cased model, which supports multiple languages, contains many accented characters, so stripping them seems inconsistent with the vocabulary.

Thanks,
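
For context, here is a minimal sketch of what accent stripping usually looks like, assuming it is done by decomposing text with Unicode NFD normalization and dropping combining marks (category `Mn`). The function name `strip_accents` is illustrative only and is not taken from the repository. In tokenization.py this step appears to be tied to lowercasing, which might explain why a cased vocabulary still contains accented characters, but that is an assumption to check against the code.

```python
import unicodedata


def strip_accents(text):
    """Remove combining accent marks from text (illustrative helper)."""
    # NFD splits a precomposed character such as "é" into a base
    # letter "e" followed by a combining mark (U+0301).
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" (Mark, nonspacing) covers the combining accents;
    # dropping them leaves only the base letters.
    return "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )


if __name__ == "__main__":
    print(strip_accents("caf\u00e9"))
```

With a step like this applied before vocabulary lookup, an accented token in the vocab file could never be matched, which is the mismatch the question is pointing at.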

quaizarv commented 4 years ago

+1