huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

StripAccents doesn't work #1496

Closed NivinaNull closed 1 month ago

NivinaNull commented 2 months ago

I wrote the following few lines of code to test StripAccents, but it did not work:

"""
from tokenizers import normalizers
from tokenizers.normalizers import Strip, StripAccents

normalizer = normalizers.Sequence([Strip(), StripAccents()])
print(normalizer.normalize_str("Héllò hôw are ü? "))
"""

ArthurZucker commented 3 weeks ago

I can indeed reproduce this, but I think it is expected. This works:

normalizer = normalizers.Sequence([normalizers.NFKD(), StripAccents()])
>>> Hello how are u? e
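
A likely reason the NFKD step matters: accent stripping of this kind removes combining marks, and a precomposed character such as "é" (U+00E9) is a single code point with no separate mark to remove. Decomposing first splits it into a base letter plus a combining accent, which can then be dropped. The snippet below is a minimal sketch of that idea using only Python's standard `unicodedata` module. The helper `strip_combining_marks` is a hypothetical stand-in written for illustration, not the tokenizers implementation, and it assumes that dropping every code point in a mark category ("M...") is a fair approximation of what StripAccents does.

```python
import unicodedata


def strip_combining_marks(text: str) -> str:
    """Drop code points whose Unicode category is a mark (Mn, Mc, Me).

    Hypothetical helper for illustration; it only approximates what an
    accent-stripping normalizer does.
    """
    return "".join(ch for ch in text if not unicodedata.category(ch).startswith("M"))


# Same sentence as in the report, written with escapes so the accented
# letters are guaranteed to be precomposed (single code points).
text = "H\u00e9ll\u00f2 h\u00f4w are \u00fc? "

# Without decomposition there are no combining marks, so nothing changes.
unchanged = strip_combining_marks(text)

# NFKD splits each accented letter into base letter + combining mark.
decomposed = unicodedata.normalize("NFKD", text)
stripped = strip_combining_marks(decomposed)

print(repr(unchanged))
print(repr(stripped))
```

In this sketch the first call returns the input untouched, while the decompose-then-strip path yields plain ASCII letters. The trailing space survives because nothing here plays the role of Strip().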