huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

NormalizedString.clear() broken? #1636

Open lkurlandski opened 2 weeks ago

lkurlandski commented 2 weeks ago

Hello. I think there are some problems with NormalizedString (tokenizers 0.15.2).

In the following example, append() works as expected.

from tokenizers import NormalizedString

s = NormalizedString("Hi.")  # NormalizedString(original="Hi.", normalized="Hi.")
s.append("Hello.")  # NormalizedString(original="Hi.", normalized="Hi.Hello.")

After using clear(), append() no longer modifies the normalized attribute.

from tokenizers import NormalizedString

s = NormalizedString("Hi.")  # NormalizedString(original="Hi.", normalized="Hi.")
s.clear()  # NormalizedString(original="Hi.", normalized="")
s.append("Hello.")  # NormalizedString(original="Hi.", normalized="")

The same problem occurs with prepend(): after clear(), it also leaves normalized empty.
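
One possible explanation, stated here as an assumption and not confirmed anywhere in this thread: if append() and prepend() attach the new text relative to an existing character of the normalized string (the last one for append, the first one for prepend), they would have nothing to anchor to once clear() has emptied it, and would silently do nothing. The sketch below is a pure-Python toy model of that hypothesis. ToyNormalized and its methods are hypothetical stand-ins and are not the library's implementation.

```python
class ToyNormalized:
    """Hypothetical stand-in for NormalizedString; NOT the tokenizers code."""

    def __init__(self, text: str) -> None:
        self.original = text
        self.normalized = text

    def clear(self) -> None:
        # Empties the normalized text while keeping the original untouched.
        self.normalized = ""

    def append(self, s: str) -> None:
        # Assumption: new text is anchored to the last existing character,
        # so the call is a no-op when the normalized text is empty.
        if self.normalized:
            self.normalized = self.normalized + s

    def prepend(self, s: str) -> None:
        # Assumption: new text is anchored to the first existing character,
        # so the call is a no-op when the normalized text is empty.
        if self.normalized:
            self.normalized = s + self.normalized


s = ToyNormalized("Hi.")
s.append("Hello.")
after_append = s.normalized          # append on a non-empty string

s.clear()
s.append("Hello.")
after_clear_append = s.normalized    # append after clear()
s.prepend("Hello.")
after_clear_prepend = s.normalized   # prepend after clear()

print(repr(after_append), repr(after_clear_append), repr(after_clear_prepend))
```

If the real cause is along these lines, a fix would need to handle the empty case explicitly in both append() and prepend() instead of returning early.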

ArthurZucker commented 1 week ago

Indeed. Would you like to have a go at it and open a PR? 🤗