huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

trie split bug #34499

Open zpp13 opened 4 weeks ago

zpp13 commented 4 weeks ago

System Info

Who can help?

@ArthurZucker and @itazap

Reproduction

from transformers.tokenization_utils import Trie

trie = Trie()
trie.add("abc")
trie.add("b")
trie.split("ab cd")

['ab c', 'd']

Expected behavior

In my opinion, this should return ['a', 'b', 'cd'], but it returns ['ab c', 'd']. This is my first submission, so sorry if I have misunderstood something.
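
For comparison, here is a minimal sketch of a naive reference splitter, assuming leftmost-then-longest matching is the intended behavior. The name `reference_split` is hypothetical and is not part of transformers; it is not how `Trie.split` is implemented, only a slow baseline to check results against. Note that it keeps whitespace, so under this assumption the last piece would be ' cd' rather than 'cd'.

```python
def reference_split(text, tokens):
    """Split `text` on `tokens`, preferring the leftmost, then the longest, match.

    Hypothetical helper for comparison only; not part of transformers.
    """
    # Try longer tokens first so the longest match wins at each position.
    ordered = sorted((t for t in tokens if t), key=len, reverse=True)
    parts = []
    chunk_start = 0  # start of the current run of unmatched characters
    i = 0
    while i < len(text):
        match = next((t for t in ordered if text.startswith(t, i)), None)
        if match is None:
            i += 1
            continue
        if chunk_start < i:
            # Flush the unmatched text that precedes the token.
            parts.append(text[chunk_start:i])
        parts.append(match)
        i += len(match)
        chunk_start = i
    if chunk_start < len(text):
        parts.append(text[chunk_start:])
    return parts


print(reference_split("ab cd", ["abc", "b"]))  # ['a', 'b', ' cd']
```

With this baseline, "abc" never matches in "ab cd" because the space interrupts it, so only "b" is split out.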

ArthurZucker commented 4 weeks ago

Hey! The trie is mostly for internal usage, so I think this is expected! (We don't take spaces into account, AFAIK.)

zpp13 commented 4 weeks ago

This error does not occur only with spaces. For example:

[Screenshot, 2024-10-31, 9:32:29 AM]

zpp13 commented 2 weeks ago

@ArthurZucker and @itazap

ArthurZucker commented 3 days ago

Hey, sorry, but the trie is not meant for general-purpose usage. If it is breaking a tokenizer, then sure, let's fix it, but that does not seem to be the case!