Closed ivankrylatskoe closed 4 months ago
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Ping
Sorry, I did not have a look.
This is a bit strange, but I think the recent update to tokenizers will help you. You should set `prepend_scheme: "first"`:
```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_str(
    """
    {
        "version": "1.0",
        "added_tokens": [
            {
                "id": 0,
                "content": "<s>",
                "single_word": false,
                "lstrip": false,
                "rstrip": false,
                "normalized": false,
                "special": true
            }
        ],
        "pre_tokenizer": {
            "type": "Metaspace",
            "replacement": "X",
            "add_prefix_space": true,
            "prepend_scheme": "first"
        },
        "model": {
            "type": "BPE",
            "vocab": {
                "<s>": 0,
                "<": 1,
                ">": 2,
                "X": 3,
                "a": 4,
                "s": 5
            },
            "merges": []
        }
    }
    """
)
result_tokens = tokenizer.encode("aaa<s><").tokens
print(result_tokens)
```

['X', 'a', 'a', 'a', '<s>', '<']
@ArthurZucker, hi! Thanks for your answer!
Please check my example: `print(tokenizer.encode('<s>a').tokens)`. With the latest tokenizers and your tokenizer setup I still get a strange result: `['<s>', 'X', 'a']`
Ping
Still no solution.
I am not getting this on 0.19:
Hi! Yes, `encode` works.
But `tokenizer.pre_tokenizer.pre_tokenize_str` still doesn't work, so the problem is not solved.
This is expected: the `pre_tokenizer` does not have access to the information about the special tokens, so it will always prepend, regardless of whether the first token is a special token or not.
```python
In [7]: tokenizer.pre_tokenizer.pre_tokenize_str("Xaaa")
Out[7]: [('Xaaa', (0, 4))]
```
As long as the prefix is not added twice, it's working as expected, I believe.
Please consider the following cases and give your opinion on whether each is a bug or not.
Base setup
Output:
{'a': 4, '<s>': 0, '<': 1, '>': 2, 'X': 3, 's': 5}
Alternatively, you may get the same tokenizer in the following way:
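The snippet for this alternative did not survive in the thread; below is a minimal sketch of one way to build an equivalent tokenizer programmatically. The vocab, Metaspace settings, and special token are taken from the JSON configuration above; the exact original code is unknown.

```python
from tokenizers import Tokenizer, pre_tokenizers
from tokenizers.models import BPE

vocab = {"<s>": 0, "<": 1, ">": 2, "X": 3, "a": 4, "s": 5}
tokenizer = Tokenizer(BPE(vocab=vocab, merges=[]))
# Metaspace with "X" as the replacement character and the "first"
# prepend scheme, as in the JSON configuration above.
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace(
    replacement="X", prepend_scheme="first"
)
# Mark "<s>" (already id 0 in the vocab) as a special token.
tokenizer.add_special_tokens(["<s>"])
print(tokenizer.get_vocab())
# vocab content: {'<s>': 0, '<': 1, '>': 2, 'X': 3, 'a': 4, 's': 5}
# (dict order may vary)
```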
Case 1
First, let's check Metaspace.
Output:
[('X<s>a', (0, 4))]
This is OK: the pre-tokenizer added X at the beginning of the text.
Output:
['<s>', 'X', 'a']
Why is the token order reversed???
Case 2
Now let's check our custom pre-tokenizer.
Output:
[('X<s>a', (0, 4))]
We get the same result as in Case 1. That's OK.
Output:
['<s>', 'X', 'a']
We get the same result as in Case 1. Is it not OK?
Case 3
Adding more than one character.
Output:
[('XXXXXXX<s>a', (0, 4))]
OK.
Output:
['<s>', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'a']
Now it's a long jump. Why?
Case 4
Adding a special token.
Output:
[('<s><s>a', (0, 4))]
OK.
Output:
['<s>', '<', 's', '>', 'a']
Why do the tokens get tokenized in a different way???
Output:
[('<s><s><s><s>aaaaa', (0, 14))]
OK.
Output:
['<s>', '<s>', '<s>', '<', 's', '>', 'a', 'a', 'a', 'a', 'a']
Again, why different results for the same token?
Case 5
Adding several special tokens.
Output:
[('<s><s><s><s><s>aaaaa', (0, 11))]
OK.
Output:
['<s>', '<s>', '<', 's', '>', '<', 's', '>', '<', 's', '>', 'a', 'a', 'a', 'a', 'a']
Isn't it a mess??
Case 6
Let's check the correctness of an empty pre-tokenizer.
Output:
[('<s><s><s><s><s>aaaaa', (0, 20))]
OK.
Output:
['<s>', '<s>', '<s>', '<s>', '<s>', 'a', 'a', 'a', 'a', 'a']
Finally, it's OK.
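The code for this last case is not shown in the thread; one way to reproduce an "empty" pre-tokenizer is a custom no-op pre-tokenizer, sketched below with the toy vocab from the base setup (the original snippet may have done this differently).

```python
from tokenizers import Tokenizer, pre_tokenizers
from tokenizers.models import BPE

class NoOpPreTokenizer:
    """A pre-tokenizer that leaves the input untouched."""
    def pre_tokenize(self, pretok):
        pass  # no splitting, no prefix

vocab = {"<s>": 0, "<": 1, ">": 2, "X": 3, "a": 4, "s": 5}
tokenizer = Tokenizer(BPE(vocab=vocab, merges=[]))
tokenizer.add_special_tokens(["<s>"])
tokenizer.pre_tokenizer = pre_tokenizers.PreTokenizer.custom(NoOpPreTokenizer())

# Special tokens are extracted first; the remainder is tokenized character
# by character by the merge-free BPE model, so no prefix ever appears:
print(tokenizer.encode("<s><s><s><s><s>aaaaa").tokens)
# ['<s>', '<s>', '<s>', '<s>', '<s>', 'a', 'a', 'a', 'a', 'a']
```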