neurothew opened this issue 2 days ago
ArthurZucker commented:

Hey! If you look at the tokens being generated:

```python
In [5]: this_tokenizer.tokenize(this_sent)
Out[5]: ['▁I', '▁am', '▁going', '▁to', '▁school', '▁by', '▁bus', '.']
```

you might notice that the `.` is "forcefully" separated from the word `bus`. The word ids should explain which word out of the initial ones each produced token belongs to. Words are separated by `▁`. Thus `.` belongs to the 6th word 😉 It depends on the tokenizer because sometimes, the `.` is separated by the model itself and not the pre_tokenizer.
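The boundary rule described above can be sketched with a toy function (this only illustrates how SentencePiece-style `▁` markers imply word boundaries; `toy_word_ids` is a made-up name, not the actual `tokenizers` implementation):

```python
def toy_word_ids(tokens):
    """Toy illustration of the word-id logic described above.

    A token starting with the SentencePiece marker '▁' opens a new word;
    any other token (like the split-off '.') is attached to the word
    before it. This mirrors the behaviour described in the thread, not
    the real tokenizers internals.
    """
    word_ids = []
    current = -1  # becomes 0 on the first '▁'-prefixed token
    for token in tokens:
        if token.startswith("▁"):
            current += 1
        word_ids.append(current)
    return word_ids

tokens = ['▁I', '▁am', '▁going', '▁to', '▁school', '▁by', '▁bus', '.']
print(toy_word_ids(tokens))  # [0, 1, 2, 3, 4, 5, 6, 6] -> '.' shares word 6 with '▁bus'
```

Note how `.` gets the same word id as `▁bus` because no `▁` marker precedes it.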
Hi @ArthurZucker, thanks for the explanation! I am encountering this problem because I am trying to find a way to identify the mapping between token ids and my target word within a sentence.

Can I understand that, for all tokenizers, `word_ids()` refers to "words" that are separated by a space? And special cases like `.` or `'s` might be handled differently depending on the tokenizer?
Yeah, I also think that there are some special options left to the user, like `clean_up_tokenization_spaces` (which is supposed to remove some spaces when decoding, specifically for this case!).
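As a rough sketch of what that cleanup step does (a hand-rolled approximation of the space-stripping behaviour, not the actual transformers code; `cleanup_spaces` is a hypothetical helper):

```python
def cleanup_spaces(text):
    """Toy approximation of clean-up-after-decoding: remove the space
    that tokenization inserted before punctuation and contractions."""
    replacements = [
        (" .", "."), (" ,", ","), (" !", "!"), (" ?", "?"),
        (" 's", "'s"), (" n't", "n't"), (" 're", "'re"),
    ]
    for old, new in replacements:
        text = text.replace(old, new)
    return text

print(cleanup_spaces("I am going to school by bus ."))  # I am going to school by bus.
```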
System Info

`transformers` version: 4.45.1

Who can help?

@ArthurZucker @itaza

Information

Tasks

- `examples` folder (such as GLUE/SQuAD, ...)

Reproduction

I have been trying to use `word_ids()` to reveal what's being recognized as a word by the tokenizer. I am confused about the output as I am loading the tokenizer from `microsoft/deberta-v3-base`. See the following example and its output. Then I use `decode()` to see the 6th word.

So, the full stop following "bus" is regarded as the same "word" by `word_ids()`. Is this an expected behaviour? And what's the rationale behind it? I checked the same example with `bert-base-uncased` and `FacebookAI/roberta-base`, but there's no similar issue (the full stop is recognized as the 7th word by `word_ids()` from those two models).

Expected behavior

See above.
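The difference between the two behaviours reported above can be mimicked with two toy pre-tokenizers (purely illustrative; `metaspace_style_words` and `punctuation_style_words` are made-up names, and the real models use their own pre-tokenization pipelines):

```python
import re

def metaspace_style_words(sentence):
    """DeBERTa-style (SentencePiece) pre-tokenization only splits on
    whitespace, so 'bus.' stays one pre-token and the later split-off
    '.' shares the word id of 'bus'."""
    return sentence.split()

def punctuation_style_words(sentence):
    """BERT/RoBERTa-style pre-tokenization also splits punctuation off,
    so '.' becomes its own word (index 7 in this sentence)."""
    return re.findall(r"\w+|[^\w\s]", sentence)

sent = "I am going to school by bus."
print(metaspace_style_words(sent))    # ['I', 'am', 'going', 'to', 'school', 'by', 'bus.']
print(punctuation_style_words(sent))  # ['I', 'am', 'going', 'to', 'school', 'by', 'bus', '.']
```

Under the first scheme the sentence has 7 words and `.` belongs to word 6; under the second it has 8 words and `.` is word 7, matching the discrepancy observed between the models.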