huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0
133.93k stars 26.79k forks

Confusion about the words returned by `word_ids()` in `deberta-v3-base` #34309

Open neurothew opened 2 days ago

neurothew commented 2 days ago

System Info

Who can help?

@ArthurZucker @itaza

Information

Tasks

Reproduction

I have been trying to use word_ids() to reveal what's being recognized as a word by the tokenizer.

I am confused by the output when loading the tokenizer from microsoft/deberta-v3-base. See the following example.

from transformers import AutoTokenizer

this_tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
this_sent = "I am going to school by bus."
this_encode = this_tokenizer.encode_plus(this_sent)

this_encode.word_ids()

which gives

[None, 0, 1, 2, 3, 4, 5, 6, 6, None]

Then using decode() to inspect the tokens mapped to word id 6:

this_tokenizer.decode(this_encode.input_ids[7:9])

gives

bus.

So, the full stop following "bus" is regarded as part of the same "word" by word_ids(). Is this expected behaviour? And what is the rationale behind it?

I checked the same example with bert-base-uncased and FacebookAI/roberta-base, and there is no similar issue (the full stop is assigned its own word id, 7, by word_ids() for those two models).
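To make the difference between the two outputs concrete, here is a minimal offline sketch (a hypothetical helper, not a transformers API) that groups token positions by word id, using the word_ids() lists reported in this issue:

```python
# word_ids() outputs reported in this issue (assumed data, not re-run here)
deberta_word_ids = [None, 0, 1, 2, 3, 4, 5, 6, 6, None]     # deberta-v3-base
bert_like_word_ids = [None, 0, 1, 2, 3, 4, 5, 6, 7, None]   # bert/roberta style

def tokens_per_word(word_ids):
    """Group token positions by the word id they map to (None = special token)."""
    groups = {}
    for pos, wid in enumerate(word_ids):
        if wid is not None:
            groups.setdefault(wid, []).append(pos)
    return groups

print(tokens_per_word(deberta_word_ids)[6])    # [7, 8] -> "bus" and "."
print(tokens_per_word(bert_like_word_ids)[6])  # [7]    -> "bus" only
```

Under deberta-v3-base, word 6 spans two token positions; under the BERT/RoBERTa-style output, the full stop gets its own word id 7.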

Expected behavior

See above.

ArthurZucker commented 2 days ago

Hey! If you look at the tokens being generated:

In [5]: this_tokenizer.tokenize(this_sent)
Out[5]: ['▁I', '▁am', '▁going', '▁to', '▁school', '▁by', '▁bus', '.']

you might notice that the `.` is "forcefully" separated from the word `bus`. The word ids indicate which of the initial words each produced token belongs to. Words are separated by ` ` (whitespace), thus `.` belongs to the 6th word 😉 It depends on the tokenizer because sometimes the `.` is separated by the model itself and not the pre_tokenizer.
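(The whitespace rule above can be sketched offline. In the tokenized output, `▁` marks tokens that started a new whitespace-separated word; a token without it, like `.`, attaches to the preceding word. This is a simplified illustration of that rule, not the actual fast-tokenizer implementation:)

```python
def naive_word_ids(tokens):
    """Simplified word-id assignment: a token starting with the Metaspace
    marker '▁' begins a new word; any other token joins the previous word."""
    wid = -1
    ids = []
    for tok in tokens:
        if tok.startswith("▁"):
            wid += 1
        ids.append(wid)
    return ids

toks = ['▁I', '▁am', '▁going', '▁to', '▁school', '▁by', '▁bus', '.']
print(naive_word_ids(toks))  # [0, 1, 2, 3, 4, 5, 6, 6]
```

Since `.` carries no `▁`, it inherits word id 6, matching the word_ids() output in the issue.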

neurothew commented 1 day ago

> Hey! If you look at the tokens being generated:
>
> In [5]: this_tokenizer.tokenize(this_sent)
> Out[5]: ['▁I', '▁am', '▁going', '▁to', '▁school', '▁by', '▁bus', '.']
>
> you might notice that the `.` is "forcefully" separated from the word `bus`. The word ids indicate which of the initial words each produced token belongs to. Words are separated by ` ` (whitespace), thus `.` belongs to the 6th word 😉 It depends on the tokenizer because sometimes the `.` is separated by the model itself and not the pre_tokenizer.

Hi @ArthurZucker , thanks for the explanation! I am encountering this problem because I am trying to find a way to identify the mapping between token ids and my target word within a sentence.

Am I right in understanding that, for all tokenizers, word_ids() refers to "words" that are separated by a space, and that some special cases like `.` or `'s` may be handled differently depending on the tokenizer?

ArthurZucker commented 10 hours ago

Yeah, I also think that there are some options left to the user, like clean_up_tokenization_spaces (which is supposed to remove some spaces when decoding, specifically for this case!)
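For illustration, here is a rough offline mimic (not the actual transformers code) of the kind of cleanup that clean_up_tokenization_spaces performs when decoding: deleting the space the decoder inserts before punctuation and common English contractions.

```python
def cleanup_spaces(text):
    """Rough sketch of decode-time cleanup: drop the space before
    punctuation and contractions. The real rule set lives in transformers
    and may differ from this approximation."""
    for before, after in [(" .", "."), (" ,", ","), (" !", "!"),
                          (" ?", "?"), (" 's", "'s"), (" n't", "n't")]:
        text = text.replace(before, after)
    return text

print(cleanup_spaces("I am going to school by bus ."))
# I am going to school by bus.
```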