huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0
133.93k stars 26.79k forks

Confusion about the words returned by `word_ids()` in `deberta-v3-base` #34309

Open neurothew opened 2 days ago

neurothew commented 2 days ago

System Info

Who can help?

@ArthurZucker @itaza

Information

Tasks

Reproduction

I have been trying to use word_ids() to reveal what's being recognized as a word by the tokenizer.

I am confused by the output when loading the tokenizer from microsoft/deberta-v3-base. See the following example.

from transformers import AutoTokenizer

this_tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
this_sent = "I am going to school by bus."
this_encode = this_tokenizer.encode_plus(this_sent)

this_encode.word_ids()

which gives

[None, 0, 1, 2, 3, 4, 5, 6, 6, None]

Then using decode() to inspect the tokens mapped to word id 6:

this_tokenizer.decode(this_encode.input_ids[7:9])

gives

bus.

So, the full stop following "bus" is regarded as part of the same "word" by word_ids(). Is this expected behaviour? And what is the rationale behind it?

I checked the same example with bert-base-uncased and FacebookAI/roberta-base, and there is no similar issue (the full stop is assigned its own word id, 7, by word_ids() for those two models).
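To make the difference between the two outputs concrete, here is a minimal offline sketch (a hypothetical helper, not a transformers API) that groups token positions by word id, using the word_ids() lists reported in this issue:

```python
# word_ids() outputs reported in this issue (assumed data, not re-run here)
deberta_word_ids = [None, 0, 1, 2, 3, 4, 5, 6, 6, None]     # deberta-v3-base
bert_like_word_ids = [None, 0, 1, 2, 3, 4, 5, 6, 7, None]   # bert/roberta style

def tokens_per_word(word_ids):
    """Group token positions by the word id they map to (None = special token)."""
    groups = {}
    for pos, wid in enumerate(word_ids):
        if wid is not None:
            groups.setdefault(wid, []).append(pos)
    return groups

print(tokens_per_word(deberta_word_ids)[6])    # [7, 8] -> "bus" and "."
print(tokens_per_word(bert_like_word_ids)[6])  # [7]    -> "bus" only
```

Under deberta-v3-base, word 6 spans two token positions; under the BERT/RoBERTa-style output, the full stop gets its own word id 7.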

Expected behavior

See above.

ArthurZucker commented 2 days ago

Hey! If you look at the tokens being generated:

In [5]: this_tokenizer.tokenize(this_sent)
Out[5]: ['▁I', '▁am', '▁going', '▁to', '▁school', '▁by', '▁bus', '.']

you might notice that the `.` is "forcefully" separated from the word `bus`. The word ids indicate which of the initial words each produced token belongs to. Words are separated by ` ` (whitespace), thus `.` belongs to the 6th word 😉 It depends on the tokenizer because sometimes the `.` is separated by the model itself and not the pre_tokenizer.
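(The whitespace rule above can be sketched offline. In the tokenized output, `▁` marks tokens that started a new whitespace-separated word; a token without it, like `.`, attaches to the preceding word. This is a simplified illustration of that rule, not the actual fast-tokenizer implementation:)

```python
def naive_word_ids(tokens):
    """Simplified word-id assignment: a token starting with the Metaspace
    marker '▁' begins a new word; any other token joins the previous word."""
    wid = -1
    ids = []
    for tok in tokens:
        if tok.startswith("▁"):
            wid += 1
        ids.append(wid)
    return ids

toks = ['▁I', '▁am', '▁going', '▁to', '▁school', '▁by', '▁bus', '.']
print(naive_word_ids(toks))  # [0, 1, 2, 3, 4, 5, 6, 6]
```

Since `.` carries no `▁`, it inherits word id 6, matching the word_ids() output in the issue.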

neurothew commented 1 day ago

> Hey! If you look at the tokens being generated:
>
> In [5]: this_tokenizer.tokenize(this_sent)
> Out[5]: ['▁I', '▁am', '▁going', '▁to', '▁school', '▁by', '▁bus', '.']
>
> you might notice that the `.` is "forcefully" separated from the word `bus`. The word ids indicate which of the initial words each produced token belongs to. Words are separated by ` ` (whitespace), thus `.` belongs to the 6th word 😉 It depends on the tokenizer because sometimes the `.` is separated by the model itself and not the pre_tokenizer.

Hi @ArthurZucker , thanks for the explanation! I am encountering this problem because I am trying to find a way to identify the mapping between token ids and my target word within a sentence.

Am I right in understanding that, for all tokenizers, word_ids() refers to "words" that are separated by a space, and that some special cases like `.` or `'s` may be handled differently depending on the tokenizer?

ArthurZucker commented 10 hours ago

Yeah, I also think that there are some options left to the user, like clean_up_tokenization_spaces (which is supposed to remove some spaces when decoding, specifically for this case!)
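For illustration, here is a rough offline mimic (not the actual transformers code) of the kind of cleanup that clean_up_tokenization_spaces performs when decoding: deleting the space the decoder inserts before punctuation and common English contractions.

```python
def cleanup_spaces(text):
    """Rough sketch of decode-time cleanup: drop the space before
    punctuation and contractions. The real rule set lives in transformers
    and may differ from this approximation."""
    for before, after in [(" .", "."), (" ,", ","), (" !", "!"),
                          (" ?", "?"), (" 's", "'s"), (" n't", "n't")]:
        text = text.replace(before, after)
    return text

print(cleanup_spaces("I am going to school by bus ."))
# I am going to school by bus.
```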