huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

word_ids() returned by RoBERTa Tokenizer behaves inconsistently for alphanumeric tokens like '18th' #12665

Closed hos-arafat closed 3 years ago

hos-arafat commented 3 years ago


Information

Model I am using is RoBERTa.

The problem arises when using:

A simple script that uses RoBERTa to do NER.

The task I am working on is:

I am doing Named Entity Recognition (NER) on the conll2003 dataset from the datasets library.

As such, I am using RoBERTa + a classification head on top to classify each token in the sequence.

Moreover, when the RoBERTa tokenizer splits a word into several sub-tokens, I pass the entire sentence through RoBERTa and then, using the word_ids returned by Tokenizer.batch_encode_plus, pass only the contextual embeddings associated with the first sub-token of each word into my final classification head (otherwise len(prediction) > len(label)).

Detailed code of this can be found in the final Section below.

The Problem

The problem is with the word_ids() returned by batch_encode_plus() for sentences that have alphanumeric tokens like '18th' or '1980s'. For such sentences, the word_ids() come back as follows:

['During', 'Ġthe', 'Ġ1980', 's', 'Ġ,', 'Ġlife', 'Ġwas', 'Ġweird'] # No 'Ġ' before 's', as expected, but
word_ids = [None, 0, 1, 2, 3, 4, 5, 6, 7, None] # This causes a problem! I expect it to be
word_ids = [None, 0, 1, 2, 2....

['An', 'Ġ18', 'th', 'Ġcentury', 'Ġpoet'] # No 'Ġ' before 'th', as expected, but
word_ids = [None, 0, 1, 2, 3, 4, None, None, None, None] # This causes a problem! I expect it to be
word_ids = [None, 0, 1, 1....

Notice that the token '1980s' was split into ['Ġ1980', 's'], but the word_ids did NOT indicate this: the returned [None, 0, 1, 2, 3, 4, 5, 6, 7, None] treats the sub-token 's' as its own word (and NOT as a sub-token of the word '1980s').

To reproduce

Steps to reproduce the behavior:

  1. Import and initialize the RoBERTa tokenizer (fast):

     from transformers import AutoTokenizer

     tokenizer = AutoTokenizer.from_pretrained("roberta-base", use_fast=True)

  2. `batch_encode_plus` sentences that have alphanumeric tokens like `'18th'` and `'1980s'`:

     sentences = ["During the 1980s , life was something else", "An 18th century poet"]

     e = tokenizer.batch_encode_plus(sentences, return_tensors='pt', padding=True)

  3. Print and inspect the word_ids(i):

     print(tokenizer.tokenize(sentences[0]))
     print(e.word_ids(0))

     print(tokenizer.tokenize(sentences[1]))
     print(e.word_ids(1))

Expected behavior

The word_ids should correctly indicate whenever tokens such as '1980s' and '18th' are split:

['<s>',  'An',  'Ġ18',  'th',  'Ġcentury',  'Ġpoet',  '</s>']
[None,    0,      1,     1,        2,           3,      None]

Detailed Code


import torch
from transformers import AutoModel, AutoTokenizer

input_sentence = ["He lives joyfully"]
label          = ["O",  "O",   "O"]

tokenizer = AutoTokenizer.from_pretrained("roberta-base", use_fast=True)
model = AutoModel.from_pretrained("roberta-base")

encoded_x = tokenizer.batch_encode_plus(input_sentence, return_tensors='pt', padding=True)
# The input sentence now becomes ["<s>", "ĠHe", "Ġlives", "Ġjoy", "fully", "</s>"]

contextual_embeddings = model(encoded_x.input_ids).last_hidden_state  # [1, 6, 768] tensor.

# I need to pass a [1, 3, 768] tensor into my final classification head
# So, I wrote a function that takes as input the word_ids
# and returns a list of the first sub-token of each word (dropping <s> and </s>)
# Function NOT included here for brevity. Same function works perfectly for BERT
my_function( [None, 0, 1, 2, 2, None] ) -> [0, 1, 2]

first_subtoken = torch.LongTensor([0, 1, 2])

embeddings_of_interest = contextual_embeddings[:, first_subtoken, :]  #  [1, 3, 768] tensor
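
For reference, a minimal sketch of what such a selection function might look like (an illustrative assumption, not the function actually used above; note that it returns positions in the full encoded sequence, including the special tokens, so for this example it yields [1, 2, 3] rather than [0, 1, 2]):

```python
# Illustrative sketch only -- not the issue author's actual my_function.
# Return the position of the first sub-token of every word, skipping
# special tokens (whose word_id is None: <s>, </s>, <pad>).
def first_subtoken_indices(word_ids):
    indices, prev = [], None
    for pos, word_id in enumerate(word_ids):
        if word_id is None:
            continue                      # special token
        if word_id != prev:               # first sub-token of a new word
            indices.append(pos)
            prev = word_id
    return indices

# For ["<s>", "ĠHe", "Ġlives", "Ġjoy", "fully", "</s>"]:
print(first_subtoken_indices([None, 0, 1, 2, 2, None]))  # -> [1, 2, 3]
```
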
LysandreJik commented 3 years ago

Thanks for the very helpful reproducer! @n1t0, @SaulLu, could you take a look? Thank you!

SaulLu commented 3 years ago

Thank you for providing a code snippet, @hos-arafat!

If I understand your request correctly, you would like to retrieve the index of the word to which each token belongs.

If this is your request, you have two ways of doing it (@n1t0, don't hesitate to correct me):

  1. By letting your tokenizer automatically guess what a word is. This is the option you use in the example you showed: the tokenizer's pre-tokenization component defines what a word is. On your example, you can see this breakdown by doing:
    sentences = ["During the 1980s , life was something else", "An 18th century poet"]
    tokenizer = AutoTokenizer.from_pretrained("roberta-base", use_fast=True)
    print(tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(sentences[0]))
    print(tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(sentences[1]))

    And you will have as output:

    [('During', (0, 6)),
     ('Ġthe', (6, 10)),
     ('Ġ1980', (10, 15)),
     ('s', (15, 16)),
     ('Ġ,', (16, 18)),
     ('Ġlife', (18, 23)),
     ('Ġwas', (23, 27)),
     ('Ġsomething', (27, 37)),
     ('Ġelse', (37, 42))]
    [('An', (0, 2)),
     ('Ġ18', (2, 5)),
     ('th', (5, 7)),
     ('Ġcentury', (7, 15)),
     ('Ġpoet', (15, 20))]

Indeed, there you can see that the ByteLevel pre-tokenization separates the numeric characters from the others.

  2. By specifying, before tokenization, which tokens must belong to the same word. If the separation proposed by the pre-tokenizer does not suit you, you can specify the list of "words" you want yourself by giving the tokenizer a list of words instead of a sentence. The only constraint with the tokenizer you use is that you must set the add_prefix_space argument to True. On your example, if you want to consider that words are separated by spaces, you could do:
    
    sentences_splited_into_words = [sentence.split(" ") for sentence in sentences]
    tokenizer = AutoTokenizer.from_pretrained("roberta-base", use_fast=True, add_prefix_space=True)

    e = tokenizer.batch_encode_plus(sentences_splited_into_words, return_tensors="pt", padding=True, is_split_into_words=True)

    print(e.tokens(0))
    print(e.word_ids(0))

    print(e.tokens(1))
    print(e.word_ids(1))

Output:

['<s>', 'ĠDuring', 'Ġthe', 'Ġ1980', 's', 'Ġ,', 'Ġlife', 'Ġwas', 'Ġsomething', 'Ġelse', '</s>']
[None, 0, 1, 2, 2, 3, 4, 5, 6, 7, None]

['<s>', 'ĠAn', 'Ġ18', 'th', 'Ġcentury', 'Ġpoet', '</s>', '<pad>', '<pad>', '<pad>', '<pad>']
[None, 0, 1, 1, 2, 3, None, None, None, None, None]



I hope this answers your question and if it doesn't, don't hesitate to tell me! :smile: 
hos-arafat commented 3 years ago

Apologies for the late response, I had to study and sit for an exam yesterday (aced it!). Thank you for the quick response, and glad the reproducer was helpful! @LysandreJik @SaulLu

That's exactly right, @SaulLu: I am interested in retrieving, for every sub-token, the index of the "full" word it belongs to. For example:

['An', 'Ġ18', 'th', 'Ġcentury', 'Ġpoet']  # the tokenizer splits '18th' into '18' and 'th', so len = 5

# This sentence has the labels:
['O', 'O', 'O', 'O'] # len = 4

# Using the word_ids(), I get the index of the first sub-token of each word
# and create the following list:
['An', 'Ġ18', 'Ġcentury', 'Ġpoet']  # I DROP the sub-token 'th', so len = label_len = 4

# When the word_ids() are incorrect (do NOT tell me which tokens were split),
# I end up doing loss(predictions, labels),
# which throws an error because len(predictions) > len(labels)

Thank you for the solutions you offered! They are both helpful. I can do one of two things:

  1. Instead of using the word_ids(), I can use the (piece, offset) tuples returned by pre_tokenize_str() to figure out which words were split into several sub-tokens and only take the first sub-token (a rough sketch of this idea is given after this list).

  2. Since the word_ids() are returned correctly when I pre-split the sentences, I can keep using them: split my sentences on whitespace with split() and pass is_split_into_words=True to batch_encode_plus().
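
To illustrate option 1, here is a rough sketch that groups sub-tokens into whitespace-delimited words using the character offsets from the fast tokenizer (via return_offsets_mapping=True) instead of pre_tokenize_str; the helper name and the exact printed positions are illustrative assumptions, not code from this thread:

```python
# Rough sketch of option 1 (illustrative only, not code from this thread):
# group sub-tokens into whitespace-delimited words via character offsets,
# then keep only the position of the first sub-token of each word.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base", use_fast=True)

def first_subtoken_positions(sentence, tokenizer):
    # Character span of every whitespace-delimited word in the sentence.
    word_spans, search_from = [], 0
    for word in sentence.split():
        start = sentence.index(word, search_from)
        word_spans.append((start, start + len(word)))
        search_from = start + len(word)

    enc = tokenizer(sentence, return_offsets_mapping=True)
    seen, positions = set(), []
    for pos, (tok_start, tok_end) in enumerate(enc["offset_mapping"]):
        if tok_start == tok_end:
            continue  # special tokens such as <s> and </s>
        # The word whose span contains the token's last character.
        word_idx = next(i for i, (s, e) in enumerate(word_spans) if s < tok_end <= e)
        if word_idx not in seen:  # first sub-token of this word
            seen.add(word_idx)
            positions.append(pos)
    return positions

# 'An', 'Ġ18', 'Ġcentury', 'Ġpoet' should map to positions 1, 2, 4, 5 here
print(first_subtoken_positions("An 18th century poet", tokenizer))  # expected: [1, 2, 4, 5]
```

In practice, option 2 (pre-splitting on whitespace and passing is_split_into_words=True) is simpler, since the tokenizer then computes the same word grouping for you, as shown in the example above.
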

I am wondering why the word_ids() are returned incorrectly as highlighted in the reproducer, though. I will try to investigate the GPT2Tokenizer class and tokenize() and see if I can spot something and contribute a fix! Would love to give back to this awesome library!

Thanks again for your help!

SaulLu commented 3 years ago

Glad it helped :hugs: and great that your exam went well!

> I am wondering why the word_ids() are returned incorrectly as highlighted in the reproducer, though. I will try to investigate the GPT2Tokenizer class and tokenize() and see if I can spot something and contribute a fix! Would love to give back to this awesome library!

That is really nice of you! Personally, I think the word_ids tokenizer method behaves in the desired way. However, I think we could be more specific in documenting the word_ids method in the :hugs: transformers library, so that it gives as much information about the role of the pre-tokenizer as the underlying function in the :hugs: tokenizers library, which is documented here. Would you like to propose a reformulation of the documentation in the transformers library? :slightly_smiling_face:

In order to make it easier to read my answer, I put a copy of the two documentations below.

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.