huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

word_ids() returned by RoBERTa Tokenizer behaves inconsistently for alphanumeric tokens like '18th' #12665

Closed hos-arafat closed 3 years ago

hos-arafat commented 3 years ago


Information

Model I am using is RoBERTa.

The problem arises when using:

A simple script that uses RoBERTa to do NER.

The task I am working on is:

I am doing Named Entity Recognition (NER) on the conll2003 dataset from the datasets library.

As such, I am using RoBERTa + a classification head on top to classify each token in the sequence.

Moreover, when the RoBERTa tokenizer splits a word into several sub-tokens, I pass the entire sentence through RoBERTa and then, using the word_ids returned by Tokenizer.batch_encode_plus, pass only the contextual embeddings associated with the first sub-token of each word into my final classification head (otherwise len(prediction) > len(label)).

Detailed code of this can be found in the final Section below.

The Problem

The problem is with the word_ids() returned by batch_encode_plus() for sentences that have alphanumeric tokens like '18th' or '1980s'. For such sentences, the word_ids() come back as follows:

['During', 'Ġthe', 'Ġ1980', 's', 'Ġ,', 'Ġlife', 'Ġwas', 'Ġweird'] # No 'Ġ' before 's', as expected, but
word_ids = [None, 0, 1, 2, 3, 4, 5, 6, 7, None] # This causes a problem! I expect it to be
word_ids = [None, 0, 1, 2, 2....

['An', 'Ġ18', 'th', 'Ġcentury', 'Ġpoet'] # No 'Ġ' before 'th', as expected, but
word_ids = [None, 0, 1, 2, 3, 4, None, None, None, None] # This causes a problem! I expect it to be
word_ids = [None, 0, 1, 1....

Notice that the token '1980s' was split into ['Ġ1980', 's'], but the word_ids did NOT indicate this: the returned [None, 0, 1, 2, 3, 4, 5, 6, 7, None] treats the sub-token 's' as its own word (and NOT as a sub-token of the word '1980s').

To reproduce

Steps to reproduce the behavior:

  1. Import and initialize the RoBERTa tokenizer (fast):

     from transformers import AutoTokenizer

     tokenizer = AutoTokenizer.from_pretrained("roberta-base", use_fast=True)

  2. `batch_encode_plus` sentences that have alphanumeric tokens like `'18th'` and `'1980s'`:

     sentences = ["During the 1980s , life was something else", "An 18th century poet"]

     e = tokenizer.batch_encode_plus(sentences, return_tensors='pt', padding=True)

  3. Print and inspect the word_ids(i):

     print(tokenizer.tokenize(sentences[0]))
     print(e.word_ids(0))

     print(tokenizer.tokenize(sentences[1]))
     print(e.word_ids(1))

Expected behavior

The word_ids should correctly indicate whenever tokens such as '1980s' and '18th' are split:

['<s>',  'An',  'Ġ18',  'th',  'Ġcentury',  'Ġpoet',  '</s>']
[None,    0,      1,     1,        2,           3,      None]

Detailed Code


import torch
from transformers import AutoModel, AutoTokenizer

input_sentence = ["He lives joyfully"]
label          = ["O",  "O",   "O"]

tokenizer = AutoTokenizer.from_pretrained("roberta-base", use_fast=True)
model = AutoModel.from_pretrained("roberta-base")

encoded_x = tokenizer.batch_encode_plus(input_sentence, return_tensors='pt', padding=True)
# The input sentence now becomes ["<s>", "ĠHe", "Ġlives", "Ġjoy", "fully", "</s>"]

contextual_embeddings = model(encoded_x.input_ids).last_hidden_state  # [1, 6, 768] tensor.

# I need to pass a [1, 3, 768] tensor into my final classification head
# So, I wrote a function that takes as input the word_ids
# and returns a list of the first sub-token of each word (dropping <s> and </s>)
# Function NOT included here for brevity. Same function works perfectly for BERT
my_function( [None, 0, 1, 2, 2, None] ) -> [0, 1, 2]

first_subtoken = torch.LongTensor([0, 1, 2])

embeddings_of_interest = contextual_embeddings[:, first_subtoken, :]  #  [1, 3, 768] tensor
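
For reference, a minimal sketch of what such a selection function might look like (an illustrative assumption, not the function actually used above; note that it returns positions in the full encoded sequence, including the special tokens, so for this example it yields [1, 2, 3] rather than [0, 1, 2]):

```python
# Illustrative sketch only -- not the issue author's actual my_function.
# Return the position of the first sub-token of every word, skipping
# special tokens (whose word_id is None: <s>, </s>, <pad>).
def first_subtoken_indices(word_ids):
    indices, prev = [], None
    for pos, word_id in enumerate(word_ids):
        if word_id is None:
            continue                      # special token
        if word_id != prev:               # first sub-token of a new word
            indices.append(pos)
            prev = word_id
    return indices

# For ["<s>", "ĠHe", "Ġlives", "Ġjoy", "fully", "</s>"]:
print(first_subtoken_indices([None, 0, 1, 2, 2, None]))  # -> [1, 2, 3]
```
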
LysandreJik commented 3 years ago

Thanks for the very helpful reproducer! @n1t0, @SaulLu, could you take a look? Thank you!

SaulLu commented 3 years ago

Thank you for providing a code snippet, @hos-arafat!

If I understand your request correctly, you would like to retrieve the index of the word to which each token belongs.

If this is your request, you have two ways of doing it (@n1t0, don't hesitate to correct me):

  1. By letting your tokenizer automatically guess what a word is. This is the option you use in the example you showed: the tokenizer's pre-tokenization component defines what a word is. On your example, you can see this breakdown by doing:
    sentences = ["During the 1980s , life was something else", "An 18th century poet"]
    tokenizer = AutoTokenizer.from_pretrained("roberta-base", use_fast=True)
    print(tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(sentences[0]))
    print(tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(sentences[1]))

    And you will have as output:

    [('During', (0, 6)),
     ('Ġthe', (6, 10)),
     ('Ġ1980', (10, 15)),
     ('s', (15, 16)),
     ('Ġ,', (16, 18)),
     ('Ġlife', (18, 23)),
     ('Ġwas', (23, 27)),
     ('Ġsomething', (27, 37)),
     ('Ġelse', (37, 42))]
    [('An', (0, 2)),
     ('Ġ18', (2, 5)),
     ('th', (5, 7)),
     ('Ġcentury', (7, 15)),
     ('Ġpoet', (15, 20))]

Indeed, there you can see that the ByteLevel pre-tokenization separates the numeric characters from the others.

  2. By specifying, before tokenization, which tokens must belong to the same word. If the separation proposed by the pre-tokenizer does not suit you, you can specify the list of "words" you want yourself by giving the tokenizer a list of words instead of a sentence. The only constraint with the tokenizer you use is that you must set the add_prefix_space argument to True. On your example, if you want to consider that words are separated by spaces, you could do:
    
    sentences_splited_into_words = [sentence.split(" ") for sentence in sentences]
    tokenizer = AutoTokenizer.from_pretrained("roberta-base", use_fast=True, add_prefix_space=True)

    e = tokenizer.batch_encode_plus(sentences_splited_into_words, return_tensors="pt", padding=True, is_split_into_words=True)

    print(e.tokens(0))
    print(e.word_ids(0))

    print(e.tokens(1))
    print(e.word_ids(1))

Output:

['<s>', 'ĠDuring', 'Ġthe', 'Ġ1980', 's', 'Ġ,', 'Ġlife', 'Ġwas', 'Ġsomething', 'Ġelse', '</s>']
[None, 0, 1, 2, 2, 3, 4, 5, 6, 7, None]

['<s>', 'ĠAn', 'Ġ18', 'th', 'Ġcentury', 'Ġpoet', '</s>', '<pad>', '<pad>', '<pad>', '<pad>']
[None, 0, 1, 1, 2, 3, None, None, None, None, None]



I hope this answers your question and if it doesn't, don't hesitate to tell me! :smile: 
hos-arafat commented 3 years ago

Apologies for the late response, I had to study and sit for an exam yesterday (aced it!). Thank you for the quick response, and glad the reproducer was helpful! @LysandreJik @SaulLu

That's exactly right, @SaulLu: I am interested in retrieving, for every sub-token, the index of the "full" word it belongs to. For example:

['An', 'Ġ18', 'th', 'Ġcentury', 'Ġpoet']  # the tokenizer splits '18th' into '18' and 'th', so len = 5

# This sentence has the labels:
['O', 'O', 'O', 'O'] # len = 4

# Using the word_ids(), I get the index of the first sub-token of each word
# and create the following list:
['An', 'Ġ18', 'Ġcentury', 'Ġpoet']  # I DROP the sub-token 'th', so len = label_len = 4

# When the word_ids() are incorrect (do NOT tell me which tokens were split),
# I end up doing loss(predictions, labels),
# which throws an error because len(predictions) > len(labels)

Thank you for the solutions you offered! They are both helpful. I can do one of two things:

  1. Instead of using the word_ids(), I can use the (piece, offset) tuples returned by pre_tokenize_str() to figure out which words were split into several sub-tokens and only take the first sub-token (a rough sketch of this idea is given after this list).

  2. Since the word_ids() are returned correctly when I pre-split the sentences, I can keep using them: split my sentences on whitespace with split() and pass is_split_into_words=True to batch_encode_plus().
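
To illustrate option 1, here is a rough sketch that groups sub-tokens into whitespace-delimited words using the character offsets from the fast tokenizer (via return_offsets_mapping=True) instead of pre_tokenize_str; the helper name and the exact printed positions are illustrative assumptions, not code from this thread:

```python
# Rough sketch of option 1 (illustrative only, not code from this thread):
# group sub-tokens into whitespace-delimited words via character offsets,
# then keep only the position of the first sub-token of each word.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base", use_fast=True)

def first_subtoken_positions(sentence, tokenizer):
    # Character span of every whitespace-delimited word in the sentence.
    word_spans, search_from = [], 0
    for word in sentence.split():
        start = sentence.index(word, search_from)
        word_spans.append((start, start + len(word)))
        search_from = start + len(word)

    enc = tokenizer(sentence, return_offsets_mapping=True)
    seen, positions = set(), []
    for pos, (tok_start, tok_end) in enumerate(enc["offset_mapping"]):
        if tok_start == tok_end:
            continue  # special tokens such as <s> and </s>
        # The word whose span contains the token's last character.
        word_idx = next(i for i, (s, e) in enumerate(word_spans) if s < tok_end <= e)
        if word_idx not in seen:  # first sub-token of this word
            seen.add(word_idx)
            positions.append(pos)
    return positions

# 'An', 'Ġ18', 'Ġcentury', 'Ġpoet' should map to positions 1, 2, 4, 5 here
print(first_subtoken_positions("An 18th century poet", tokenizer))  # expected: [1, 2, 4, 5]
```

In practice, option 2 (pre-splitting on whitespace and passing is_split_into_words=True) is simpler, since the tokenizer then computes the same word grouping for you, as shown in the example above.
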

I am wondering why the word_ids() are returned incorrectly as highlighted in the reproducer, though. I will try to investigate the GPT2Tokenizer class and tokenize() and see if I can spot something and contribute a fix! Would love to give back to this awesome library!

Thanks again for your help!

SaulLu commented 3 years ago

Glad it helped :hugs: and great that your exam went well!

> I am wondering why the word_ids() are returned incorrectly as highlighted in the reproducer, though. I will try to investigate the GPT2Tokenizer class and tokenize() and see if I can spot something and contribute a fix! Would love to give back to this awesome library!

That is really nice of you! Personally, I think the word_ids tokenizer method behaves in the desired way. However, I think we could be more specific in documenting the word_ids method in the :hugs: transformers library, so that it gives as much information about the role of the pre-tokenizer as the underlying function in the :hugs: tokenizers library, which is documented here. Would you like to propose a reformulation of the documentation in the transformers library? :slightly_smiling_face:

In order to make it easier to read my answer, I put a copy of the two documentations below.

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.