Thanks for the very helpful reproducer! @n1t0, @SaulLu, could you take a look? Thank you!
Thank you for providing a code snippet, @hos-arafat!
If I understand your request correctly, you would like to retrieve the index of the word to which each token belongs.
If this is indeed your request, you have two ways of doing this (@n1t0, don't hesitate to correct me):
from transformers import AutoTokenizer

sentences = ["During the 1980s , life was something else", "An 18th century poet"]
tokenizer = AutoTokenizer.from_pretrained("roberta-base", use_fast=True)
print(tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(sentences[0]))
print(tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(sentences[1]))
And you will have as output:
[('During', (0, 6)),
 ('Ġthe', (6, 10)),
 ('Ġ1980', (10, 15)),
 ('s', (15, 16)),
 ('Ġ,', (16, 18)),
 ('Ġlife', (18, 23)),
 ('Ġwas', (23, 27)),
 ('Ġsomething', (27, 37)),
 ('Ġelse', (37, 42))]
[('An', (0, 2)),
 ('Ġ18', (2, 5)),
 ('th', (5, 7)),
 ('Ġcentury', (7, 15)),
 ('Ġpoet', (15, 20))]
Indeed, there you can see that the ByteLevel pre-tokenization separates the numeric characters from the others.
The second way is to split your sentences into words yourself and set the add_prefix_space argument to True. On your example, if you consider that words are separated by spaces, you could do:
sentences_splited_into_words = [sentence.split(" ") for sentence in sentences]
tokenizer = AutoTokenizer.from_pretrained("roberta-base", use_fast=True, add_prefix_space=True)
e = tokenizer.batch_encode_plus(sentences_splited_into_words, return_tensors="pt", padding=True, is_split_into_words=True)
print(e.tokens(0))
print(e.word_ids(0))
print(e.tokens(1))
print(e.word_ids(1))
Output:
['<s>', 'ĠDuring', 'Ġthe', 'Ġ1980', 's', 'Ġ,', 'Ġlife', 'Ġwas', 'Ġsomething', 'Ġelse', '</s>']
[None, 0, 1, 2, 2, 3, 4, 5, 6, 7, None]
['<s>', 'ĠAn', 'Ġ18', 'th', 'Ġcentury', 'Ġpoet', '</s>', '<pad>', '<pad>', '<pad>', '<pad>']
[None, 0, 1, 1, 2, 3, None, None, None, None, None]
I hope this answers your question and if it doesn't, don't hesitate to tell me! :smile:
Apologies for the late response, I had to study and sit for an exam yesterday (aced it!). Thank you for the quick response, and glad the reproducer was helpful! @LysandreJik @SaulLu
That's exactly right @SaulLu, I am interested in retrieving the index of every sub-token and the "full" word it belongs to. For example:
['An', 'Ġ18', 'th', 'Ġcentury', 'Ġpoet'] # the tokenizer splits '18th' into '18' and 'th', so len = 5
# This sentence will have labels:
['O', 'O', 'O', 'O'] # len = 4
# Using word_ids(), I get the index of the first sub-token of each word
# and create the following list:
['An', 'Ġ18', 'Ġcentury', 'Ġpoet'] # I DROP the sub-token 'th', so len = label_len = 4
# When word_ids() is incorrect (does NOT tell me which tokens were split),
# I end up doing loss(predictions, labels),
# which throws an error because len(predictions) > len(labels)
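For reference, a minimal sketch of this first-sub-token selection, assuming the pre-split / is_split_into_words approach shown above (variable names are illustrative, not from the thread):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base", use_fast=True, add_prefix_space=True)

words = ["An", "18th", "century", "poet"]
labels = ["O", "O", "O", "O"]  # one label per word

encoding = tokenizer(words, is_split_into_words=True)

keep_positions = []          # token positions whose predictions are kept
previous_word_id = None
for position, word_id in enumerate(encoding.word_ids()):
    if word_id is None:                # special tokens (<s>, </s>, padding)
        continue
    if word_id != previous_word_id:    # first sub-token of a new word
        keep_positions.append(position)
    previous_word_id = word_id

assert len(keep_positions) == len(labels)  # predictions and labels now line up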
Thank you for the solutions you offered! They are both helpful. I can do two things:

1. Instead of using word_ids(), I can use the tuples returned by pre_tokenize_str() to figure out which words were split into many sub-tokens and only take the first sub-token (see the sketch after this list).
2. Since the word_ids() are returned correctly when I split the string, I can keep using them, split my sentences on whitespace using split(), and add the argument is_split_into_words=True to batch_encode_plus().
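A rough sketch of the first idea, matching token offsets (via return_offsets_mapping) against whitespace-separated word spans rather than calling pre_tokenize_str directly (helper names are made up for illustration):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base", use_fast=True)
sentence = "An 18th century poet"

# Character spans of the whitespace-separated words
word_spans = []
start = 0
for word in sentence.split(" "):
    word_spans.append((start, start + len(word)))
    start += len(word) + 1  # skip the separating space

encoding = tokenizer(sentence, return_offsets_mapping=True)

first_subtoken_positions = []
seen_words = set()
for position, (tok_start, tok_end) in enumerate(encoding["offset_mapping"]):
    if tok_start == tok_end:           # special tokens have empty offsets
        continue
    # the word whose span contains the token's last character
    word_index = next(i for i, (w_start, w_end) in enumerate(word_spans)
                      if w_start <= tok_end - 1 < w_end)
    if word_index not in seen_words:   # keep only the first sub-token of each word
        seen_words.add(word_index)
        first_subtoken_positions.append(position)

print(first_subtoken_positions)  # one token position per whitespace-separated word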
I am wondering why word_ids() is returned incorrectly as I highlighted in the reproducer though. Will try to investigate the GPT2Tokenizer class and tokenize() and see if I can spot something and contribute a fix! Would love to give back to this awesome library!
Thanks again for your help!
Glad it helped :hugs: and great that your exam went well!
> I am wondering why word_ids() is returned incorrectly as I highlighted in the reproducer though. Will try to investigate the GPT2Tokenizer class and tokenize() and see if I can spot something and contribute a fix! Would love to give back to this awesome library!
That is really nice of you! Personally, I think that the word_ids tokenizer method behaves in the desired way. However, I think we could be more specific in documenting the word_ids method in the :hugs: transformers library, so that it gives as much information about the role of the pre-tokenizer as the underlying function in the :hugs: tokenizers library, which is documented here. Would you like to propose a reformulation of the documentation in the transformers library :slightly_smiling_face:?
To make my answer easier to read, I have put a copy of the two docstrings below.
word_ids
method in the :hugs: transformers:
def word_ids(self, batch_index: int = 0) -> List[Optional[int]]:
"""
Return a list mapping the tokens to their actual word in the initial sentence for a fast tokenizer.
Args:
batch_index (:obj:`int`, `optional`, defaults to 0): The index to access in the batch.
Returns:
:obj:`List[Optional[int]]`: A list indicating the word corresponding to each token. Special tokens added by
the tokenizer are mapped to :obj:`None` and other tokens are mapped to the index of their corresponding
word (several tokens will be mapped to the same word index if they are parts of that word).
"""
word_ids
method in the :hugs: tokenizers:
def word_ids(self):
"""
The generated word indices.
They represent the index of the word associated to each token.
When the input is pre-tokenized, they correspond to the ID of the given input label,
otherwise they correspond to the words indices as defined by the
:class:`~tokenizers.pre_tokenizers.PreTokenizer` that was used.
For special tokens and such (any token that was generated from something that was
not part of the input), the output is :obj:`None`
Returns:
A :obj:`List` of :obj:`Optional[int]`: A list of optional word index.
"""
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Environment info
transformers version: 4.8.2
Who can help
Information
Model I am using is RoBERTa.
The problem arises when using:
A simple script that uses RoBERTa to do NER.
The task I am working on is:
I am doing Named Entity Recognition (NER) on the conll2003 dataset from the datasets library. As such, I am using RoBERTa + a classification head on top to classify each token in the sequence.
Moreover, when the RoBERTa tokenizer splits a word into many sub-tokens, I pass the entire sentence through RoBERTa and then, using the word_ids returned by Tokenizer.batch_encode_plus, pass only the contextual embeddings associated with the first sub-token of each word into my final classification head (otherwise, len(prediction) > len(label)). Detailed code of this can be found in the final section below.
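For illustration only, the gather step could look roughly like this (a sketch under my own assumptions, not the detailed code referenced below):

import torch

# sequence_output: (seq_len, hidden) contextual embeddings for one sentence;
# keep_positions: token positions of the first sub-token of each word (from word_ids)
def gather_first_subtokens(sequence_output: torch.Tensor, keep_positions: list) -> torch.Tensor:
    index = torch.tensor(keep_positions, device=sequence_output.device)
    return sequence_output.index_select(0, index)  # (num_words, hidden), fed to the classification head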
The Problem
The problem is with the word_ids() returned by batch_encode_plus() for sentences that have alphanumeric tokens like '18th' or '1980s', where the word_ids() will be as follows:
Notice that the token '1980s' was split into ['Ġ1980', 's'] but the word_ids did NOT indicate this; what is returned is [None, 0, 1, 2, 3, 4, 5, 6, 7, None], which indicates that the sub-token 's' is its own word (and NOT a sub-token of the word '1980s').
To reproduce
Steps to reproduce the behavior:
1. Tokenize a sentence containing an alphanumeric word such as '1980s' with tokenizer = AutoTokenizer.from_pretrained("roberta-base", use_fast=True)
2. Inspect the word_ids(i) of the resulting encoding
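A minimal reproducer reconstructed from the description above (the exact sentence and calls are assumptions, not the original script):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base", use_fast=True)

encoding = tokenizer.batch_encode_plus(["During the 1980s , life was something else"])
print(encoding.tokens(0))    # '1980s' is split into 'Ġ1980' and 's'
print(encoding.word_ids(0))  # 'Ġ1980' and 's' are given different word indices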
Expected behavior
The word_ids should correctly indicate whenever tokens such as '1980s' and '18th' are split:
Detailed Code