Hi,
not a HuggingFace developer, but I came across the same problem. I think this is due to the fact that the tokenizer is truncating sequences longer than 64, so there is a mismatch in length between `tags` and `encodings`. This is also why it's fixed when you increase the `max_length`. Another reason may be that some characters in your sentences are not properly decoded because of wrong charset detection. I hope this helps.
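A minimal sketch of what that mismatch looks like (model name and the tiny `max_length` are made up for illustration):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")

words = ["HuggingFace", "is", "headquartered", "in", "NYC", "."]
tags = ["B-ORG", "O", "O", "O", "B-LOC", "O"]  # one tag per word

# Subword splitting plus [CLS]/[SEP] inflate the token count; truncation
# then cuts it back down, so tags and encodings no longer line up.
enc = tokenizer(words, is_split_into_words=True, truncation=True, max_length=4)
print(len(tags), len(enc["input_ids"]))  # 6 vs. 4
```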
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I am also facing this issue. I am using a custom dataset and haven't passed any `max_length` argument to the tokenizer.
Any idea how to fix this? The same piece of code works fine on the W-NUT dataset, though.
> Hi, not a HuggingFace developer, but I came across the same problem. I think this is due to the fact that the tokenizer is truncating sequences longer than 64, so there is a mismatch in length between `tags` and `encodings`. This is also why it's fixed when you increase the `max_length`. Another reason may be that some characters in your sentences are not properly decoded because of wrong charset detection. I hope this helps.
I observed that in the notebook shared by Hugging Face for the W-NUT dataset, the `tags` and `encodings` lengths (for each record) are not the same either, so I'm hoping that shouldn't be the issue.
@joeddav I am facing the same issue when switching to another dataset. What could be the problem? The behavior persists even with `max_length=None`.
For me, the error occurred when using the example code in combination with a SentencePiece tokenizer (e.g. XLM-RoBERTa). Switching to the updated code used in the `run_ner.py` script (https://github.com/huggingface/transformers/blob/ad072e852816cd32547504c2eb018995550b126a/examples/token-classification/run_ner.py) solved the issue for me.
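For anyone who can't switch scripts wholesale: the key difference is that the updated script aligns labels through the fast tokenizer's `word_ids()` instead of offset mappings. A simplified sketch of that approach (function and argument names are mine, not the script's):

```python
def tokenize_and_align_labels(words, word_tags, tokenizer, label2id):
    # `words`: list of word lists; `word_tags`: parallel list of tag lists.
    tokenized = tokenizer(words, is_split_into_words=True,
                          truncation=True, padding=True)
    all_labels = []
    for i, tags in enumerate(word_tags):
        word_ids = tokenized.word_ids(batch_index=i)  # token -> source word index
        labels, previous = [], None
        for word_id in word_ids:
            if word_id is None:            # [CLS], [SEP], padding
                labels.append(-100)
            elif word_id != previous:      # first sub-token of a word keeps the tag
                labels.append(label2id[tags[word_id]])
            else:                          # remaining sub-tokens are masked out
                labels.append(-100)
            previous = word_id
        all_labels.append(labels)
    tokenized["labels"] = all_labels
    return tokenized
```

Because the labels are derived from the already-truncated `word_ids()`, they stay the same length as the input ids no matter what `max_length` is.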
I figured out the problem. A typical input instance has `N` tokens and `N` NER tags with a one-to-one correspondence. When you pass the sentence to the tokenizer, it adds `k` more tokens for either (1) subword tokens (e.g. `##ing`) or (2) special model-specific tokens (e.g. `[CLS]` or `[SEP]`). So now you have `N+k` tokens but only `N` NER tags.

If you apply a max-length truncation (e.g. `64`), those `N+k` tokens get truncated to `64`, leaving an unpredictable mix of valid tokens and special tokens, because both kinds may have been truncated. However, there are still `N` NER tags, which may no longer match up against the valid tokens because some of the latter may have been cut off.
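A quick illustration of the count mismatch (my own toy example, not from the original report):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")

words = ["Unbelievably", "fast", "tokenizing"]  # N = 3 words, so N = 3 NER tags
enc = tokenizer(words, is_split_into_words=True)
print(len(words), len(enc.tokens()))  # N vs. N + k, e.g. 3 vs. 8
```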
I fixed the problem by one of several approaches, e.g. increasing the `max_length` so that all `N+k` tokens fit. However, that increase forces me to reduce my batch size due to GPU memory constraints.

This issue has been automatically marked as stale and been closed because it has not had recent activity. Thank you for your contributions.
If you think this still needs to be addressed please comment on this thread.
I solved the issue by replacing

```python
doc_enc_labels[(arr_offset[:,0] == 0) & (arr_offset[:,1] != 0)] = doc_labels
encoded_labels.append(doc_enc_labels.tolist())
```

with

```python
mask = (arr_offset[:, 0] == 0) & (arr_offset[:, 1] != 0)
doc_enc_labels[mask] = doc_labels[:np.sum(mask)]
encoded_labels.append(doc_enc_labels.tolist())
```
This way, it only maps the first `np.sum(mask)` true indices of `doc_labels`, in case of any indexing problem. I am a newbie 🤗 Transformers user, and I wonder if this solution may cause any problems.
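In context, the patched `encode_tags()` would look roughly like this (a sketch based on the old custom-datasets example; `tag2id` is the tag-to-id mapping from that example, and `encodings` must have been created with `return_offsets_mapping=True`):

```python
import numpy as np

def encode_tags(tags, encodings):
    labels = [[tag2id[tag] for tag in doc] for doc in tags]
    encoded_labels = []
    for doc_labels, doc_offset in zip(labels, encodings.offset_mapping):
        # initialize everything to -100 so sub-tokens are ignored by the loss
        doc_enc_labels = np.ones(len(doc_offset), dtype=int) * -100
        arr_offset = np.array(doc_offset)
        # tokens that start a word have offsets (0, n) with n != 0
        mask = (arr_offset[:, 0] == 0) & (arr_offset[:, 1] != 0)
        # only consume as many labels as there are surviving word-initial tokens
        doc_enc_labels[mask] = doc_labels[:np.sum(mask)]
        encoded_labels.append(doc_enc_labels.tolist())
    return encoded_labels
```

Note that this silently drops the trailing labels of a truncated document rather than raising, so entities cut off by truncation simply lose their tags.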
I have this same issue, but

```python
mask = (arr_offset[:, 0] == 0) & (arr_offset[:, 1] != 0)
doc_enc_labels[mask] = doc_labels[:np.sum(mask)]
encoded_labels.append(doc_enc_labels.tolist())
```

did not work after the first `encoded_labels` run.
Guys, if the example has issues, why even put it out there and have us chase our tails?
Hey! The example is currently being rewritten here by @stevhliu: https://github.com/huggingface/transformers/pull/13923
@LysandreJik Thanks for revisiting this problem. I feel that aligning tokens, token labels, and sub-word pieces is too complex for users of the library to implement themselves. Can you (HuggingFace) please provide some utility functions to make this task easier?
Hi @githubrandomuser2017, the examples we provide showcase exactly how to do that, for example here: https://github.com/huggingface/transformers/blob/master/examples/pytorch/token-classification/run_ner.py#L370-L404
Does this utility function help you out?
PR #13923 was merged with the new version of this example. Closing this issue, feel free to reopen/comment if the issue arises again.
@LysandreJik

> Hi @githubrandomuser2017, the examples we provide showcase exactly how to do that, for example here: https://github.com/huggingface/transformers/blob/master/examples/pytorch/token-classification/run_ner.py#L370-L404
>
> Does this utility function help you out?
I'll let other users chime in.
Even though this issue was closed, some old code still uses the old version of the `encode_tags` method. Following the example above, we found the Hugging Face page that explains the `align_labels_with_tokens` method. It is a more robust approach to handling any words that fast tokenizers unintentionally remove during normalization.
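Roughly, the function described there works like this (sketch following the Hugging Face course chapter on token classification; it assumes the usual label encoding where each B- tag has an odd id and the matching I- tag is the next even id):

```python
def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # start of a new word (or a special token)
            current_word = word_id
            new_labels.append(-100 if word_id is None else labels[word_id])
        elif word_id is None:
            # special token ([CLS], [SEP], padding)
            new_labels.append(-100)
        else:
            # same word as the previous token: turn B-XXX into I-XXX
            label = labels[word_id]
            if label % 2 == 1:
                label += 1
            new_labels.append(label)
    return new_labels
```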
Environment info
`transformers` version: 3.4

Who can help
@stefan-it, @sgugger
Information
Model I am using (Bert, XLNet ...): DistilBERT
The problem arises when using: the official example scripts.
The task I am working on is: token classification (the W-NUT Emerging Entities dataset).
To reproduce
I'm trying to run the example code Advanced Guides --> Fine-tuning with custom datasets --> Token Classification with W-NUT Emerging Entities.
Steps to reproduce the behavior:

1. Call the `tokenizer` with `max_length=64`, which is typically my "best practice" choice. Note that if I set `max_length=None`, everything runs successfully.
2. Run `encode_tags()` on the W-NUT data; I get a `ValueError`:

```
ValueError: NumPy boolean array indexing assignment cannot assign 29 input values to the 24 output values where the mask is true
```
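The error itself is easy to reproduce in isolation (a standalone NumPy illustration using the sizes from the traceback):

```python
import numpy as np

doc_enc_labels = np.full(64, -100)   # one slot per token after truncation to max_length=64
mask = np.zeros(64, dtype=bool)
mask[:24] = True                     # only 24 word-initial tokens survived truncation
doc_labels = np.arange(29)           # but the document still has 29 word-level tags

doc_enc_labels[mask] = doc_labels
# ValueError: NumPy boolean array indexing assignment cannot assign
# 29 input values to the 24 output values where the mask is true
```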