huggingface / transformers


Your example code for WNUT NER produces array indexing ValueError #7937

Closed githubrandomuser2017 closed 2 years ago

githubrandomuser2017 commented 3 years ago

## Who can help

@stefan-it, @sgugger

## Information

Model I am using (Bert, XLNet ...): DistilBERT

## To reproduce

I'm trying to run the example code Advanced Guides --> Fine-tuning with custom datasets --> Token Classification with W-NUT Emerging Entities.

Steps to reproduce the behavior:

  1. I already have a Google Colab notebook with your code.
  2. I use the tokenizer with `max_length=64`, which is typically my "best practice" choice. Note that if I set `max_length=None`, everything runs successfully.

     ```python
     max_length = 64
     encodings = tokenizer(texts, is_split_into_words=True, max_length=max_length, return_offsets_mapping=True, padding=True, truncation=True)
     ```
  3. When I run `encode_tags()` on the WNUT data, I get a `ValueError`:

     ```python
     labels = encode_tags(tags, encodings)
     ```

     ```
          11         # set labels whose first offset position is 0 and the second is not 0
     ---> 12         doc_enc_labels[(arr_offset[:,0] == 0) & (arr_offset[:,1] != 0)] = doc_labels
          13         encoded_labels.append(doc_enc_labels.tolist())
          14

     ValueError: NumPy boolean array indexing assignment cannot assign 29 input values to the 24 output values where the mask is true
     ```

## Expected behavior

I expect that `encode_tags()` should return the correct IOB tag labels when I run your `Tokenizer` with a `max_length=64`.

fra-luc commented 3 years ago

Hi, not a HuggingFace developer but I came across the same problem. I think this is due to the fact that the Tokenizer is truncating sequences longer than 64 tokens, so there is a mismatch in length between the tags and the encodings. This is also why it's fixed when you increase `max_length`. Another reason may be that some characters in your sentences are not properly decoded because of wrong charset detection. I hope this helps.
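
A minimal sketch reproducing the mismatch described above (the checkpoint name and toy data are illustrative, not from the original post):

```python
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")

words = ["word"] * 70  # 70 pre-split words -> more than 64 tokens once specials are added
tags = [0] * 70        # one tag per word

enc = tokenizer([words], is_split_into_words=True, max_length=64,
                return_offsets_mapping=True, padding=True, truncation=True)

offsets = np.array(enc["offset_mapping"][0])
# first-subword positions: character offset starts at 0 with nonzero length
mask = (offsets[:, 0] == 0) & (offsets[:, 1] != 0)
print(mask.sum(), len(tags))  # 62 vs. 70: assigning 70 labels to 62 slots raises the ValueError
```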

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

chaituValKanO commented 3 years ago

I am also facing this issue. I am using a custom dataset and haven't passed any `max_length` argument to the tokenizer.

Any idea how to fix this? The same piece of code works well on the W-NUT dataset.

chaituValKanO commented 3 years ago

> Hi, not a HuggingFace developer but I came across the same problem. I think this is due to the fact that the Tokenizer is truncating sequences longer than 64 tokens, so there is a mismatch in length between the tags and the encodings. This is also why it's fixed when you increase `max_length`. Another reason may be that some characters in your sentences are not properly decoded because of wrong charset detection. I hope this helps.

I observed that in the notebook shared by Hugging Face for the W-NUT dataset, the tags and encodings lengths (for each record) are not the same either, so I'm hoping that shouldn't be the issue.

abdallah197 commented 3 years ago

@joeddav I am facing the same issue when switching to another dataset. What could be the problem? The behavior continues even with `max_length=None`.

lwachowiak commented 3 years ago

For me, the error occurred when using the example code in combination with a SentencePiece tokenizer (e.g. XLM-RoBERTa). Switching to the updated code used in the run_ner.py script (https://github.com/huggingface/transformers/blob/ad072e852816cd32547504c2eb018995550b126a/examples/token-classification/run_ner.py) solved the issue for me.

githubrandomuser2017 commented 3 years ago

I figured out the problem. A typical input instance has N tokens and N NER tags with a one-to-one correspondence. When you pass the sentence to the tokenizer, it adds k more tokens for either (1) subword tokens (e.g. ##ing) or (2) special model-specific tokens (e.g. [CLS] or [SEP]). So now you have N+k tokens but still only N NER tags.

If you apply max-length truncation (e.g. 64), those N+k tokens get truncated to 64, leaving an unpredictable mix of valid tokens and special tokens, because tokens of both types may have been cut off. However, there are still N NER tags, which can no longer be matched up against the valid tokens because some of those tokens may have been truncated away.

I fixed the problem with one of two approaches:

  1. Removing data instances that are problematically long. For example, I removed sentences that had more than 45 tokens. Using Pandas really helped out here (see the sketch below).
  2. Increasing the truncation length to, say, 128, or whatever number is longer than any N+k. However, this increase forces me to reduce my batch size due to GPU memory constraints.
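
A minimal sketch of approach 1, assuming `texts` and `tags` are parallel lists with one word list and one tag list per sentence (the names and the 45-word cutoff follow the example above):

```python
import pandas as pd

MAX_WORDS = 45  # heuristic cutoff so that N + k tokens stays under max_length=64

df = pd.DataFrame({"words": texts, "tags": tags})
df = df[df["words"].apply(len) <= MAX_WORDS]  # drop problematically long instances

texts = df["words"].tolist()
tags = df["tags"].tolist()
```
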

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale and been closed because it has not had recent activity. Thank you for your contributions.

If you think this still needs to be addressed please comment on this thread.

tolgayan commented 3 years ago

I solved the issue by replacing

```python
doc_enc_labels[(arr_offset[:,0] == 0) & (arr_offset[:,1] != 0)] = doc_labels
encoded_labels.append(doc_enc_labels.tolist())
```

with

```python
mask = (arr_offset[:, 0] == 0) & (arr_offset[:, 1] != 0)
doc_enc_labels[mask] = doc_labels[:np.sum(mask)]
encoded_labels.append(doc_enc_labels.tolist())
```

This way, it only maps the first `np.sum(mask)` labels onto the positions where the mask is `True`, avoiding the indexing problem. I am a newbie 🤗 Transformers user, and I wonder if this solution may cause any problems.
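
For context, a sketch of the tutorial's full `encode_tags()` helper with that patch applied (it assumes a `tag2id` dict mapping tag strings to integer ids, as in the original guide):

```python
import numpy as np

def encode_tags(tags, encodings):
    # tag2id is assumed to map tag strings (e.g. "B-person") to integer ids
    labels = [[tag2id[tag] for tag in doc] for doc in tags]
    encoded_labels = []
    for doc_labels, doc_offset in zip(labels, encodings.offset_mapping):
        # initialize every position to -100 so the loss function ignores it
        doc_enc_labels = np.ones(len(doc_offset), dtype=int) * -100
        arr_offset = np.array(doc_offset)

        # first-subword positions: offset starts at 0 and has nonzero length
        mask = (arr_offset[:, 0] == 0) & (arr_offset[:, 1] != 0)
        # assign only as many labels as there are masked positions, so a
        # truncated sequence no longer triggers the boolean-indexing ValueError
        doc_enc_labels[mask] = doc_labels[:np.sum(mask)]
        encoded_labels.append(doc_enc_labels.tolist())

    return encoded_labels
```

Note the tradeoff: labels for words that fell past the truncation point are silently dropped, which is at least consistent with their tokens having been truncated away as well.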

wzkariampuzha commented 3 years ago

I have this same issue, but

```python
mask = (arr_offset[:, 0] == 0) & (arr_offset[:, 1] != 0)
doc_enc_labels[mask] = doc_labels[:np.sum(mask)]
encoded_labels.append(doc_enc_labels.tolist())
```

did not work after the first `encoded_labels` run.

omargalal20084 commented 2 years ago

Guys, if the example has issues, why even put it out there and have us chasing our tails?

LysandreJik commented 2 years ago

Hey! The example is currently being rewritten here by @stevhliu: https://github.com/huggingface/transformers/pull/13923

githubrandomuser2017 commented 2 years ago

@LysandreJik Thanks for revisiting this problem. I feel that aligning tokens, token labels, and subword pieces is too complex for users of the library to implement themselves. Can you (HuggingFace) please provide some utility functions to make this task easier?

LysandreJik commented 2 years ago

Hi @githubrandomuser2017, the examples we provide showcase exactly how to do that, for example here: https://github.com/huggingface/transformers/blob/master/examples/pytorch/token-classification/run_ner.py#L370-L404

Does this utility function help you out?
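
For readers landing here later, a condensed sketch of the alignment logic in that example (the signature and argument handling are simplified relative to the linked script):

```python
def tokenize_and_align_labels(examples, tokenizer, label_all_tokens=False):
    tokenized_inputs = tokenizer(
        examples["tokens"], truncation=True, is_split_into_words=True
    )
    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                # special tokens ([CLS], [SEP], padding) get -100 so the loss ignores them
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                # the first subword of each word carries the word's label
                label_ids.append(label[word_idx])
            else:
                # remaining subwords: repeat the label or ignore, per label_all_tokens
                label_ids.append(label[word_idx] if label_all_tokens else -100)
            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs
```

Because the alignment is driven by `word_ids()` rather than character offsets, truncation shortens the labels together with the tokens, so the length mismatch discussed above cannot occur.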

LysandreJik commented 2 years ago

PR #13923 was merged with the new version of this example. Closing this issue; feel free to reopen or comment if the issue arises again.

githubrandomuser2017 commented 2 years ago

@LysandreJik

> Hi @githubrandomuser2017, the examples we provide showcase exactly how to do that, for example here: https://github.com/huggingface/transformers/blob/master/examples/pytorch/token-classification/run_ner.py#L370-L404
>
> Does this utility function help you out?

I'll let other users chime in.

shihgianlee commented 2 months ago

Even though this issue was closed, some old code still uses the old version of the encode_tags method. From the example above, we found the Hugging Face page that explains the align_labels_with_tokens method. It is a more robust approach for handling words that fast tokenizers unintentionally remove during the normalization process.
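
For reference, a sketch of `align_labels_with_tokens` as presented in the Hugging Face course chapter on token classification; `labels` holds one integer id per word, `word_ids` comes from `tokenized_inputs.word_ids(...)` for one example, and the label scheme is assumed to be the course's, where each B- tag has an odd id and the matching I- tag is the next id:

```python
def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # start of a new word (or a special token)
            current_word = word_id
            label = -100 if word_id is None else labels[word_id]
            new_labels.append(label)
        elif word_id is None:
            # special token inside the sequence
            new_labels.append(-100)
        else:
            # same word as the previous token: turn B-XXX into I-XXX
            label = labels[word_id]
            if label % 2 == 1:
                label += 1
            new_labels.append(label)
    return new_labels
```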