huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

PyTorch Huggingface BERT-NLP for Named Entity Recognition #328

Closed AshwinAmbal closed 5 years ago

AshwinAmbal commented 5 years ago

I have been using HuggingFace's PyTorch implementation of Google's BERT on the MADE 1.0 dataset for quite some time now. Up until my last run (11 Feb), I had been using the library and getting an F-score of 0.81 on my Named Entity Recognition task by fine-tuning the model. But this week, when I ran the exact same code that had run without error earlier, it threw an error when executing this statement:

input_ids = pad_sequences([tokenizer.convert_tokens_to_ids(txt) for txt in tokenized_texts], maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")

ValueError: Token indices sequence length is longer than the specified maximum sequence length for this BERT model (632 > 512). Running this sequence through BERT will result in indexing errors

The full code is available in this colab notebook.

To get around this error, I modified the statement above into the one below, taking only the first 512 tokens of each sequence and making the necessary changes to add the index of [SEP] to the end of the truncated/padded sequence, as required by BERT.

input_ids = pad_sequences([tokenizer.convert_tokens_to_ids(txt[:512]) for txt in tokenized_texts], maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")
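A sketch of what that modification might look like in full, assuming keras's pad_sequences (as in the common BERT NER tutorials) and the same tokenizer and tokenized_texts as above; the [SEP] handling is one possible reading of the description, not the author's exact code:

from keras.preprocessing.sequence import pad_sequences

MAX_LEN = 75
sep_id = tokenizer.convert_tokens_to_ids(["[SEP]"])[0]

input_ids = pad_sequences(
    [tokenizer.convert_tokens_to_ids(txt[:512]) for txt in tokenized_texts],
    maxlen=MAX_LEN, dtype="long", truncating="post", padding="post",
)

# Re-add the [SEP] id at the end of any sequence that was cut off at MAX_LEN.
for seq in input_ids:
    if seq[-1] != 0:   # 0 is the padding value, so a non-zero last id means the row was truncated
        seq[-1] = sep_id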

The result shouldn't have changed, because I am only considering the first 512 tokens in the sequence and later truncating to 75 (MAX_LEN=75), but my F-score has dropped to 0.40 and my precision to 0.27, while the recall remains the same (0.85). I am unable to share the dataset since I have signed a confidentiality clause, but I can assure you that all the preprocessing required by BERT has been done, and all extended tokens (e.g. Johanson -> Johan ##son) have been tagged with X and replaced after prediction, as described in the BERT paper.
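For reference, a minimal sketch of the subword/label alignment scheme described above (illustrative only: the label names are hypothetical and the exact WordPiece split depends on the vocabulary):

# Continuation WordPiece pieces get the placeholder label "X", which is dropped again
# after prediction so that scores are computed on whole words only.
def align_labels(words, word_labels, tokenizer):
    tokens, labels = [], []
    for word, label in zip(words, word_labels):
        pieces = tokenizer.tokenize(word)        # e.g. "Johanson" -> ["Johan", "##son"]
        tokens.extend(pieces)
        labels.extend([label] + ["X"] * (len(pieces) - 1))
    return tokens, labels

# align_labels(["Johanson", "was", "admitted"], ["B-PER", "O", "O"], tokenizer)
# might give (["Johan", "##son", "was", "admitted"], ["B-PER", "X", "O", "O"])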

Has anyone else faced a similar issue? Can anyone elaborate on what the issue might be, or on what changes HuggingFace has made on their end recently?

AshwinAmbal commented 5 years ago

I've found a fix to get around this. Running the same code with pytorch-pretrained-bert==0.4.0 solves the issue and performance is restored to normal. Something in the new update to BertTokenizer or BertForTokenClassification is affecting the model performance. Hoping that HuggingFace clears this up soon. :) Thanks.

jplehmann commented 5 years ago

Something in the new update to BertTokenizer or BertForTokenClassification is affecting the model performance. Hoping that HuggingFace clears this up soon. :)

Sounds like the issue should remain open?

AshwinAmbal commented 5 years ago

Oh. I didn't know I closed the issue. Let me reopen it now.

Thanks.


AshwinAmbal commented 5 years ago

Sorry about that. Didn't realise I closed the issue. Reopened it now. :)

thomwolf commented 5 years ago

Seems strange that the tokenization changed.

So you were only having sequences with fewer than 512 tokens before, and now some sequences are longer?

Without having access to your dataset I can't really help you, but if you can compare the sequences in your dataset tokenized with pytorch-pretrained-bert==0.4.0 against the same sequences tokenized with the current pytorch-pretrained-bert==0.6.1 and identify one that is tokenized differently, it could help find the root of the issue.

Then maybe you can post some part of a sequence or an example that is tokenized differently without breaching your confidentiality clause?
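One way to run such a comparison (a sketch, not an official script; the file names are hypothetical) is to dump the tokenization from each environment and diff the two output files:

# Run once in an environment with pytorch-pretrained-bert==0.4.0 and once with 0.6.1,
# changing OUT_FILE to match the installed version, then diff the two outputs.
from pytorch_pretrained_bert import BertTokenizer

IN_FILE = "sentences.txt"        # hypothetical: one raw sentence per line
OUT_FILE = "tokens_0.4.0.txt"    # rename to tokens_0.6.1.txt for the other run

tokenizer = BertTokenizer.from_pretrained("bert-base-cased", do_lower_case=False)

with open(IN_FILE) as fin, open(OUT_FILE, "w") as fout:
    for line in fin:
        fout.write(" ".join(tokenizer.tokenize(line.strip())) + "\n")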

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

savindi-wijenayaka commented 5 years ago

I had the same issue when trying to use it with Flair for text classification. Can I know the root cause of this issue? Does this mean that my text part in the dataset is too long?

thomwolf commented 5 years ago

Yes, BERT only accepts inputs of 512 tokens or fewer.

Ezekiel25c17 commented 5 years ago

Seems strange that the tokenization changed.

So you were only having sequences with fewer than 512 tokens before, and now some sequences are longer?

Without having access to your dataset I can't really help you, but if you can compare the sequences in your dataset tokenized with pytorch-pretrained-bert==0.4.0 against the same sequences tokenized with the current pytorch-pretrained-bert==0.6.1 and identify one that is tokenized differently, it could help find the root of the issue.

Then maybe you can post some part of a sequence or an example that is tokenized differently without breaching your confidentiality clause?

I think I found a little bug in tokenization.py that may be related to this issue. I was facing a similar problem: using the newest version leads to a huge accuracy drop (from 88% to 22%) in a very common multi-class news title classification task. Using pytorch-pretrained-bert==0.4.0 was actually a workaround, so I compared the tokenization logs of the two versions.

The main problem was that many tokens have different ids during training and evaluation. Compared to 0.4.0, the newest version has an additional step that saves the vocabulary to output_dir/vocab.txt after training and then loads this generated vocab.txt during evaluation instead of the original one. In my case, the generated vocab.txt differs from the original because in https://github.com/huggingface/pytorch-pretrained-BERT/blob/3763f8944dc3fef8afb0c525a2ced8a04889c14f/pytorch_pretrained_bert/tokenization.py#L65 the tokenizer strips all trailing whitespace. This collapses different tokens, say a normal space and a non-breaking space, into the same empty token "". After changing this line to token = token.rstrip("\n"), I was able to reproduce the expected accuracy with the newest version.
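To illustrate the collapse described above, here is a small self-contained sketch (the one-token-per-line format mirrors BERT's vocab.txt, but the loader below is simplified, not the library's exact code):

# Stripping all whitespace collapses distinct whitespace tokens into the same vocab entry.
def load_vocab(lines, strip_all=True):
    vocab = {}
    for index, line in enumerate(lines):
        token = line.strip() if strip_all else line.rstrip("\n")
        vocab[token] = index
    return vocab

lines = ["hello\n", " \n", "\u00a0\n"]  # a word, a space token, a non-breaking-space token

print(load_vocab(lines, strip_all=True))
# {'hello': 0, '': 2}  -- the space and the non-breaking space collide as ""
print(load_vocab(lines, strip_all=False))
# {'hello': 0, ' ': 1, '\xa0': 2}  -- all three tokens keep distinct ids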

thomwolf commented 5 years ago

@Ezekiel25c17 I'm a bit surprised that trailing spaces would be important in the vocabulary, so I would like to investigate this deeper.

Can you give me the reference of the elements you were using in your tests, so I can reproduce the behavior?

Ezekiel25c17 commented 5 years ago

@thomwolf yes sure,

Maybe the point can be explained using the following example:

AshwinAmbal commented 5 years ago

@Ezekiel25c17 Shuffled indices would explain the accuracy drop. @thomwolf I had longer sequences before too, but in pytorch-pretrained-bert==0.4.0 the statement input_ids = pad_sequences([tokenizer.convert_tokens_to_ids(txt) for txt in tokenized_texts], maxlen=MAX_LEN, dtype="long", truncating="post", padding="post") did not enforce the limit strictly, whereas in 0.6.1 it threw a ValueError, which I got around by truncating the sequences to 512 before feeding them to tokenizer.convert_tokens_to_ids(txt). Either way, I was only using the first 75 tokens of each sentence (MAX_LEN=75), so it didn't matter to me. When I re-ran the same code, this was the only statement that threw an error, which is why I thought this functionality must have changed in the update.

IINemo commented 5 years ago

The issue is still there (current master and the 1.0.0 release). It looks like BertForTokenClassification has been broken since 0.4.0: with the current version, any trained model produces very low scores (dozens of percentage points lower than with 0.4.0).

IINemo commented 5 years ago

Sorry for the misleading comment. BertForTokenClassification is fine; I just did not use the proper padding label (do not use the 'O' label for padding, use a separate label, e.g. '[PAD]').

akashsara commented 5 years ago

@IINemo if you are using an attention mask, then wouldn't the label for the padding not matter at all?

IINemo commented 4 years ago

Hi,

If you use "O" as the padding label with versions of pytorch-pretrained-bert >= 0.5.0, the problem happens because the loss on padded tokens is ignored: any wrong output of the model on padded tokens will not be penalized, and the model will learn the wrong signal for the "O" label.
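A minimal PyTorch sketch of that point (illustrative only: the label ids, shapes, and "[PAD]" label are hypothetical, and this is not the library's exact loss code):

import torch
import torch.nn as nn

# Hypothetical label set: 0 = "O", 1 = "B-ENT", 2 = "[PAD]" (a separate padding label).
num_labels = 3
logits = torch.randn(2, 5, num_labels)            # (batch, seq_len, num_labels)
labels = torch.tensor([[1, 0, 0, 2, 2],           # last two positions are padding
                       [0, 1, 0, 2, 2]])
attention_mask = torch.tensor([[1, 1, 1, 0, 0],
                               [1, 1, 1, 0, 0]])

# Loss is computed only on real tokens, so whatever the model predicts on padded
# positions is never penalized.
loss_fct = nn.CrossEntropyLoss()
active = attention_mask.view(-1) == 1
loss = loss_fct(logits.view(-1, num_labels)[active],
                labels.view(-1)[active])
print(loss.item())

# If padding were labelled "O" instead and padded positions were counted during training
# or evaluation, the abundant trivial padding predictions would distort the signal for
# the real "O" label.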

The full fixed version of the code that does sequence tagging with BERT and newest version of pytorch pretrained bert is here: https://github.com/IINemo/bert_sequence_tagger

There is a class SequenceTaggerBert that works with tokenized sequences (e.g., from the nltk tokenizer) and does all the necessary preprocessing under the hood.

Best


Swty13 commented 4 years ago

Yes, BERT only accepts inputs of 512 tokens or fewer.

Hi, I want to train BERT on text longer than 512 tokens; I cannot truncate the text to 512 tokens because information would be lost. Could you please help with how I can handle this, or suggest any other way to build a customized NER model for my use case using BERT?

Thanks