kermitt2 / delft

a Deep Learning Framework for Text https://delft.readthedocs.io/
Apache License 2.0

NER with bert #21

Open byzhang opened 5 years ago

byzhang commented 5 years ago

Do you have a plan to reproduce the BERT NER model? I tried, but with BERT-base, the best micro-avg test F1 I get on CoNLL-2003 is 91.37, while the score reported in the paper is 92.4.

kermitt2 commented 5 years ago

Hello @byzhang ! Yes, I plan to reproduce BERT NER when I find the time (also FLAIR).

Did you use the fine-tuning approach or the "ELMo-like" feature-based approach they describe in section 5.4 of their paper?

In the NER evaluation of their paper, it was unclear whether or not they used the CoNLL 2003 dev section for training, which can make quite a big difference in the final f-score (but not as big as what you mention).

byzhang commented 5 years ago

I used the fine-tuning approach, and the dev set was used only for hyper-parameter tuning and early stopping.

kermitt2 commented 4 years ago

see ongoing work on PR #78

kermitt2 commented 4 years ago

The best run I could get with BERT-base-en (cased) is 91.68 on the CoNLL 2003 NER test set, tuning with the dev set and training only with the train set - but I added a CRF activation layer for fine-tuning instead of the default softmax (the CRF brings around +0.3 to the f-score). So this is about the same as you.
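For reference, the idea of the CRF head is roughly the following (a generic sketch using TensorFlow Addons, not DeLFT's actual layer; logits, labels and lengths are placeholders for the model outputs and gold data):

import tensorflow as tf
import tensorflow_addons as tfa

NUM_LABELS = 9  # e.g. the IOB2 label set of CoNLL-2003

# logits:  (batch, seq_len, NUM_LABELS) emission scores from BERT + a dense layer
# labels:  (batch, seq_len) gold label ids
# lengths: (batch,) true sequence lengths
# transition_params: trainable (NUM_LABELS, NUM_LABELS) transition matrix
def crf_negative_log_likelihood(logits, labels, lengths, transition_params):
    log_likelihood, _ = tfa.text.crf_log_likelihood(
        logits, labels, lengths, transition_params)
    return -tf.reduce_mean(log_likelihood)

# at prediction time, Viterbi decoding replaces the per-token argmax of the softmax:
# decoded_tags, _ = tfa.text.crf_decode(logits, transition_params, lengths)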

Averaged over 10 training+eval runs, this gives 91.20 - so still quite far from the reported 92.4.

As discussed there, the results reported in the paper for NER are likely token-level scores, not entity-level ones - very misleading of course.

ghaddarAbs commented 4 years ago

In order to reproduce the CoNLL score reported in the BERT paper (92.4 for bert-base and 92.8 for bert-large), one trick is to apply a truecaser on article titles (all upper case sentences) as a preprocessing step for the CoNLL train/dev/test sets. This can simply be done with the following method.

#https://github.com/daltonfury42/truecase
#pip install truecase
import truecase
import re

# original tokens
# ['FULL', 'FEES', '1.875', 'REOFFER', '99.32', 'SPREAD', '+20', 'BP']

def truecase_sentence(tokens):
    # purely alphabetic tokens, together with their positions in the sentence
    word_lst = [(w, idx) for idx, w in enumerate(tokens) if all(c.isalpha() for c in w)]
    # among those, the tokens written entirely in upper case
    lst = [w for w, _ in word_lst if re.match(r'\b[A-Z\.\-]+\b', w)]

    # only truecase sentences where every alphabetic token is upper case (e.g. article titles)
    if len(lst) and len(lst) == len(word_lst):
        parts = truecase.get_true_case(' '.join(lst)).split()

        # the truecaser has its own tokenization:
        # skip the sentence if the number of words doesn't match
        if len(parts) != len(word_lst):
            return tokens

        # write the truecased forms back at their original positions
        for (w, idx), nw in zip(word_lst, parts):
            tokens[idx] = nw

    return tokens

# truecased tokens
# ['Full', 'fees', '1.875', 'Reoffer', '99.32', 'spread', '+20', 'BP']
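Applied as a preprocessing pass, this is just run over every sentence of each split before training/evaluation, e.g. (the splits dict below is a placeholder, purely for illustration):

# placeholder data: each split is a list of sentences, each sentence a list of tokens
splits = {
    "train": [['FULL', 'FEES', '1.875', 'REOFFER', '99.32', 'SPREAD', '+20', 'BP']],
    "dev": [],
    "test": [],
}

for sentences in splits.values():
    for tokens in sentences:
        truecase_sentence(tokens)   # rewrites all-caps sentences in place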

Also, I found it useful to use: a very small learning rate (5e-6), a large batch size (128), and a high number of epochs (>40).
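For illustration, those settings in a generic tf.keras fine-tuning setup would look roughly like this (a sketch only, not my actual training script; the model and dataset objects are placeholders):

import tensorflow as tf

LEARNING_RATE = 5e-6   # very small learning rate
BATCH_SIZE = 128       # large batch size
MAX_EPOCHS = 50        # high epoch number (>40), with early stopping on the dev set

optimizer = tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE)
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True)

# model.compile(optimizer=optimizer, loss=...)
# model.fit(train_dataset.batch(BATCH_SIZE),
#           validation_data=dev_dataset.batch(BATCH_SIZE),
#           epochs=MAX_EPOCHS, callbacks=[early_stopping])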

With these configurations and preprocessing, I was able to reach 92.8 with bert-large.

kermitt2 commented 4 years ago

Hello @ghaddarAbs !

Thank you for your message and for taking the time to share your experiments on reproducing the reported BERT results.

Sorry it took me some time to come back to this.

I've tried to see the impact of the truecase pre-processing with bert-base-en (cased), keeping in mind the reported 92.4 f-score (I am using bert-base because I don't easily have the GPU capacity for bert-large). Below, I didn't touch the hyper-parameters:

|               | bert-base           | bert-base+CRF       | BidLSTM-CRF (glove) |
|---------------|---------------------|---------------------|---------------------|
| no truecase   | 90.77 (90.43-91.15) | 91.20 (90.78-91.68) | 90.75 (90.39-91.35) |
| with truecase | -                   | 91.42 (91.22-91.74) | 90.77 (90.43-91.15) |

The scores are averaged over 10 train/eval runs, with worst-best scores in parentheses. So the gain from the pre-processing alone is significant (+0.22) but not big. Apparently the truecasing has no impact on BidLSTM-CRF, but it does have an impact with BERT. I guess this is because the BERT vocabulary is case-sensitive and does not include extra casing variants beyond its 30,522 sub-tokens, while BidLSTM has a dedicated char input channel which generalizes casing very well (which also explains why adding "casing" features to the BidLSTM-CRF has zero effect).
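To make this concrete, here is a quick check of the sub-token splits (using the HuggingFace tokenizer purely for illustration, not the DeLFT pipeline): an all-caps word is usually absent from the cased WordPiece vocabulary and gets split into several pieces, while its truecased form is often a single token.

# the exact splits depend on the vocabulary, so just print and inspect them
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
for word in ["REOFFER", "Reoffer", "SPREAD", "spread"]:
    print(word, "->", tokenizer.tokenize(word))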

In terms of evaluation, I think we are no longer really comparing just NER algorithms here; we are also evaluating the truecasing tool, which is what people usually call "using external knowledge".

I've started to experiment with the hyper-parameters you indicated, but it takes a lot of time (>40 epochs with such a low learning rate is really different from the usual 3-6 max epochs typically selected with BERT; it takes days and days with 10 runs :/).

Regarding your results, may I ask you the following questions:

On my side, when I add all these "tricks", I am not very far from the reported score (but still 0.3-0.4 missing). But, from the reproducibility point of view, according to the original BERT paper they are not using any of them (thus the 91.20 f-score versus the reported 92.4). From the evaluation point of view, I must say that using these tricks makes the evaluation no longer comparable with other reported numbers, unless we add them to the other algorithms too.

ghaddarAbs commented 4 years ago

@kermitt2 ...

I used a GPU with 32 GB for these experiments.

To answer your 3 questions:

My own intuition is that the authors of BERT applied truecasing on CoNLL-2003 for NER fine-tuning. It was the only way for me to reproduce their results, but I actually don't know whether they did it or not. Of course, if truecasing is applied then the results are not comparable with previous work.

wangxinyu0922 commented 3 years ago

I trained the NER model with bert-base-cased and truecase as well, and found that it can get a 91.72 F1 score on average, but that is still far from the score reported in the BERT paper.

pinesnow72 commented 3 years ago

@ghaddarAbs, @kermitt2

I tried truecase with bert-base-cased and it gave a little improvement, but the test F1 was still limited to below 92.0. The BERT paper says that they used maximal document context for NER. That means, I think, that they used left/right sentence context when predicting the target sentence. I tried this document context and could get around 92.4 test F1.

BCWang93 commented 2 years ago

@pinesnow72 Hi, how do you use the document context with CoNLL-2003? Can you share your method? Thanks!

pinesnow72 commented 2 years ago

> @pinesnow72 Hi, how do you use the document context with CoNLL-2003? Can you share your method? Thanks!

@BCWang93 For each sentence, I added previous and next sentence tokens (sub-tokens for BERT) before and after the target sentence, respectively, to maximally fill the max-len of each sample. I put the target sentence in the middle, so roughly the same number of left and right context tokens were added. Of course, [CLS] and [SEP] were inserted at the beginning and at sentence boundaries, respectively. These context-added samples are then passed to the BERT encoder, but the output labels should be predicted only for the target sentence of each sample. To do this, before the classification layer, I implemented and added a TargetSelection layer, which takes as inputs the BERT output and the target sentence token indices, and selects only the target sentence encodings from the context-added BERT output using tf.gather(). A simplified sketch of the idea is given below.
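This is a rough reconstruction rather than the exact code (build_context_sample and TargetSelection are illustrative names, and the real implementation also handles padding, attention masks and sub-token/label alignment):

import tensorflow as tf

def build_context_sample(sentences, target_idx, max_len=512):
    """Center the target sentence and fill the remaining budget with
    neighbouring sentences of the same document (lists of sub-tokens)."""
    target = sentences[target_idx]
    budget = max_len - len(target) - 3          # reserve [CLS] and two [SEP]
    left, right = [], []
    i, j = target_idx - 1, target_idx + 1
    while budget > 0 and (i >= 0 or j < len(sentences)):
        if i >= 0:                               # extend the left context
            take = sentences[i][-budget:]
            left = take + left
            budget -= len(take)
            i -= 1
        if budget > 0 and j < len(sentences):    # extend the right context
            take = sentences[j][:budget]
            right = right + take
            budget -= len(take)
            j += 1
    tokens = ["[CLS]"] + left + ["[SEP]"] + target + ["[SEP]"] + right
    # positions of the target sentence inside the context-added sample
    target_positions = list(range(len(left) + 2, len(left) + 2 + len(target)))
    return tokens, target_positions

class TargetSelection(tf.keras.layers.Layer):
    """Keep only the target sentence encodings before the classification layer."""
    def call(self, inputs):
        sequence_output, target_positions = inputs
        # sequence_output:  (batch, max_len, hidden) BERT output
        # target_positions: (batch, target_len) int32 indices into max_len
        return tf.gather(sequence_output, target_positions, batch_dims=1)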

BCWang93 commented 2 years ago

> @ghaddarAbs, @kermitt2
>
> I tried truecase with bert-base-cased and it gave a little improvement, but the test F1 was still limited to below 92.0. The BERT paper says that they used maximal document context for NER. That means, I think, that they used left/right sentence context when predicting the target sentence. I tried this document context and could get around 92.4 test F1.

@pinesnow72 Hi, can you share some code showing how you process the data with this method? Thanks!