google-research / bert

TensorFlow code and pre-trained models for BERT
https://arxiv.org/abs/1810.04805
Apache License 2.0

In an NER task, do I need to add a CRF or just a softmax at the end of the model? #776

Open g-jing opened 5 years ago

g-jing commented 5 years ago

I know that in the original paper they use a softmax at the end of the model, but I wonder whether using a CRF would improve performance? Thanks
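For reference, here is a minimal TF 1.x sketch of the two heads being compared: a per-token softmax loss (as in the paper) versus a CRF loss via `tf.contrib.crf`. The tensor shapes, function names, and arguments are illustrative assumptions, not code from this repository.

```python
import tensorflow as tf

def softmax_loss(logits, labels, input_mask, num_labels):
    """Per-token cross-entropy. logits: [batch, seq_len, num_labels]; labels: [batch, seq_len]."""
    one_hot = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)
    log_probs = tf.nn.log_softmax(logits, axis=-1)
    per_token = -tf.reduce_sum(one_hot * log_probs, axis=-1)
    mask = tf.cast(input_mask, tf.float32)  # ignore padding positions
    return tf.reduce_sum(per_token * mask) / tf.reduce_sum(mask)

def crf_loss(logits, labels, seq_lengths):
    """CRF negative log-likelihood; the transition matrix is learned jointly."""
    log_likelihood, transition_params = tf.contrib.crf.crf_log_likelihood(
        inputs=logits, tag_indices=labels, sequence_lengths=seq_lengths)
    return tf.reduce_mean(-log_likelihood), transition_params
```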

LiangYuHai commented 5 years ago

It should be better.

g-jing commented 5 years ago

@LiangYuHai Thanks for your answer. Did you try such a comparison before?

geyingli commented 5 years ago

CRF is the best model for NER tasks, so I expect it will improve performance. I'm trying it as well.

g-jing commented 5 years ago

@geyingli I think a CRF is a good decoder for sequence tasks, but BERT is very powerful and already captures sequence information. Have you gotten good results with the CRF yet? Besides, I find that fine-tuning BERT on the NER task helps a lot.

lrs1353281004 commented 5 years ago

> @geyingli I think a CRF is a good decoder for sequence tasks, but BERT is very powerful and already captures sequence information. Have you gotten good results with the CRF yet? Besides, I find that fine-tuning BERT on the NER task helps a lot.

Hi! Could you share some details about your training, like the learning rate, batch size, or other tricks? I fine-tuned BERT on a Chinese NER dataset and didn't get better results than a traditional BiLSTM-CRF model. It would be helpful if you could share some details. Thanks~

g-jing commented 5 years ago

There isn't much trickery in BERT fine-tuning, but I can share some details if that helps: batch size is 32, and the optimizer is SGD instead of Adam. BERT is fine-tuned on the NER task.
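To make that concrete, a rough TF 1.x sketch of swapping the repo's Adam-based optimizer for plain SGD might look like the following; the learning-rate value is a placeholder I made up, not something stated in this thread.

```python
import tensorflow as tf

def create_sgd_train_op(loss, learning_rate=0.01):
    """Plain SGD training op; learning_rate here is an illustrative placeholder."""
    global_step = tf.train.get_or_create_global_step()
    tvars = tf.trainable_variables()
    grads = tf.gradients(loss, tvars)
    # The repo's optimization.py clips gradients to a global norm of 1.0; kept here too.
    grads, _ = tf.clip_by_global_norm(grads, clip_norm=1.0)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    return optimizer.apply_gradients(zip(grads, tvars), global_step=global_step)
```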

dsindex commented 5 years ago

In my case, I got the best result (92.1 ~ 92.23 F1 by conlleval) on the CoNLL (English) data with the following setup (a rough TF 1.x sketch of the BiLSTM + CRF head follows at the end of this comment):

  1. batch size: 16
  2. learning rate: 2e-5
  3. BERT model: large
  4. optimizer: AdamWeightDecayOptimizer
    • warmup: 2 epochs
    • exponential decay: 2000 steps
  5. hidden size of the BiLSTM on top of the BERT layer: 200
  6. CRF on top of the BiLSTM: used
  7. BERT dropout: 0.1
  8. other dropout: 0.1
  9. data shuffle: used

Without the CRF, the F1 scores are in the range 91.2 ~ 91.8.

But 92.23 is not the average, and it is still behind the score reported in the BERT paper.

I think ELMo + GloVe embeddings are more powerful for NER (92.5 ~ 92.8 F1).
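For anyone who wants to reproduce roughly this setup, below is a minimal TF 1.x sketch of a BiLSTM (hidden size 200) + CRF head sitting on top of this repo's `modeling.BertModel(...).get_sequence_output()`. The function and argument names are my own assumptions; it is a sketch of the idea, not dsindex's actual code.

```python
import tensorflow as tf

def bilstm_crf_layer(sequence_output, labels, seq_lengths, num_labels,
                     lstm_size=200, dropout_rate=0.1, is_training=True):
    """sequence_output: [batch, max_seq_len, hidden] from BertModel.get_sequence_output()."""
    if is_training:
        sequence_output = tf.nn.dropout(sequence_output, keep_prob=1.0 - dropout_rate)

    # BiLSTM over BERT's contextual embeddings (200 units per direction).
    cell_fw = tf.nn.rnn_cell.LSTMCell(lstm_size)
    cell_bw = tf.nn.rnn_cell.LSTMCell(lstm_size)
    (out_fw, out_bw), _ = tf.nn.bidirectional_dynamic_rnn(
        cell_fw, cell_bw, sequence_output,
        sequence_length=seq_lengths, dtype=tf.float32)
    lstm_output = tf.concat([out_fw, out_bw], axis=-1)  # [batch, max_seq_len, 2*lstm_size]

    # Per-token emission scores for the CRF.
    logits = tf.layers.dense(lstm_output, num_labels)

    # CRF negative log-likelihood; the transition matrix is learned jointly.
    log_likelihood, transition_params = tf.contrib.crf.crf_log_likelihood(
        inputs=logits, tag_indices=labels, sequence_lengths=seq_lengths)
    loss = tf.reduce_mean(-log_likelihood)

    # Viterbi decoding for prediction.
    pred_ids, _ = tf.contrib.crf.crf_decode(logits, transition_params, seq_lengths)
    return loss, pred_ids
```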

g-jing commented 5 years ago

@dsindex I noticed that you added an LSTM layer on top of BERT. Do you think it performs better than without the LSTM? Thanks!

dsindex commented 5 years ago

@RoderickGu

I think the difference is not that significant, but it is better to use it. My experiments show that the LSTM gives a 0.1 ~ 0.2% gain over fine-tuned BERT alone.

g-jing commented 5 years ago

@dsindex Thanks for your suggestions

anjani-dhrangadhariya commented 4 years ago

> @dsindex I noticed that you added an LSTM layer on top of BERT. Do you think it performs better than without the LSTM? Thanks!

For all my NER tasks, an LSTM on top of BERT consistently boosts performance.

g-jing commented 4 years ago

> > @dsindex I noticed that you added an LSTM layer on top of BERT. Do you think it performs better than without the LSTM? Thanks!
>
> For all my NER tasks, an LSTM on top of BERT consistently boosts performance.

Those are interesting results.

mianzhiwj commented 3 years ago

In my NER task, bert-crf got a better F1 score than bert-softmax, by about 2%.
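For what it's worth, gaps like that are usually measured with entity-level F1. A small sketch using the seqeval package (which mirrors conlleval-style scoring); the package choice and the toy tag sequences are my assumptions, not data from this thread:

```python
from seqeval.metrics import f1_score, classification_report

# Toy gold and predicted tag sequences, purely for illustration.
gold = [["B-PER", "I-PER", "O", "B-LOC"]]
crf_pred = [["B-PER", "I-PER", "O", "B-LOC"]]
softmax_pred = [["B-PER", "O", "O", "B-LOC"]]

print("bert-crf F1:    ", f1_score(gold, crf_pred))
print("bert-softmax F1:", f1_score(gold, softmax_pred))
print(classification_report(gold, softmax_pred))
```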