google-research / bert

TensorFlow code and pre-trained models for BERT
https://arxiv.org/abs/1810.04805
Apache License 2.0

How to use BERT for sequence labelling #569

Open Joerg99 opened 5 years ago

Joerg99 commented 5 years ago

I'm planning to do NER with the BERT model. Unfortunately, there is no sample provided for such a task. After inspecting the code for a while I have an okay-ish understanding of the model. From my understanding, I have to set output_layer = model.get_sequence_output(). Next, I need a Processor for data input. Can I use an existing one (Xnli, Mnli, Mrpc, Cola) for my purposes, or do I have to create a new one? I know it is described elsewhere how the data has to be tokenized. Is there anything else I have to change? It would be great to hear from people with some experience with this.
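
Roughly, something like this is what I have in mind (just a sketch; create_ner_model and num_labels are my own placeholder names, based on the modeling.BertModel API in this repo):

    import tensorflow as tf
    import modeling  # from google-research/bert

    def create_ner_model(bert_config, is_training, input_ids, input_mask,
                         segment_ids, num_labels):
        model = modeling.BertModel(
            config=bert_config,
            is_training=is_training,
            input_ids=input_ids,
            input_mask=input_mask,
            token_type_ids=segment_ids)
        # [batch, seq_len, hidden]: one vector per token, unlike get_pooled_output()
        output_layer = model.get_sequence_output()
        hidden_size = output_layer.shape[-1].value
        output_weights = tf.get_variable(
            "output_weights", [num_labels, hidden_size],
            initializer=tf.truncated_normal_initializer(stddev=0.02))
        output_bias = tf.get_variable(
            "output_bias", [num_labels], initializer=tf.zeros_initializer())
        # per-token logits for the sequence labelling head
        logits = tf.einsum("bsh,lh->bsl", output_layer, output_weights) + output_bias
        return logits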

dsindex commented 5 years ago

There are many repositories for BERT NER; you can find them by googling 'bert ner topic'. You may refer to https://github.com/dsindex/BERT-BiLSTM-CRF-NER

https://github.com/dsindex/etagger

Joerg99 commented 5 years ago

@dsindex Nice repo. Can you explain why you use an RNN? Are you just using BERT embeddings and feeding them into an RNN? Why don't you use a simple classifier (linear layer plus softmax) as used (at least I guess) in the BERT paper?

dsindex commented 5 years ago

@Joerg99

The reason is simple :)

I ran many experiments using BERT.

experiments

Nevertheless, I couldn't reproduce the F1 score from the BERT paper with a simple classifier, so I just added BiLSTM + CRF layers on top of the BERT layer. It yields slightly better performance.
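
Roughly, the extra layers on top of BERT's sequence output look like this (a minimal TF 1.x sketch with assumed names, not the exact code from my repos):

    import tensorflow as tf

    def bilstm_crf_layer(sequence_output, label_ids, sequence_lengths,
                         num_labels, lstm_units=256):
        # sequence_output: [batch, max_seq_len, hidden] from model.get_sequence_output()
        cell_fw = tf.nn.rnn_cell.LSTMCell(lstm_units)
        cell_bw = tf.nn.rnn_cell.LSTMCell(lstm_units)
        (out_fw, out_bw), _ = tf.nn.bidirectional_dynamic_rnn(
            cell_fw, cell_bw, sequence_output,
            sequence_length=sequence_lengths, dtype=tf.float32)
        lstm_output = tf.concat([out_fw, out_bw], axis=-1)
        logits = tf.layers.dense(lstm_output, num_labels)
        # CRF log-likelihood only scores positions up to sequence_lengths
        log_likelihood, transition_params = tf.contrib.crf.crf_log_likelihood(
            logits, label_ids, sequence_lengths)
        loss = tf.reduce_mean(-log_likelihood)
        predictions, _ = tf.contrib.crf.crf_decode(
            logits, transition_params, sequence_lengths)
        return loss, predictions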

Joerg99 commented 5 years ago

@dsindex Haha ok I see :) I'd like to use it with the multilingual model. Do you think it's easy to adapt?

dsindex commented 5 years ago

@Joerg99

Well~ in my experience (it may not be the general case), the F1 score of the multilingual model (multi_cased_L-12_H-768_A-12) was about 1% below that of a BERT-base model pre-trained on a Korean corpus. (I used https://github.com/dsindex/etagger for it.)

congchan commented 5 years ago

@Joerg99

The reason is simple :)

I ran many experiments using BERT.

experiments

Nevertheless, I couldn't reproduce the F1 score from the BERT paper with a simple classifier, so I just added BiLSTM + CRF layers on top of the BERT layer. It yields slightly better performance.

Hi, have you experimented with BERT+CRF only?

dsindex commented 5 years ago

@congchan

Yes~ as you can see in the picture, the 7th row stands for 'BERT+CRF only' (the 11th column is the 'CRF used' header).

Joerg99 commented 5 years ago

@dsindex You built on the code from macanv, which provides serving. Any recommendation on how to use your version for serving?

dsindex commented 5 years ago

@Joerg99

The above experiments were based on https://github.com/dsindex/etagger

There is a web API for serving (inference/python/www).

But it is built not with the tf Estimator but with the low-level TF API (e.g. 'placeholder'). I think that is more flexible to manage.
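
Roughly, the low-level serving path looks like this (a sketch with assumed checkpoint path and tensor names, not the actual inference/python/www code):

    import tensorflow as tf  # TF 1.x

    graph = tf.Graph()
    with graph.as_default():
        # assumed checkpoint path; restores the trained graph definition
        saver = tf.train.import_meta_graph("checkpoint/model.ckpt.meta")
    sess = tf.Session(graph=graph)
    saver.restore(sess, "checkpoint/model.ckpt")

    # assumed tensor names; look them up from the placeholders you defined
    input_ids = graph.get_tensor_by_name("input_ids:0")
    input_mask = graph.get_tensor_by_name("input_mask:0")
    predictions = graph.get_tensor_by_name("predictions:0")

    def predict(ids, mask):
        # ids/mask: numpy arrays shaped [batch, max_seq_len]
        return sess.run(predictions, feed_dict={input_ids: ids, input_mask: mask})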

Joerg99 commented 5 years ago

@dsindex I'm trying to get the code to work with the simple Estimator serving. I don't need much flexibility, so in general this should suffice :) So far I added this snippet and I'm able to export the variables, but I get an error (AttributeError: module 'tensorflow.contrib.tpu.python.ops.tpu_ops' has no attribute 'tpu_replicate_metadata') and the .pb file is not exported:

import tensorflow as tf  # TF 1.x

def serving_input_receiver_fn():
    max_seq_len = 180
    # placeholders for the features the model_fn expects at serving time
    input_ids = tf.placeholder(dtype=tf.int32, shape=[None, max_seq_len], name="pl_in_ids")
    input_mask = tf.placeholder(dtype=tf.int32, shape=[None, max_seq_len], name="pl_in_mask")
    segment_ids = tf.placeholder(dtype=tf.int32, shape=[None, max_seq_len], name="pl_seg_ids")
    label_ids = tf.placeholder(dtype=tf.int32, shape=[None, max_seq_len], name="pl_label_ids")
    receiver_tensors = {"input_ids": input_ids, "input_mask": input_mask, "segment_ids": segment_ids, "label_ids": label_ids}
    features = {"input_ids": input_ids, "input_mask": input_mask, "segment_ids": segment_ids, "label_ids": label_ids}
    return tf.estimator.export.ServingInputReceiver(features, receiver_tensors)

estimator.export_saved_model("estimator_export_saved_model", serving_input_receiver_fn)

Maybe it's easier to adapt your code from etagger.

stevezheng23 commented 5 years ago

Here is the BERT extension project which includes a BERT-NER implementation, https://github.com/stevezheng23/bert_extension_tf.

This BERT extension project currently imports the google-research/bert repo as a submodule.

jind11 commented 5 years ago

Here is the BERT extension project which includes a BERT-NER implementation, https://github.com/stevezheng23/bert_extension_tf.

This BERT extension project currently imports the google-research/bert repo as a submodule.

Hi, in your experiments, it seems that we can never reproduce the 92.4 F1 score reported in the paper for the BERT-base model, right? I have searched a lot of implementations available online, and none of them even gets above 92.0. How was the original paper's result obtained? Thanks!

stevezheng23 commented 5 years ago

@jind11 Yes, the 5-run average F1 score didn't reach 92.4 or 92.8, but the best run can sometimes reach 92.5+. I think there might be some implementation details I'm missing. Also, I'll share more experiment results in https://github.com/stevezheng23/bert_extension_tf later.

jind11 commented 5 years ago

@stevezheng23 Wow, you mean for some random seeds, you can get 92.4 F1? Could you share that run configuration? And if convenient, could you share the checkpoint model parameters? Your help is greatly appreciated. Thanks!

stevezheng23 commented 5 years ago

Only for BERT-large, not for BERT-base.

I'll definitely share the experiment results and config settings in https://github.com/stevezheng23/bert_extension_tf early next week. As for the BERT-large model checkpoint, which is over 1.2 GB, do you still need it? If so, I can try to share it with you via cloud storage.

jind11 commented 5 years ago

@stevezheng23 yes, if convenient, please share it, thanks!

amankhandelia commented 5 years ago

Hi @stevezheng23, @dsindex, sorry to barge into this conversation. I am also working on sequence labeling using BERT, but I am getting a curious NaN error while calculating the gradient. Did you ever come across that error, and if so, can you share how you resolved it?

I sifted through both of your codebases, and I have this question: while calculating the loss, can we ignore the loss coming from PAD tokens and their corresponding labels? Will that make the model any better or any faster to train? Since neither of you has done that, is there any reason not to?

dsindex commented 5 years ago

@amankhandelia

If you get a NaN error, it may be due to the learning rate. (Just guessing.)

And I'm not sure I understand your question about ignoring the loss from PAD tokens; in my case, I calculate the loss value excluding the PAD area.

amankhandelia commented 5 years ago

@dsindex

Thanks a lot for the quick answer.

Let me clarify what I mean by the loss from the PAD tokens.

In your codebase, while processing the sequence you pad it to max_seq_length. But when you calculate the loss for each token, you are not excluding the PAD tokens (tokens with id 0) at the end of each sequence. So when I say ignoring the loss from the PAD tokens, I mean disregarding the portion of the loss contributed by these PAD tokens at the end of the sequence (by not calculating the loss for those particular tokens). I hope this makes things a bit clearer.

Based on the above, do you have anything to add? Also, can you elaborate on what you mean by "without PAD area"?

Thanks again for your time and support.

dsindex commented 5 years ago

@amankhandelia

The codebase you are pointing to is a forked version; I thought you were talking about dsindex/etagger.

https://github.com/dsindex/etagger/blob/master/model.py#L550

Here, I use 'sequence_lengths' to ignore the PAD area when computing the loss value (masking).
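
The idea is roughly this (a minimal TF 1.x sketch with assumed names, not the exact etagger code):

    import tensorflow as tf

    def masked_token_loss(logits, label_ids, sequence_lengths, max_seq_len):
        # logits: [batch, max_seq_len, num_labels], label_ids: [batch, max_seq_len]
        per_token_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=label_ids, logits=logits)
        # 1.0 for real tokens, 0.0 for PAD positions beyond sequence_lengths
        mask = tf.sequence_mask(sequence_lengths, max_seq_len, dtype=tf.float32)
        masked_loss = per_token_loss * mask
        # average only over real tokens
        return tf.reduce_sum(masked_loss) / tf.reduce_sum(mask)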

stevezheng23 commented 5 years ago

Yes, when calculating the loss, we usually use a position mask to mask out activation from [PAD] positions.

Something like this: masked_result = result * result_mask + MIN_FLOAT * (1 - result_mask)
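
In code, that pattern is roughly (assumed shapes and names, not from any particular repo):

    import tensorflow as tf

    MIN_FLOAT = -1e30

    def mask_logits(result, result_mask):
        # result: [batch, max_seq_len, num_labels] activations/logits
        # result_mask: [batch, max_seq_len, 1] with 1.0 for real tokens, 0.0 for [PAD]
        # keep values at real positions, push [PAD] positions to a huge negative value
        return result * result_mask + MIN_FLOAT * (1.0 - result_mask)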


congchan commented 5 years ago

@dsindex

The codebase you are pointing to is a forked version; I thought you were talking about dsindex/etagger.

https://github.com/dsindex/etagger/blob/master/model.py#L550

Here, I use 'sequence_lengths' to ignore the PAD area when computing the loss value (masking).

Any idea how to add weights to the CRF loss? Sometimes I need to upsample some labels. Thanks.

dsindex commented 5 years ago

@congchan

I have not tried applying weights other than 1s. I think that if we add weights to specific labels, the result would be biased. But it may be possible to add a weight for each word, like P(type | word), taken from external resources.
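
For example, one could weight the CRF loss per sentence, something like this (a rough, untested sketch with my own names, e.g. to upsample sentences that contain rare labels):

    import tensorflow as tf

    def weighted_crf_loss(logits, label_ids, sequence_lengths, example_weights):
        # example_weights: [batch], e.g. > 1.0 for sentences containing rare labels
        log_likelihood, transition_params = tf.contrib.crf.crf_log_likelihood(
            logits, label_ids, sequence_lengths)
        per_example_loss = -log_likelihood  # [batch]
        # weighted average over the batch
        return tf.reduce_sum(per_example_loss * example_weights) / tf.reduce_sum(example_weights)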