Joerg99 opened 5 years ago
There are many repositories for BERT NER; try googling 'bert ner topic'. You may refer to https://github.com/dsindex/BERT-BiLSTM-CRF-NER
https://github.com/dsindex/etagger
@dsindex Nice repo. Can you explain why you use an RNN? Are you just using BERT embeddings and feeding them into an RNN? Why don't you use a simple classifier (linear layer plus softmax) as used (at least I guess) in the BERT paper?
@Joerg99
The reason is simple :) I ran many experiments using BERT; nevertheless, I couldn't reproduce the F1 score reported in the BERT paper (with a simple classifier). So I just added BiLSTM + CRF layers on top of the BERT layer. It yields slightly better performance.
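For reference, the decoding step such a CRF head performs can be sketched in plain numpy. This is a minimal illustration, not the etagger implementation: the per-token emission scores (from the BiLSTM over BERT outputs) and the tag-transition scores are assumed to be given already (in TF1 they would come from something like tf.contrib.crf).

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Find the best tag sequence given per-token emission scores
    (e.g. from a BiLSTM over BERT embeddings) and CRF transition scores.
    emissions: [seq_len, num_tags], transitions: [num_tags, num_tags]."""
    seq_len, num_tags = emissions.shape
    score = emissions[0].copy()          # best score ending in each tag so far
    backpointers = []
    for t in range(1, seq_len):
        # score of every transition i -> j plus the emission at step t
        total = score[:, None] + transitions + emissions[t][None, :]
        backpointers.append(np.argmax(total, axis=0))
        score = np.max(total, axis=0)
    best_last = int(np.argmax(score))
    path = [best_last]
    for bp in reversed(backpointers):    # follow backpointers to recover the path
        path.append(int(bp[path[-1]]))
    return list(reversed(path))
```

With zero transition scores this reduces to per-token argmax; the transition matrix is what lets the CRF forbid invalid tag sequences such as I-PER directly after B-ORG.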
@dsindex Haha ok I see :) I'd like to use it with the multilingual model. Do you think it's easy to adapt?
@Joerg99
Well, in my experience (it is not a general case), the F1 score of the multilingual model (multi_cased_L-12_H-768_A-12) was about 1% below that of a BERT-base model pre-trained on a Korean corpus. (I used https://github.com/dsindex/etagger for it.)
Hi, have you experimented with BERT+CRF only?
@congchan
Yes, as you can see in the picture, the 7th row stands for 'BERT+CRF only' ('the 11th column' == 'CRF used').
@dsindex You built on the code from macanv which provides serving. Any recommendation on how to use your version for serving?
@Joerg99
The above experiments were based on https://github.com/dsindex/etagger. There is a web API for serving (inference/python/www), but it is built not with tf.estimator but with the TF low-level API (e.g. 'placeholder'). I think that is more flexible to manage.
@dsindex I'm trying to get the code working with simple Estimator serving. I don't need much flexibility, so in general this should suffice :) So far I added this snippet; I'm able to export the variables, but I get an error (AttributeError: module 'tensorflow.contrib.tpu.python.ops.tpu_ops' has no attribute 'tpu_replicate_metadata') and the .pb file is not exported:

```python
import tensorflow as tf

def serving_input_receiver_fn():
    max_seq_len = 180
    input_ids = tf.placeholder(dtype=tf.int32, shape=[None, max_seq_len], name="pl_in_ids")
    input_mask = tf.placeholder(dtype=tf.int32, shape=[None, max_seq_len], name="pl_in_mask")
    segment_ids = tf.placeholder(dtype=tf.int32, shape=[None, max_seq_len], name="pl_seg_ids")
    label_ids = tf.placeholder(dtype=tf.int32, shape=[None, max_seq_len], name="pl_label_ids")
    receiver_tensors = {"input_ids": input_ids, "input_mask": input_mask,
                        "segment_ids": segment_ids, "label_ids": label_ids}
    features = {"input_ids": input_ids, "input_mask": input_mask,
                "segment_ids": segment_ids, "label_ids": label_ids}
    return tf.estimator.export.ServingInputReceiver(features, receiver_tensors)

estimator.export_saved_model("estimator_export_saved_model", serving_input_receiver_fn)
```
Maybe it's easier to adapt your code from etagger.
Here is the BERT extension project, which includes a BERT-NER implementation: https://github.com/stevezheng23/bert_extension_tf.
This BERT extension project currently imports the google-research/bert repo as a submodule.
Hi, in your experiments, it seems that we can never reproduce the 92.4 F1 score reported in the paper for the BERT-base model, right? I have searched a lot of implementations available online, and none of them even reach 92.0. How was the original paper's result obtained? Thanks!
@jind11 Yes, the 5-run average F1 score didn't reach 92.4 or 92.8, but the best run can sometimes reach 92.5+. I think there might be some implementation details I'm missing. Also, I'll share more experiment results in https://github.com/stevezheng23/bert_extension_tf later.
@stevezheng23 Wow, you mean for some random seeds, you can get 92.4 F1? Could you share that run configuration? And if convenient, could you share the checkpoint model parameters? Your help is greatly appreciated. Thanks!
Only for BERT-large, not for BERT-base
I'll definitely share the experiment results and config settings in https://github.com/stevezheng23/bert_extension_tf early next week. As for the BERT-large model checkpoint, which is over 1.2G, do you still need it? If so, I can try to share it with you via cloud storage.
@stevezheng23 yes, if convenient, please share it, thanks!
Hi @stevezheng23, @dsindex, sorry to crash into this conversation. I am also working on sequence labeling using BERT, but I am getting a curious NaN error while calculating gradients. Did you ever come across that error? If yes, can you share how to resolve it?
I sifted through both of your codebases, so I have this question: while calculating the loss, can we ignore the loss coming from PAD tokens and their corresponding labels? Will that make the model any better or any faster to train? Since you have not done that, is there any reason not to?
@amankhandelia
If you got a NaN error, it may be due to the learning rate. (Just guessing.)
And I am not sure about your question on ignoring the loss from PAD tokens; in my case, I calculate the loss value without the PAD area.
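Besides lowering the learning rate, clipping the global gradient norm is a common guard against exploding gradients and NaNs (BERT's own optimizer does this via tf.clip_by_global_norm). A small numpy sketch of what that operation computes; the gradient arrays here are hypothetical stand-ins:

```python
import numpy as np

def clip_by_global_norm(grads, clip_norm):
    """Scale all gradients so their combined L2 norm is at most clip_norm
    (mirrors the behaviour of tf.clip_by_global_norm)."""
    global_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    if global_norm <= clip_norm:
        return [g.copy() for g in grads], global_norm
    scale = clip_norm / global_norm  # shrink every gradient by the same factor
    return [g * scale for g in grads], global_norm
```

Because every gradient is scaled by the same factor, the update direction is preserved; only the step size is capped.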
@dsindex
Thanks a lot for the quick answer.
Let me clarify what I mean by the loss from the PAD tokens. In your codebase, while processing a sequence, you pad it to max_seq_length. But when you calculate the loss for each token, you are not excluding the PAD tokens (tokens with id 0) at the end of each sequence. So when I say ignoring the loss from the PAD tokens, I mean disregarding the portion of the loss contributed by those PAD tokens at the end of the sequence (by not calculating the loss for those particular tokens). I hope this makes things a bit clearer.
Based on the above, do you have anything to add? Also, can you elaborate on what you mean by "without PAD area"?
Thanks again for your time and support.
@amankhandelia
The codebase you point out is a forked version; I thought you were talking about dsindex/etagger.
https://github.com/dsindex/etagger/blob/master/model.py#L550
Here, I use 'sequence_lengths' to ignore the PAD area when computing the loss value (masking).
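A minimal numpy sketch of that idea, independent of etagger's actual code: build a 0/1 mask from the sequence lengths and average the per-token loss over real tokens only, so whatever garbage loss the PAD positions produce never reaches the gradient.

```python
import numpy as np

def masked_token_loss(per_token_loss, sequence_lengths):
    """Average a [batch, max_seq_len] per-token loss over real tokens only,
    ignoring the PAD positions beyond each sequence length."""
    batch, max_len = per_token_loss.shape
    # mask[b, t] = 1 while t < sequence_lengths[b], else 0
    mask = np.arange(max_len)[None, :] < np.array(sequence_lengths)[:, None]
    mask = mask.astype(per_token_loss.dtype)
    return float(np.sum(per_token_loss * mask) / np.sum(mask))
```

Note that the PAD positions can hold arbitrary values (here even huge ones) without affecting the result, which is the point of the masking.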
Yes, when calculating the loss, we usually use a position mask to mask out activation from [PAD] positions. Something like this: `masked_result = result * result_mask + MIN_FLOAT * (1 - result_mask)`
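That one-liner can be demonstrated end to end in numpy: MIN_FLOAT is assumed to be a very large negative constant, so after masking, the [PAD] positions behave like -inf logits and receive essentially zero probability under softmax.

```python
import numpy as np

MIN_FLOAT = -1e30  # very large negative constant standing in for -inf

def masked_softmax(logits, mask):
    """Push [PAD] positions (mask == 0) toward -inf before softmax,
    so they end up with ~0 probability."""
    masked = logits * mask + MIN_FLOAT * (1.0 - mask)
    exp = np.exp(masked - np.max(masked))  # subtract max for numerical stability
    return exp / np.sum(exp)
```

The same trick is used anywhere a softmax must ignore positions, e.g. attention over padded sequences.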
Any idea how to add weights to the CRF loss? Sometimes I need to upsample some labels. Thanks.
@congchan
I have not tried applying weights other than 1s. I think that if we add weight to specific labels, the result would be biased. But it may be possible to add a weight for each word, like P(type | word), which comes from external resources.
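For a per-token loss, label weighting is straightforward to sketch in numpy (with a CRF it is less clean, since the CRF loss is sequence-level rather than a sum of independent token losses). This is an illustrative sketch only; the function and its weight table are hypothetical:

```python
import numpy as np

def weighted_token_loss(per_token_loss, label_ids, label_weights):
    """Scale each token's loss by a per-label weight (e.g. > 1 to upsample
    rare labels). label_weights maps label id -> weight; default weight is 1."""
    weights = np.array([label_weights.get(int(l), 1.0) for l in label_ids])
    return float(np.sum(per_token_loss * weights) / np.sum(weights))
```

As noted above, upweighting specific labels biases the model toward predicting them, so such weights usually need tuning against dev-set F1.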
I'm planning to do NER with the BERT model. Unfortunately there is no sample provided for such a task. After inspecting the code for a while I have an okayish understanding of the model. From my understanding I have to set `output_layer = model.get_sequence_output()`. Next, I need a Processor for data input. Can I use an existing one (XNLI, MNLI, MRPC, CoLA) for my purposes, or do I have to create a new one? I know it is described here how the data has to be tokenized. Is there anything else I have to change? It would be great to hear from people with some experience with this.
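To illustrate what using `get_sequence_output()` implies: unlike the classification tasks, which project the single pooled vector, NER projects every token's hidden vector to label logits. A numpy sketch of that per-token head, for one example without the batch dimension (the projection weights here are hypothetical, standing in for a trained dense layer):

```python
import numpy as np

def token_classifier(sequence_output, W, b):
    """Per-token NER head: project each token's BERT vector to label logits
    and take the argmax. sequence_output: [seq_len, hidden],
    W: [hidden, num_labels], b: [num_labels]."""
    logits = sequence_output @ W + b                        # [seq_len, num_labels]
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)              # softmax per token
    return probs.argmax(axis=-1)                            # one label id per token
```

This also shows why a new Processor is needed: the existing ones yield one label per example, while NER needs a label sequence aligned with the WordPiece tokens.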