kpe / bert-for-tf2

A Keras TensorFlow 2.0 implementation of BERT, ALBERT and adapter-BERT.
https://github.com/kpe/bert-for-tf2
MIT License

Named-entity recognition #42

Open mvandenbout opened 4 years ago

mvandenbout commented 4 years ago

How would you approach named-entity recognition with this library?

atreyasha commented 4 years ago

I am working on a similar sequence tagging task for argument candidate identification. Essentially, BERT or ALBERT would handle encoding the raw input; you would then need a layer on top of BERT|ALBERT to decode the representations into the desired target sequence.

I would essentially follow this example here: https://github.com/kpe/bert-for-tf2/blob/master/examples/gpu_movie_reviews.ipynb

Under create_model, you would need to modify the layers after the BERT|ALBERT layer to map to your output sequence dimension. I will probably do this task in another repo and can post some results soon.

@kpe you mentioned in #30 that the activations of the padding should be ignored in the output layer; would you also suggest doing this for a sequence tagging task? If so, how would you propose doing it in the output layer?
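
One way I could imagine handling it (just a sketch, not necessarily what you have in mind) is to mask the padded positions out of the loss rather than out of the output layer itself, e.g. by giving padded positions an all-zero target row and averaging only over the real tokens:

import tensorflow as tf

# Sketch of a masked loss for token-level tagging. Assumes that padded
# positions carry an all-zero one-hot target row, so they are excluded
# from both the per-token loss and the normalisation.
def masked_categorical_crossentropy(y_true, y_pred):
    loss = tf.keras.losses.categorical_crossentropy(y_true, y_pred)  # (batch, seq_len)
    mask = tf.reduce_sum(y_true, axis=-1)                            # 1.0 for real tokens, 0.0 for padding
    return tf.reduce_sum(loss * mask) / (tf.reduce_sum(mask) + 1e-8)

It could then be passed to the model via model.compile(loss=masked_categorical_crossentropy, ...).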

Also, thank you for this awesome repo. Minor issue though: under NEWS on the readme, I think the first entry should be 6th Jan 2020. Just a minor thing, no biggie :)

harrystuart commented 4 years ago

Any update on NER tasks with this library?

yingchengsun commented 2 years ago

A NER example with this library would be very helpful!

ptamas88 commented 2 years ago

Hi, as I managed to use this library for a NER task, I am happy to share my experience. Sorry, I can't share the whole code, but I'll try to explain the key parts.

1) The input text is tokenized by the tokenizer module and padded to a specified max length (in my case 200 tokens at most).
2) For each token the output tag is transformed into a one-hot vector, and if the tokenizer broke one word into multiple tokens, I used the word's tag for the first token and [MASK] for the remaining pieces of the original word.
3) So if I have X sentences in the training set, the input shape is (X, 200), where 200 is the padded length of each sentence. The output shape is then (X, 200, NUMBER_OF_TAGS). NUMBER_OF_TAGS is the number of your entity types, which depends on whether you use BIOE or just BIO, plus the special tokens [CLS], [PAD] and [MASK]. In my case the tags are: ['B-ORG', 'I-ORG', 'B-MISC', 'I-MISC', 'B-LOC', 'I-LOC', 'B-PER', 'I-PER', 'O', '[CLS]', '[MASK]', '[PAD]'], so my shapes are (X, 200) and (X, 200, 12). (A rough sketch of this encoding is at the end of this comment.)
4) Load the BERT model the same way as in the classification example, but use a different architecture for the remaining layers, since this is not just classification. This is basically the example code from the package description with a little tweak:

import tensorflow as tf
import bert as bert_tf2  # the bert-for-tf2 package

# bert_params comes from the pretrained checkpoint,
# e.g. bert_params = bert_tf2.params_from_pretrained_ckpt(model_dir)
bert_layer = bert_tf2.BertModelLayer.from_params(bert_params, name="bert")

input_ids = tf.keras.layers.Input(shape=(200,), dtype='int32')      # (batch, 200) padded token ids
output = bert_layer(input_ids)                                      # (batch, 200, hidden_size)
output = tf.keras.layers.TimeDistributed(
    tf.keras.layers.Dense(units=12, activation='softmax'))(output)  # (batch, 200, 12) tag probabilities
model = tf.keras.models.Model(inputs=input_ids, outputs=output)

model.build(input_shape=(None, 200))

bert_layer.apply_adapter_freeze()
bert_layer.embeddings_layer.trainable = False
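
Compiling and training the model could then look roughly like this (the optimizer, learning rate and batch size here are just my own choices; the loss and metric are inferred from the numbers reported below):

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
              loss='categorical_crossentropy',
              metrics=['categorical_accuracy'])

# train_x: (X, 200) padded token ids, train_y: (X, 200, 12) one-hot tags
model.fit(train_x, train_y, batch_size=16, epochs=1, validation_split=0.1)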

The magic in the model itself is the TimeDistributed wrapper layer. My results after just 1 epoch on 29k training sentences: loss: 0.0227 - categorical_accuracy: 0.9933 - val_loss: 0.0042 - val_categorical_accuracy: 0.9988
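
For completeness, the encoding described in steps 1)-3) could be sketched like this (variable and helper names are illustrative, not my exact code; adding [CLS] at the beginning of each sentence is left out for brevity):

import numpy as np

MAX_LEN = 200
TAGS = ['B-ORG', 'I-ORG', 'B-MISC', 'I-MISC', 'B-LOC', 'I-LOC',
        'B-PER', 'I-PER', 'O', '[CLS]', '[MASK]', '[PAD]']
tag2idx = {t: i for i, t in enumerate(TAGS)}

# tokenizer is e.g. a bert.bert_tokenization.FullTokenizer built from the checkpoint vocab
def encode_sentence(words, word_tags, tokenizer):
    token_ids, tag_ids = [], []
    for word, tag in zip(words, word_tags):
        pieces = tokenizer.tokenize(word)
        token_ids += tokenizer.convert_tokens_to_ids(pieces)
        # the first piece keeps the word's tag, the remaining pieces get '[MASK]'
        tag_ids += [tag2idx[tag]] + [tag2idx['[MASK]']] * (len(pieces) - 1)
    # pad / truncate to MAX_LEN; id 0 is '[PAD]' in the BERT vocab
    token_ids = (token_ids + [0] * MAX_LEN)[:MAX_LEN]
    tag_ids = (tag_ids + [tag2idx['[PAD]']] * MAX_LEN)[:MAX_LEN]
    return np.array(token_ids), np.eye(len(TAGS))[tag_ids]  # shapes (200,) and (200, 12)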

So basically, that's it folks :)