microsoft / nlp-recipes

Natural Language Processing Best Practices & Examples
MIT License

[ASK] Adding custom entity labels to BERT NER #436

Open atakanokan opened 5 years ago

atakanokan commented 5 years ago

Description

Is it possible to fine-tune BERT NER on custom entity labels other than those shown in https://github.com/microsoft/nlp/blob/master/examples/named_entity_recognition/ner_wikigold_bert.ipynb (Cell 4)?

Unique entity labels: 
['O', 'I-LOC', 'I-MISC', 'I-PER', 'I-ORG']

Other Comments

It seems possible, but I wanted to make sure. Proposed procedure:

  1. Write a custom dataset file like the one for the Wikigold dataset: https://github.com/microsoft/nlp/blob/master/utils_nlp/dataset/wikigold.py
  2. Use this module to load the entity labels, then train with the same code as in the notebook.

hlums commented 5 years ago

@atakanokan Yes, it's possible to have custom entity labels. It's like a multi-class classification problem: the model can handle any labels that exist in the training data. If your dataset is in a standard CoNLL format like https://github.com/pritishuplavikar/Named-Entity-Recognition/blob/master/wikigold.conll.txt, you can use https://github.com/microsoft/nlp/blob/master/utils_nlp/dataset/ner_utils.py to preprocess it, as shown in wikigold.py.
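
To illustrate the idea, here is a minimal sketch of reading CoNLL-style text and building a label map from whatever tags appear in the data. It does not use the repo's ner_utils.py. The function names `read_conll` and `unique_labels`, the sample tags (`B-COMPANY`, `B-PRODUCT`), and the assumed layout (one token per line with the tag in the last column, blank lines between sentences) are assumptions for illustration only.

```python
from typing import Dict, List, Tuple


def read_conll(text: str) -> Tuple[List[List[str]], List[List[str]]]:
    """Parse CoNLL-style text into parallel lists of tokens and tags.

    Assumes one token per line, the tag in the last whitespace-separated
    column, and a blank line (or a -DOCSTART- line) between sentences.
    """
    sentences, labels = [], []
    tokens, tags = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("-DOCSTART-"):
            # Sentence boundary: flush the current sentence, if any.
            if tokens:
                sentences.append(tokens)
                labels.append(tags)
                tokens, tags = [], []
            continue
        parts = line.split()
        tokens.append(parts[0])
        tags.append(parts[-1])
    if tokens:  # flush the last sentence when the text lacks a trailing blank line
        sentences.append(tokens)
        labels.append(tags)
    return sentences, labels


def unique_labels(labels: List[List[str]]) -> List[str]:
    """Collect the distinct tags that occur in the data, sorted for stability."""
    return sorted({tag for sentence in labels for tag in sentence})


# Hypothetical custom entity types, made up for this example.
sample = """Acme B-COMPANY
Rocket I-COMPANY
ships O
X200 B-PRODUCT

Hello O
"""

sentences, labels = read_conll(sample)
label_map: Dict[str, int] = {label: i for i, label in enumerate(unique_labels(labels))}
print(label_map)
```

The label map is derived entirely from the training data, which is why custom entity types need no change to the model code beyond the number of output classes.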

daden-ms commented 4 years ago

@atakanokan When creating the tags for your entities, please make sure they follow the IOB (inside-outside-beginning) tagging format: https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)
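
A quick sanity check along these lines can catch malformed tags before training. This is only a sketch: `check_iob_tags` is a hypothetical helper, not part of utils_nlp. By default it only checks that each tag is `O`, `B-TYPE` or `I-TYPE`, which accepts the I-only Wikigold labels shown above. The optional `strict_iob2` flag additionally requires every entity to start with a `B-` tag (the IOB2 variant).

```python
from typing import List, Tuple


def check_iob_tags(tags: List[str], strict_iob2: bool = False) -> List[Tuple[int, str, str]]:
    """Return (index, tag, problem) tuples for the tags of one sentence."""
    problems = []
    prev_type = None  # entity type of the previous tag; None after 'O' or a bad tag
    for i, tag in enumerate(tags):
        if tag == "O":
            prev_type = None
            continue
        prefix, sep, ent_type = tag.partition("-")
        if prefix not in ("B", "I") or sep != "-" or not ent_type:
            problems.append((i, tag, "not of the form O, B-TYPE or I-TYPE"))
            prev_type = None
            continue
        # In IOB2, an I- tag must continue an entity of the same type.
        if strict_iob2 and prefix == "I" and prev_type != ent_type:
            problems.append((i, tag, "I- tag does not continue an entity of the same type"))
        prev_type = ent_type
    return problems


wikigold_style = ["I-PER", "I-PER", "O", "I-LOC"]
loose = check_iob_tags(wikigold_style)
strict = check_iob_tags(wikigold_style, strict_iob2=True)
malformed = check_iob_tags(["B-PER", "I-PER", "O", "PERSON"])
print(loose, strict, malformed, sep="\n")
```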

Kc2fresh commented 4 years ago

Do all token classification scenarios require the input data to be in CoNLL format? I want to use this for a 3-tag multi-label classification over custom sentences, where each tag is mapped to a chunk of tokens that together form a semantic representation, not to a single token.

cryoff commented 3 years ago

@Kc2fresh Have you been able to solve your task? If I understand correctly, it is label prediction for multi-token mentions.