kemalaraz opened this issue 4 years ago
I am trying to replicate your NER results with Hugging Face: I take your base model as pre-trained, use BertForTokenClassification, and then train the model for NER, but it is not converging. Can you elaborate on that?
Hi @kemalaraz,
yes, there are no fine-tuned models stored on the model hub at the moment, only the ones with a normal LM head.
For fine-tuning I used FARM, with the corresponding configuration files in the ./config folder of this repo.
Could you paste the fine-tuning command that you've used with the run_ner.py script? Maybe you should adjust the number of epochs (for NER I used 10 epochs) :)
Hello again @stefan-it, thanks for the quick response. I wrote a big chunk of code for it, which is below:
# -*- coding: utf-8 -*-
import argparse

import numpy as np
import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from transformers import BertTokenizer, BertForTokenClassification, AdamW
from transformers import get_linear_schedule_with_warmup
from seqeval.metrics import f1_score
from tqdm import tqdm

from preprocess import create_bert_data

EPOCHS = 10
BATCH_SIZE = 16
MAX_GRAD_NORM = 1.0

parser = argparse.ArgumentParser()
parser.add_argument("-t", "--trainpath", required=True, help="Path to the raw training data")
parser.add_argument("-v", "--valpath", required=True, help="Path to the raw validation data")
parser.add_argument("-f", "--finetune", required=True,
                    help="Fine-tuning option; if False, only the classifier layer is trained")
args = vars(parser.parse_args())

if args["finetune"] == "True":
    fine_tune = True
elif args["finetune"] == "False":
    fine_tune = False
else:
    raise ValueError("Fine-tune must be a boolean value but got {}".format(args["finetune"]))

# Tokenizer
tokenizer = BertTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")

# Accuracy on a token level
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=2).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

# Set up whether the script will run on CPU or GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()

train_input, train_labels, uniq_ent, ent_idx, train_attention_masks = create_bert_data(args["trainpath"], train=True)
val_input, val_labels, val_attention_masks = create_bert_data(args["valpath"], train=False)

train_input = torch.tensor(train_input)
train_labels = torch.tensor(train_labels)
train_attention_masks = torch.tensor(train_attention_masks)
val_input = torch.tensor(val_input)
val_labels = torch.tensor(val_labels)
val_attention_masks = torch.tensor(val_attention_masks)

# Define the DataLoaders: shuffle the data at training time, iterate
# sequentially at validation time
train_dataset = TensorDataset(train_input, train_attention_masks, train_labels)
train_sampler = RandomSampler(train_dataset)
train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=BATCH_SIZE)
val_dataset = TensorDataset(val_input, val_attention_masks, val_labels)
val_sampler = SequentialSampler(val_dataset)
val_dataloader = DataLoader(val_dataset, sampler=val_sampler, batch_size=BATCH_SIZE)

# Write unique entities and their ids into a file
with open("entity_idx.txt", "w") as foo:
    foo.write(str(ent_idx))
print(ent_idx)

# Load the model and move it to the GPU if one is available
model = BertForTokenClassification.from_pretrained("dbmdz/bert-base-turkish-cased", num_labels=len(ent_idx))
model.to(device)

if not fine_tune:
    # Train only the classifier layer: freeze the BERT encoder
    for param in model.bert.parameters():
        param.requires_grad = False

# One optimizer step per batch, so steps per epoch = examples / batch size
steps_per_epoch = len(train_input) // BATCH_SIZE
num_training_steps = steps_per_epoch * EPOCHS
num_warmup_steps = int(num_training_steps * 0.4)

# Initialize the optimizer (to reproduce BertAdam-specific behavior, set correct_bias=False)
optimizer = AdamW(model.parameters(), lr=5e-5, correct_bias=False)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=num_warmup_steps,
                                            num_training_steps=num_training_steps)  # PyTorch scheduler

# For printing under progress bars
description_train = tqdm(total=0, position=1, bar_format='{desc}')
description_val = tqdm(total=0, position=1, bar_format='{desc}')

train_losses = []

# TRAINING LOOP
for epoch in range(EPOCHS):
    print("{:d}/{:d} EPOCH".format(epoch + 1, EPOCHS))
    model.train()
    tr_loss = 0
    nb_tr_examples, nb_tr_steps = 0, 0
    # Training progress bar
    pbar_train = tqdm(total=len(train_input), leave=False)
    # TRAINING STEP
    for step, batch in enumerate(train_dataloader):
        # move the batch to the device
        input_batch, input_mask_batch, labels_batch = [b.to(device) for b in batch]
        # forward pass
        loss, _ = model(input_batch, token_type_ids=None,
                        attention_mask=input_mask_batch, labels=labels_batch)
        # backprop
        loss.backward()
        # track train loss
        tr_loss += loss.item()
        nb_tr_examples += input_batch.size(0)
        nb_tr_steps += 1
        # gradient clipping to avoid exploding gradients
        torch.nn.utils.clip_grad_norm_(parameters=model.parameters(), max_norm=MAX_GRAD_NORM)
        # update parameters
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
        # increment the progress bar
        pbar_train.update(BATCH_SIZE)
        train_loss = tr_loss / nb_tr_steps
        description_train.set_description_str(f"Epoch: {epoch + 1}/{EPOCHS} - Loss: {train_loss}")
    # Print train loss per epoch
    print("Train loss: {:.5f}".format(tr_loss / nb_tr_steps))
    train_losses.append(tr_loss / nb_tr_steps)
    pbar_train.close()

    # VALIDATION STEP
    model.eval()
    eval_loss, eval_accuracy = 0, 0
    nb_eval_steps, nb_eval_examples = 0, 0
    predictions, true_labels = [], []
    # Validation progress bar
    pbar_val = tqdm(total=len(val_input), leave=True)
    for batch in val_dataloader:
        # move the batch to the device
        input_batch, input_mask_batch, labels_batch = [t.to(device) for t in batch]
        with torch.no_grad():
            outputs = model(input_batch, token_type_ids=None,
                            attention_mask=input_mask_batch, labels=labels_batch)
            temp_eval_loss, logits = outputs[:2]
        logits = logits.detach().cpu().numpy()
        label_ids = labels_batch.to("cpu").numpy()
        # note: every position is collected here, including padding
        predictions.extend([list(p) for p in np.argmax(logits, axis=2)])
        true_labels.append(label_ids)
        temp_eval_accuracy = flat_accuracy(logits, label_ids)
        eval_loss += temp_eval_loss.mean().item()
        eval_accuracy += temp_eval_accuracy
        nb_eval_examples += input_batch.size(0)
        nb_eval_steps += 1
        pbar_val.update(BATCH_SIZE)
    eval_loss = eval_loss / nb_eval_steps
    val_accuracy = eval_accuracy / nb_eval_steps
    predicted_entities = [uniq_ent[p_i] for p in predictions for p_i in p]
    valid_entities = [uniq_ent[l_ii] for l in true_labels for l_i in l for l_ii in l_i]
    f1_score_val = f1_score(valid_entities, predicted_entities)
    description_val.set_description_str(
        f"Validation Scores -> Loss : {eval_loss} , Accuracy : {val_accuracy} , F1-Score : {f1_score_val}")
    pbar_val.close()
The code above didn't converge: I am stuck at around 0.020... loss and an F1 score around 0.40. I also tried your berturk.json file with FARM, using run_experiment as shown below:
from farm.experiment import run_experiment, load_experiments
experiments = load_experiments("/home/karaz/Desktop/BERT_NER/FARM_TransferLearning/berturk.json")
run_experiment(experiments[0])
I can go with FARM as well, but when I executed the code above for FARM, the F1 score was 55%. What am I missing here? :) Thanks
Sorry to bother you this much, but I also tried different datasets and still had no luck.
Hi @kemalaraz,
I'm currently working on an evaluation for the WikiANN (balanced) dataset, so you could use this dataset as well to test the implementation:
The dataset can be retrieved from here, and it needs to be pre-processed, e.g. with:
import sys

filename = sys.argv[1]

with open(filename, "rt") as f_p:
    for line in f_p:
        line = line.rstrip()
        if not line:
            print("")
            continue
        token, label = line.split("\t")
        assert token.startswith("tr:")
        print(token[3:], label)
Just run python3 preprocess.py train > train.txt and do the same for dev and test.
It's important that the final dataset format is:
Büyük B-ORG
Ermenistan I-ORG
kurma O
girişimleri O
sona O
ermiştir O
. O
...
There's one token/label pair per line, delimited by a space. An empty line denotes a sentence boundary.
Just make sure that the dataset format is OK.
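For a quick sanity check, here is a minimal sketch that reads this format back in; the helper name read_sentences and the final count printout are just for illustration:

import sys

# Minimal sketch: parse the space-delimited "token label" format;
# an empty line closes the current sentence.
def read_sentences(path):
    sentences, tokens, labels = [], [], []
    with open(path, "rt") as f_p:
        for line in f_p:
            line = line.rstrip()
            if not line:
                if tokens:
                    sentences.append((tokens, labels))
                    tokens, labels = [], []
                continue
            token, label = line.split(" ")
            tokens.append(token)
            labels.append(label)
    if tokens:  # in case the file does not end with an empty line
        sentences.append((tokens, labels))
    return sentences

print(len(read_sentences(sys.argv[1])), "sentences")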
The 55% depends on the dataset: e.g. if your dataset is not balanced, or very noisy (like the WNUT NER datasets), this would explain the bad result. However, you should try to train on bert-base-multilingual-cased as well :)
Thank you so much for your interest, I'll try your suggestions and get back to you. By the way, I got the 55% with the dataset you mentioned in another issue, where you said you planned to evaluate the model for NER with https://github.com/UKPLab/linspector/tree/master/extrinsic. I used that set and parsed it one word and label at a time, but got bad results. In addition, the dataset should look like "Büyük B-ORG", but when preparing the input for BERT it should be "This is a sentence" as input and "O O ..." as labels, right?
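A rough sketch of the alignment in question, assuming word-level labels have to be expanded to BERT's WordPiece tokens (labeling only the first subword is one common convention, not necessarily what create_bert_data does):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")

words = ["Büyük", "Ermenistan", "kurma", "girişimleri"]
word_labels = ["B-ORG", "I-ORG", "O", "O"]

tokens, labels = [], []
for word, label in zip(words, word_labels):
    subwords = tokenizer.tokenize(word)
    tokens.extend(subwords)
    # label the first subword; mark the remaining pieces so they can be
    # excluded from the loss later
    labels.extend([label] + ["X"] * (len(subwords) - 1))

print(list(zip(tokens, labels)))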
I am sorry, but that dataset didn't work either. I hope you upload your trained NER model to this repo; otherwise it seems impossible to replicate your results.
Thanks
Hi @kemalaraz,
could you try to reproduce the NER results with the following commands:
Just clone the latest version of Transformers, install it via pip install -e ., and put the following bash script in the examples/ner folder:
mkdir tr-data
cd tr-data
for file in train.txt dev.txt test.txt labels.txt
do
wget https://schweter.eu/storage/turkish-bert-wikiann/$file
done
cd ..
It will download the pre-processed datasets with training, dev and test splits and put them in a tr-data folder.
After downloading the dataset, fine-tuning can be started. Just set the following environment variables:
export MAX_LENGTH=128
export BERT_MODEL=dbmdz/bert-base-turkish-cased
export OUTPUT_DIR=tr-model
export BATCH_SIZE=32
export NUM_EPOCHS=3
export SAVE_STEPS=625
export SEED=1
Then run fine-tuning:
python3 run_ner.py --data_dir ./tr-data \
--model_type bert \
--labels ./tr-data/labels.txt \
--model_name_or_path $BERT_MODEL \
--output_dir $OUTPUT_DIR-$SEED \
--max_seq_length $MAX_LENGTH \
--num_train_epochs $NUM_EPOCHS \
--per_gpu_train_batch_size $BATCH_SIZE \
--save_steps $SAVE_STEPS \
--seed $SEED \
--do_train \
--do_eval \
--do_predict \
--fp16
This will fine-tune the Turkish BERT model for 3 epochs. Afterwards, the results can be displayed with:
cat tr-model-1/eval_results.txt
cat tr-model-1/test_results.txt
In my experiment I got 92.60% on the development set and 92.19% on the test set.
I hope I can upload fine-tuned models this week.
I ran the same experiment as Stefan to reproduce the results. However, I got the error ImportError: cannot import name 'EvalPrediction' from 'transformers'. But I'll keep going and share the results soon.
Works like a charm, cheers bud. However, I still wonder which dataset you used to get 95% on the NER task?
In my experiments, I got similar results to @stefan-it, as follows:
Eval results:
precision = 0.916400580551524
recall = 0.9342309684101502
f1 = 0.9252298787412536
loss = 0.11335893666411284
Test results:
precision = 0.9192058759362955
recall = 0.9303010230367262
f1 = 0.9247201697271198
loss = 0.11182546521618497
When I load the trained model with:

import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

label_list = ["B-LOC", "B-ORG", "B-PER", "I-LOC", "I-ORG", "I-PER", "O"]

model = AutoModelForTokenClassification.from_pretrained("./transformers/examples/ner/tr-model-1/checkpoint-1875")
tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")

# tokenize and encode the same sentence so that tokens and predictions stay aligned
tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sentences[4])))
inputs = tokenizer.encode(sentences[4], return_tensors="pt")

outputs = model(inputs)[0]
predictions = torch.argmax(outputs, dim=2)

print([(token, label_list[prediction]) for token, prediction in zip(tokens, predictions[0].tolist())])
In the config file, label2id is:
"label2id": {
"B-LOC": 0,
"B-ORG": 1,
"B-PER": 2,
"I-LOC": 3,
"I-ORG": 4,
"I-PER": 5,
"O": 6
}
and when I test it on another dataset, I am getting really bad results. I can give you more information if you want, or share the model and related files with you. Can you please help me with that?
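One thing worth double-checking in the snippet above: the hard-coded label_list has to match this label2id ordering exactly. A minimal sketch (same checkpoint path as above) that derives the list from the model's own config instead:

from transformers import AutoModelForTokenClassification

# Sketch: read the id -> label mapping from the fine-tuned checkpoint's config
# instead of hard-coding it, so the ordering cannot drift.
model = AutoModelForTokenClassification.from_pretrained(
    "./transformers/examples/ner/tr-model-1/checkpoint-1875")
id2label = {int(i): label for i, label in model.config.id2label.items()}
label_list = [id2label[i] for i in range(len(id2label))]
print(label_list)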
Could you post an example sentence, or a few sentences, from your dataset so that I can test it? 🤔
I trained and uploaded a fine-tuned model to the Transformers model hub as savasy/bert-base-turkish-ner-cased. Please check the following code:
from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer
model = AutoModelForTokenClassification.from_pretrained("savasy/bert-base-turkish-ner-cased")
tokenizer = AutoTokenizer.from_pretrained("savasy/bert-base-turkish-ner-cased")
ner = pipeline('ner', model=model, tokenizer=tokenizer)
ner("Mustafa Kemal Atatürk 19 Mayıs 1919'da Samsun'a ayak bastı.")
The output looks like this:
[{'word': 'Mustafa', 'score': 0.9938516616821289, 'entity': 'B-PER'},
 {'word': 'Kemal', 'score': 0.9881671071052551, 'entity': 'I-PER'},
 {'word': 'Atatürk', 'score': 0.9957979321479797, 'entity': 'I-PER'},
 {'word': 'Samsun', 'score': 0.9059973359107971, 'entity': 'B-LOC'}]
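As a side note, depending on the installed transformers version, the pipeline may also merge the pieces of multi-token entities for you; the grouped_entities flag below assumes a sufficiently recent release:

from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer

model = AutoModelForTokenClassification.from_pretrained("savasy/bert-base-turkish-ner-cased")
tokenizer = AutoTokenizer.from_pretrained("savasy/bert-base-turkish-ner-cased")

# grouped_entities merges consecutive B-/I- pieces into single entity spans
ner = pipeline('ner', model=model, tokenizer=tokenizer, grouped_entities=True)
print(ner("Mustafa Kemal Atatürk 19 Mayıs 1919'da Samsun'a ayak bastı."))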
I got the dataset from https://github.com/UKPLab/linspector/blob/master/extrinsic/data/ner/tr/test.txt and am getting bad results. @savasy, thank you for your interest and for the example :) I trained the model and there is no problem with that; I am trying the model on various datasets that are unseen to the model and not getting good results, but I will double-check everything again when I have time. But again, thank you :)
The NER dataset in the linspector repo was automatically annotated; I just found the paper: https://arxiv.org/abs/1702.02363
When looking at the test set, I could find some tagging errors:
Tirol X B-X B-LOCATION
ve X B-X O
Vorarlberg'i X B-X O
Avusturya X B-X B-LOCATION
İmparatorluğu'na X B-X O
bırakan X B-X O
Bavyera X B-X B-LOCATION
, X B-X O
Aschaffenburg X B-X B-MISC
ile X B-X O
Hessen X B-X B-LOCATION
Darmstadt'ın X B-X O
bir X B-X O
kısmını X B-X O
elde X B-X O
etti X B-X O
. X B-X O
So, e.g., "Aschaffenburg" is clearly a location, and "Darmstadt" and "Vorarlberg" are as well.
@savasy Thanks for uploading the model :+1: I just used it to tag the sentence, and "Aschaffenburg", "Darmstadt" and "Vorarlberg" are tagged as locations.
You are welcome, @stefan-it @kemalaraz. And "İmparatorluğu" must be I-LOCATION as well. It seems the linspector dataset could be misleading. Can you share another dataset here, @kemalaraz and @stefan-it, so that we can train/test on it?
You can get it from this link; however, I haven't tried the model on that dataset. I am working on a different project right now; I will write a parser for the ENAMEX format and then try that.
nerdata.txt
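For what it's worth, a rough sketch of such an ENAMEX-to-CoNLL conversion; the markup assumed here (&lt;ENAMEX TYPE="..."&gt;...&lt;/ENAMEX&gt; inline, one sentence per line) may differ from the actual file:

import re
import sys

# Rough sketch: emit one "token label" pair per line, with an empty line
# between sentences, from ENAMEX-style inline markup.
PATTERN = re.compile(r'<ENAMEX TYPE="([^"]+)">([^<]+)</ENAMEX>|(\S+)')

with open(sys.argv[1], "rt") as f_p:
    for sentence in f_p:
        for ent_type, ent_text, plain in PATTERN.findall(sentence.strip()):
            if plain:
                print(plain, "O")
            else:
                words = ent_text.split()
                print(words[0], "B-" + ent_type)
                for word in words[1:]:
                    print(word, "I-" + ent_type)
        print()  # sentence boundary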
Thank you, that is plenty of data: 10889 TYPE="LOCATION", 10031 TYPE="ORGANIZATION", 16293 TYPE="PERSON".
I will give it a try
The performance on the data given by @kemalaraz is as follows:
savas@savas-lenova:~/Desktop/trans/tr-new-model-1$ cat eval_results.txt
precision = 0.9461980692049029
recall = 0.959309358847465
f1 = 0.9527086063783312
loss = 0.037054269206847804
savas@savas-lenova:~/Desktop/trans/tr-new-model-1$ cat test_results.txt
precision = 0.9458370635631155
recall = 0.9588201928530913
f1 = 0.952284378344882
loss = 0.035431676572445225
Hi @kemalaraz and @stefan-it
Where did you get this NER dataset that Kemal shared below?
https://github.com/stefan-it/turkish-bert/files/4558187/nerdata.txt
Is it from the paper below? https://www.aclweb.org/anthology/P11-3019.pdf
Hi @savasy, I checked that dataset and it should be identical to the one that I've used for the NER experiments :)
It seems the Hugging Face repository contains only the base model; I couldn't find the NER model and its tokenizer. Where can I find the trained NER model and, if it is not too much to ask, how can I load and use it easily?