kemalaraz opened this issue 4 years ago
I am trying to replicate your NER results with Hugging Face: I take your base model as pre-trained, use BertForTokenClassification, and then train the model for NER, but it is not converging. Can you elaborate on that?
Hi @kemalaraz,
yes, there are no fine-tuned models stored on the model hub at the moment, only the ones with a normal LM head.
For fine-tuning I used FARM, with the corresponding configuration files in the ./config folder of this repo.
Could you paste the fine-tuning command that you've used with the run_ner.py script? Maybe you should adjust the number of epochs (for NER I used 10 epochs) :)
Hello again @stefan-it, thanks for the quick response. I wrote a big chunk of code for it, which is below:
# -*- coding: utf-8 -*-
import argparse

import numpy as np
import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from transformers import BertTokenizer, BertForTokenClassification, AdamW
from transformers import get_linear_schedule_with_warmup
from seqeval.metrics import f1_score
from tqdm import tqdm

from preprocess import create_bert_data

EPOCHS = 10
BATCH_SIZE = 16
MAX_GRAD_NORM = 1.0

parser = argparse.ArgumentParser()
parser.add_argument("-t", "--trainpath", required=True, help="Path to the raw training data")
parser.add_argument("-v", "--valpath", required=True, help="Path to the raw validation data")
parser.add_argument("-f", "--finetune", required=True,
                    help="Fine-tuning option; if False, only the classifier layer is trained")
args = vars(parser.parse_args())

if args["finetune"] == "True":
    fine_tune = True
elif args["finetune"] == "False":
    fine_tune = False
else:
    raise ValueError("Fine-tune must be a boolean value but got {}".format(args["finetune"]))

# Tokenizer
tokenizer = BertTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")

# Accuracy on a token level
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=2).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

# Set up whether the script will run on CPU or GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()

train_input, train_labels, uniq_ent, ent_idx, train_attention_masks = create_bert_data(args["trainpath"], train=True)
val_input, val_labels, val_attention_masks = create_bert_data(args["valpath"], train=False)

train_input = torch.tensor(train_input)
train_labels = torch.tensor(train_labels)
train_attention_masks = torch.tensor(train_attention_masks)
val_input = torch.tensor(val_input)
val_labels = torch.tensor(val_labels)
val_attention_masks = torch.tensor(val_attention_masks)

# Define the DataLoaders: shuffle the data at training time, iterate
# sequentially at validation time
train_dataset = TensorDataset(train_input, train_attention_masks, train_labels)
train_sampler = RandomSampler(train_dataset)
train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=BATCH_SIZE)
val_dataset = TensorDataset(val_input, val_attention_masks, val_labels)
val_sampler = SequentialSampler(val_dataset)
val_dataloader = DataLoader(val_dataset, sampler=val_sampler, batch_size=BATCH_SIZE)

# Write unique entities and their ids into a file
with open("entity_idx.txt", "w") as foo:
    foo.write(str(ent_idx))
print(ent_idx)

# Load the model and move it to the GPU if one is available
model = BertForTokenClassification.from_pretrained("dbmdz/bert-base-turkish-cased", num_labels=len(ent_idx))
model.to(device)

if not fine_tune:
    # Train only the classifier layer: freeze the BERT encoder
    for param in model.bert.parameters():
        param.requires_grad = False

# One optimizer step per batch, so steps per epoch = examples / batch size
steps_per_epoch = len(train_input) // BATCH_SIZE
num_training_steps = steps_per_epoch * EPOCHS
num_warmup_steps = int(num_training_steps * 0.4)

# Initialize the optimizer (to reproduce BertAdam-specific behavior, set correct_bias=False)
optimizer = AdamW(model.parameters(), lr=5e-5, correct_bias=False)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=num_warmup_steps,
                                            num_training_steps=num_training_steps)  # PyTorch scheduler

# For printing under progress bars
description_train = tqdm(total=0, position=1, bar_format='{desc}')
description_val = tqdm(total=0, position=1, bar_format='{desc}')

train_losses = []

# TRAINING LOOP
for epoch in range(EPOCHS):
    print("{:d}/{:d} EPOCH".format(epoch + 1, EPOCHS))
    model.train()
    tr_loss = 0
    nb_tr_examples, nb_tr_steps = 0, 0
    # Training progress bar
    pbar_train = tqdm(total=len(train_input), leave=False)
    # TRAINING STEP
    for step, batch in enumerate(train_dataloader):
        # move the batch to the device
        input_batch, input_mask_batch, labels_batch = [b.to(device) for b in batch]
        # forward pass
        loss, _ = model(input_batch, token_type_ids=None,
                        attention_mask=input_mask_batch, labels=labels_batch)
        # backprop
        loss.backward()
        # track train loss
        tr_loss += loss.item()
        nb_tr_examples += input_batch.size(0)
        nb_tr_steps += 1
        # gradient clipping to avoid exploding gradients
        torch.nn.utils.clip_grad_norm_(parameters=model.parameters(), max_norm=MAX_GRAD_NORM)
        # update parameters
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
        # increment the progress bar
        pbar_train.update(BATCH_SIZE)
        train_loss = tr_loss / nb_tr_steps
        description_train.set_description_str(f"Epoch: {epoch + 1}/{EPOCHS} - Loss: {train_loss}")
    # Print train loss per epoch
    print("Train loss: {:.5f}".format(tr_loss / nb_tr_steps))
    train_losses.append(tr_loss / nb_tr_steps)
    pbar_train.close()

    # VALIDATION STEP
    model.eval()
    eval_loss, eval_accuracy = 0, 0
    nb_eval_steps, nb_eval_examples = 0, 0
    predictions, true_labels = [], []
    # Validation progress bar
    pbar_val = tqdm(total=len(val_input), leave=True)
    for batch in val_dataloader:
        # move the batch to the device
        input_batch, input_mask_batch, labels_batch = [t.to(device) for t in batch]
        with torch.no_grad():
            outputs = model(input_batch, token_type_ids=None,
                            attention_mask=input_mask_batch, labels=labels_batch)
            temp_eval_loss, logits = outputs[:2]
        logits = logits.detach().cpu().numpy()
        label_ids = labels_batch.to("cpu").numpy()
        # note: every position is collected here, including padding
        predictions.extend([list(p) for p in np.argmax(logits, axis=2)])
        true_labels.append(label_ids)
        temp_eval_accuracy = flat_accuracy(logits, label_ids)
        eval_loss += temp_eval_loss.mean().item()
        eval_accuracy += temp_eval_accuracy
        nb_eval_examples += input_batch.size(0)
        nb_eval_steps += 1
        pbar_val.update(BATCH_SIZE)
    eval_loss = eval_loss / nb_eval_steps
    val_accuracy = eval_accuracy / nb_eval_steps
    predicted_entities = [uniq_ent[p_i] for p in predictions for p_i in p]
    valid_entities = [uniq_ent[l_ii] for l in true_labels for l_i in l for l_ii in l_i]
    f1_score_val = f1_score(valid_entities, predicted_entities)
    description_val.set_description_str(
        f"Validation Scores -> Loss : {eval_loss} , Accuracy : {val_accuracy} , F1-Score : {f1_score_val}")
    pbar_val.close()
The code above didn't converge: I am stuck at around 0.020... loss and an F1 score around 0.40. I also tried your berturk.json file with FARM, using run_experiment as shown below:
from farm.experiment import run_experiment, load_experiments
experiments = load_experiments("/home/karaz/Desktop/BERT_NER/FARM_TransferLearning/berturk.json")
run_experiment(experiments[0])
I can go with FARM as well, but when I executed the code above for FARM, the F1 score was 55%. What am I missing here? :) Thanks
Sorry to bother you this much, but I also tried different datasets and still had no luck.
Hi @kemalaraz,
I'm currently working on an evaluation for the WikiANN (balanced) dataset, so you could use this dataset as well to test the implementation:
The dataset can be retrieved from here, and it needs to be pre-processed, e.g. with:
import sys

filename = sys.argv[1]

with open(filename, "rt") as f_p:
    for line in f_p:
        line = line.rstrip()
        if not line:
            print("")
            continue
        token, label = line.split("\t")
        assert token.startswith("tr:")
        print(token[3:], label)
Just run python3 preprocess.py train > train.txt and do the same for dev and test.
It's important that the final dataset format is:
Büyük B-ORG
Ermenistan I-ORG
kurma O
girişimleri O
sona O
ermiştir O
. O
...
There's one token/label pair per line, delimited by a space. An empty line denotes a sentence boundary.
Just make sure that the dataset format is OK.
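For a quick sanity check, here is a minimal sketch that reads this format back in; the helper name read_sentences and the final count printout are just for illustration:

import sys

# Minimal sketch: parse the space-delimited "token label" format;
# an empty line closes the current sentence.
def read_sentences(path):
    sentences, tokens, labels = [], [], []
    with open(path, "rt") as f_p:
        for line in f_p:
            line = line.rstrip()
            if not line:
                if tokens:
                    sentences.append((tokens, labels))
                    tokens, labels = [], []
                continue
            token, label = line.split(" ")
            tokens.append(token)
            labels.append(label)
    if tokens:  # in case the file does not end with an empty line
        sentences.append((tokens, labels))
    return sentences

print(len(read_sentences(sys.argv[1])), "sentences")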
The 55% depends on the dataset: e.g. if your dataset is not balanced, or very noisy (like the WNUT NER datasets), this would explain the bad result. However, you should try to train on bert-base-multilingual-cased as well :)
Thank you so much for your interest, I'll try your suggestions and get back to you. By the way, I got the 55% with the dataset you mentioned in another issue, where you said you planned to evaluate the model for NER with https://github.com/UKPLab/linspector/tree/master/extrinsic. I used that set and parsed it one word and label at a time, but got bad results. In addition, the dataset should look like "Büyük B-ORG", but when preparing the input for BERT it should be "This is a sentence" as input and "O O ..." as labels, right?
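A rough sketch of the alignment in question, assuming word-level labels have to be expanded to BERT's WordPiece tokens (labeling only the first subword is one common convention, not necessarily what create_bert_data does):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")

words = ["Büyük", "Ermenistan", "kurma", "girişimleri"]
word_labels = ["B-ORG", "I-ORG", "O", "O"]

tokens, labels = [], []
for word, label in zip(words, word_labels):
    subwords = tokenizer.tokenize(word)
    tokens.extend(subwords)
    # label the first subword; mark the remaining pieces so they can be
    # excluded from the loss later
    labels.extend([label] + ["X"] * (len(subwords) - 1))

print(list(zip(tokens, labels)))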
I am sorry, but that dataset didn't work either. I hope you upload your trained NER model to this repo; otherwise it seems impossible to replicate your results.
Thanks
Hi @kemalaraz,
could you try to reproduce the NER results with the following commands:
Just clone the latest version of Transformers, install it via pip install -e ., and put the following bash script in the examples/ner folder:
mkdir tr-data
cd tr-data
for file in train.txt dev.txt test.txt labels.txt
do
wget https://schweter.eu/storage/turkish-bert-wikiann/$file
done
cd ..
It will download the pre-processed datasets with training, dev and test splits and put them in a tr-data folder.
After downloading the dataset, fine-tuning can be started. Just set the following environment variables:
export MAX_LENGTH=128
export BERT_MODEL=dbmdz/bert-base-turkish-cased
export OUTPUT_DIR=tr-model
export BATCH_SIZE=32
export NUM_EPOCHS=3
export SAVE_STEPS=625
export SEED=1
Then run fine-tuning:
python3 run_ner.py --data_dir ./tr-data \
--model_type bert \
--labels ./tr-data/labels.txt \
--model_name_or_path $BERT_MODEL \
--output_dir $OUTPUT_DIR-$SEED \
--max_seq_length $MAX_LENGTH \
--num_train_epochs $NUM_EPOCHS \
--per_gpu_train_batch_size $BATCH_SIZE \
--save_steps $SAVE_STEPS \
--seed $SEED \
--do_train \
--do_eval \
--do_predict \
--fp16
This will fine-tune the Turkish BERT model for 3 epochs. Afterwards, the results can be displayed with:
cat tr-model-1/eval_results.txt
cat tr-model-1/test_results.txt
In my experiment I got 92.60% on the development set and 92.19% on the test set.
I hope I can upload fine-tuned models this week.
I ran the same experiment as Stefan to reproduce the results. However, I got the error ImportError: cannot import name 'EvalPrediction' from 'transformers'. But I'll keep going and share the results soon.
Works like a charm, cheers bud. However, I still wonder which dataset you used to get 95% on the NER task?
In my experiments, I got similar results to @stefan-it, as follows:
Eval results:
precision = 0.916400580551524
recall = 0.9342309684101502
f1 = 0.9252298787412536
loss = 0.11335893666411284
Test results:
precision = 0.9192058759362955
recall = 0.9303010230367262
f1 = 0.9247201697271198
loss = 0.11182546521618497
When I load the trained model with:

import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

label_list = ["B-LOC", "B-ORG", "B-PER", "I-LOC", "I-ORG", "I-PER", "O"]

model = AutoModelForTokenClassification.from_pretrained("./transformers/examples/ner/tr-model-1/checkpoint-1875")
tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")

# tokenize and encode the same sentence so that tokens and predictions stay aligned
tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sentences[4])))
inputs = tokenizer.encode(sentences[4], return_tensors="pt")

outputs = model(inputs)[0]
predictions = torch.argmax(outputs, dim=2)

print([(token, label_list[prediction]) for token, prediction in zip(tokens, predictions[0].tolist())])
In the config file, label2id is:
"label2id": {
"B-LOC": 0,
"B-ORG": 1,
"B-PER": 2,
"I-LOC": 3,
"I-ORG": 4,
"I-PER": 5,
"O": 6
}
and when I test it on another dataset, I am getting really bad results. I can give you more information if you want, or share the model and related files with you. Can you please help me with that?
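One thing worth double-checking in the snippet above: the hard-coded label_list has to match this label2id ordering exactly. A minimal sketch (same checkpoint path as above) that derives the list from the model's own config instead:

from transformers import AutoModelForTokenClassification

# Sketch: read the id -> label mapping from the fine-tuned checkpoint's config
# instead of hard-coding it, so the ordering cannot drift.
model = AutoModelForTokenClassification.from_pretrained(
    "./transformers/examples/ner/tr-model-1/checkpoint-1875")
id2label = {int(i): label for i, label in model.config.id2label.items()}
label_list = [id2label[i] for i in range(len(id2label))]
print(label_list)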
Could you post an example sentence, or a few sentences, from your dataset so that I can test it? 🤔
I trained and uploaded a fine-tuned model to the Transformers model hub as savasy/bert-base-turkish-ner-cased. Please check the following code:
from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer
model = AutoModelForTokenClassification.from_pretrained("savasy/bert-base-turkish-ner-cased")
tokenizer = AutoTokenizer.from_pretrained("savasy/bert-base-turkish-ner-cased")
ner = pipeline('ner', model=model, tokenizer=tokenizer)
ner("Mustafa Kemal Atatürk 19 Mayıs 1919'da Samsun'a ayak bastı.")
The output looks like this:
[{'word': 'Mustafa', 'score': 0.9938516616821289, 'entity': 'B-PER'},
 {'word': 'Kemal', 'score': 0.9881671071052551, 'entity': 'I-PER'},
 {'word': 'Atatürk', 'score': 0.9957979321479797, 'entity': 'I-PER'},
 {'word': 'Samsun', 'score': 0.9059973359107971, 'entity': 'B-LOC'}]
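As a side note, depending on the installed transformers version, the pipeline may also merge the pieces of multi-token entities for you; the grouped_entities flag below assumes a sufficiently recent release:

from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer

model = AutoModelForTokenClassification.from_pretrained("savasy/bert-base-turkish-ner-cased")
tokenizer = AutoTokenizer.from_pretrained("savasy/bert-base-turkish-ner-cased")

# grouped_entities merges consecutive B-/I- pieces into single entity spans
ner = pipeline('ner', model=model, tokenizer=tokenizer, grouped_entities=True)
print(ner("Mustafa Kemal Atatürk 19 Mayıs 1919'da Samsun'a ayak bastı."))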
I got the dataset from https://github.com/UKPLab/linspector/blob/master/extrinsic/data/ner/tr/test.txt and am getting bad results. @savasy, thank you for your interest and for the example :) I trained the model and there is no problem with that; I am trying the model on various datasets that are unseen to the model and not getting good results, but I will double-check everything again when I have time. But again, thank you :)
The NER dataset in the linspector repo was automatically annotated; I just found the paper: https://arxiv.org/abs/1702.02363
When looking at the test set, I could find some tagging errors:
Tirol X B-X B-LOCATION
ve X B-X O
Vorarlberg'i X B-X O
Avusturya X B-X B-LOCATION
İmparatorluğu'na X B-X O
bırakan X B-X O
Bavyera X B-X B-LOCATION
, X B-X O
Aschaffenburg X B-X B-MISC
ile X B-X O
Hessen X B-X B-LOCATION
Darmstadt'ın X B-X O
bir X B-X O
kısmını X B-X O
elde X B-X O
etti X B-X O
. X B-X O
So, e.g., "Aschaffenburg" is clearly a location, and "Darmstadt" and "Vorarlberg" are as well.
@savasy Thanks for uploading the model :+1: I just used it to tag the sentence, and "Aschaffenburg", "Darmstadt" and "Vorarlberg" are tagged as locations.
You are welcome, @stefan-it @kemalaraz. And "İmparatorluğu" must be I-LOCATION as well. It seems the linspector dataset could be misleading. Can you share another dataset here, @kemalaraz and @stefan-it, so that we can train/test on it?
You can get it from this link; however, I haven't tried the model on that dataset. I am working on a different project right now; I will write a parser for the ENAMEX format and then try that.
nerdata.txt
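For what it's worth, a rough sketch of such an ENAMEX-to-CoNLL conversion; the markup assumed here (&lt;ENAMEX TYPE="..."&gt;...&lt;/ENAMEX&gt; inline, one sentence per line) may differ from the actual file:

import re
import sys

# Rough sketch: emit one "token label" pair per line, with an empty line
# between sentences, from ENAMEX-style inline markup.
PATTERN = re.compile(r'<ENAMEX TYPE="([^"]+)">([^<]+)</ENAMEX>|(\S+)')

with open(sys.argv[1], "rt") as f_p:
    for sentence in f_p:
        for ent_type, ent_text, plain in PATTERN.findall(sentence.strip()):
            if plain:
                print(plain, "O")
            else:
                words = ent_text.split()
                print(words[0], "B-" + ent_type)
                for word in words[1:]:
                    print(word, "I-" + ent_type)
        print()  # sentence boundary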
Thank you, that is plenty of data: 10889 TYPE="LOCATION", 10031 TYPE="ORGANIZATION", 16293 TYPE="PERSON".
I will give it a try
The performance on the data given by @kemalaraz is as follows:
savas@savas-lenova:~/Desktop/trans/tr-new-model-1$ cat eval_results.txt
precision = 0.9461980692049029
recall = 0.959309358847465
f1 = 0.9527086063783312
loss = 0.037054269206847804
savas@savas-lenova:~/Desktop/trans/tr-new-model-1$ cat test_results.txt
precision = 0.9458370635631155
recall = 0.9588201928530913
f1 = 0.952284378344882
loss = 0.035431676572445225
Hi @kemalaraz and @stefan-it
Where did you get this NER dataset that Kemal shared below?
https://github.com/stefan-it/turkish-bert/files/4558187/nerdata.txt
Is it from the paper below? https://www.aclweb.org/anthology/P11-3019.pdf
Hi @savasy, I checked that dataset and it should be identical to the one that I've used for the NER experiments :)
It seems the Hugging Face repository contains only the base model; I couldn't find the NER model and its tokenizer. Where can I find the trained NER model and, if it is not too much to ask, how can I load and use it easily?