stefan-it opened this issue 5 years ago
Hi Stefan,
thank you very much!
I trained a BiLSTM as implemented in flair's sequence tagger model. I did not manually label anything, but used only the data mentioned in the Jannidis paper, split 70:20:10 into train, test and dev.
The training data are indeed officially available here in two different formats:
The repository figur-dev might also be of interest to you, but it is currently very much work in progress; the code probably doesn't work completely out of the box, and I've been planning to clean it up for quite some time.
In this repository, the corpus is in a third, tab-separated format, which can be loaded smoothly into flair. More details about the corpus itself can be found in this paper.
Here are some stats (the whole project was for a term paper, by the way):
The experiments you suggested are indeed very interesting, especially the fine-tuning, which I haven't even experimented with yet. So far I have tried the following embeddings (as in the flair tutorial) as the first layer. These were my results:
But I did not investigate further why, e.g., the Flair embeddings (FlairEmbeddings("german-forward")) scored so poorly. Maybe resuming training, as you mentioned, would lead to better results?
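If I understand the idea correctly, that would mean continuing to train the character language model behind the FlairEmbeddings on the literary texts. A minimal sketch, roughly following flair's language model tutorial (the corpus path and the hyperparameters here are just placeholders):

from flair.embeddings import FlairEmbeddings
from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus

# take the character LM behind the pre-trained German forward embeddings
language_model = FlairEmbeddings('german-forward').lm

# the corpus directory (placeholder path) is expected to contain a train/
# folder plus valid.txt and test.txt with the raw literary texts
corpus = TextCorpus('resources/corpora/literary-texts',
                    language_model.dictionary,
                    language_model.is_forward_lm,
                    character_level=True)

# resume training on the new texts and save the adapted language model
trainer = LanguageModelTrainer(language_model, corpus)
trainer.train('resources/language_models/literary-forward',
              sequence_length=100,
              mini_batch_size=100,
              learning_rate=20,
              patience=10)

The adapted model should then (if I remember the output file name correctly) be loadable again via FlairEmbeddings('resources/language_models/literary-forward/best-lm.pt') and plugged into the tagger.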
However, my results with this approach were unfortunately not as successful as I had hoped (a mean of only ~73% F1 over 10-fold cross-validation); then again, the baseline of Jannidis et al., with an F1 score of ~90%, was already very high.
The 90% baseline is high, but I'm sure more is possible. I imagine you could get further here with fine-tuning?
And regarding your question: besides named entities, the corpus contains further annotations, namely coreferences (for training coreference resolution) and (in)direct speech (for training a model that automatically recognizes (in)direct speech; if you are interested, there is a DFG-funded project called Redewiedergabe). Now consider the following sentence:
<per id="1">Bob</per> enters the room, <per id="1">he</per> says something to <per id="2">Alice</per>.
Bob has the "Koreferenz per Identität" (coreference by identity) identifier 1, and so does he, because he refers to the same person, Bob. I'm not quite sure, but this might have been a feature for their CRF model.
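To make the identity relation a bit more tangible, here is a quick sketch that groups the annotated mentions of the example sentence above by their id (this uses the simplified markup from the example, not necessarily the exact markup of the corpus):

import re
from collections import defaultdict

sentence = ('<per id="1">Bob</per> enters the room, '
            '<per id="1">he</per> says something to <per id="2">Alice</per>.')

# group every annotated mention by its coreference id
chains = defaultdict(list)
for match in re.finditer(r'<per id="(\d+)">(.*?)</per>', sentence):
    chains[match.group(1)].append(match.group(2))

print(dict(chains))
# {'1': ['Bob', 'he'], '2': ['Alice']}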
By the way, the pipeline that belongs to the Jannidis et al. paper can be found here.
Thanks for your detailed explanation :+1:
Here's what I found out in the last couple of days:
The training data has some flaws:
Most problematic is that the IOB scheme is not used. Instead of using a separate class "word", non-entity tokens should be labeled "O". Here's a small script that generates an IOB-formatted dataset:
import sys

filename = sys.argv[1]

with open(filename, 'rt') as f_p:
    prev_ner = ""
    for line in f_p:
        line = line.rstrip()
        columns = line.split("\t")
        if line and len(columns) == 3:
            # columns = {0: 'id', 1: 'text', 2: 'ner'}
            id_, text, ner = columns
            if ner != "word":
                # first token of an entity gets B-, following tokens get I-
                if prev_ner == ner:
                    new_ner = "I-" + ner
                else:
                    new_ner = "B-" + ner
            else:
                new_ner = "O"
            prev_ner = ner
            print(f"{id_}\t{text}\t{new_ner}")
        else:
            # sentence boundary: reset the previous tag so a new sentence
            # always starts with B-
            prev_ner = ""
            print("")
It also takes care of the B- and I- prefixes. Just call it like python3 preprocess.py input-file > iob-labeled-file.
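For illustration, with made-up tokens but the class names that actually occur in the corpus, the conversion looks like this (columns are tab-separated):

Input:
1	Effi	Core
2	sie	pron
3	sagte	word

Output:
1	Effi	B-Core
2	sie	B-pron
3	sagte	O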
After generating a fully IOB-compatible dataset, I trained a model with:
from flair.data import TaggedCorpus
from flair.data_fetcher import NLPTaskDataFetcher
from flair.embeddings import WordEmbeddings, StackedEmbeddings, FlairEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# define columns
columns = {0: 'id', 1: 'text', 2: 'ner'}

# retrieve corpus using column format, data folder and the names of the train, dev and test files
corpus: TaggedCorpus = NLPTaskDataFetcher.load_column_corpus(".", columns,
                                                             train_file='train.csv',
                                                             test_file='test.csv',
                                                             dev_file='dev.csv')

batch_size = 32
tag_type = 'ner'
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)

embedding_types = [
    WordEmbeddings('de'),
    FlairEmbeddings('german-forward'),
    FlairEmbeddings('german-backward')
]

embeddings: StackedEmbeddings = StackedEmbeddings(embeddings=embedding_types)

tagger: SequenceTagger = SequenceTagger(hidden_size=256,
                                        embeddings=embeddings,
                                        tag_dictionary=tag_dictionary,
                                        tag_type=tag_type,
                                        use_crf=True)

trainer: ModelTrainer = ModelTrainer(tagger, corpus)

trainer.train('resources/taggers/figur-ner',
              learning_rate=0.1,
              mini_batch_size=batch_size,
              max_epochs=500)
I could then achieve an F1-score of 82.16% :) Final output:
2019-05-01 01:14:05,755 loading file resources/taggers/figur-ner/best-model.pt
2019-05-01 01:14:23,381 MICRO_AVG: acc 0.6972 - f1-score 0.8216
2019-05-01 01:14:23,381 MACRO_AVG: acc 0.621 - f1-score 0.7427999999999999
2019-05-01 01:14:23,381 AppA tp: 100 - fp: 86 - fn: 172 - tn: 100 - precision: 0.5376 - recall: 0.3676 - accuracy: 0.2793 - f1-score: 0.4366
2019-05-01 01:14:23,381 AppTdfW tp: 615 - fp: 161 - fn: 178 - tn: 615 - precision: 0.7925 - recall: 0.7755 - accuracy: 0.6447 - f1-score: 0.7839
2019-05-01 01:14:23,382 Core tp: 645 - fp: 66 - fn: 62 - tn: 645 - precision: 0.9072 - recall: 0.9123 - accuracy: 0.8344 - f1-score: 0.9097
2019-05-01 01:14:23,382 pron tp: 2390 - fp: 482 - fn: 422 - tn: 2390 - precision: 0.8322 - recall: 0.8499 - accuracy: 0.7256 - f1-score: 0.8410
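Loading the best checkpoint and tagging a sentence then works roughly like this (a sketch; the example sentence is made up, and newer flair versions may need SequenceTagger.load instead of load_from_file):

from flair.data import Sentence
from flair.models import SequenceTagger

# load the best checkpoint written by the trainer above
tagger = SequenceTagger.load_from_file('resources/taggers/figur-ner/best-model.pt')

# tag a made-up example sentence
sentence = Sentence('Effi sah ihn an und sagte nichts .')
tagger.predict(sentence)
print(sentence.to_tagged_string())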
Btw. my colleague and I just recently found the DROC corpus, but the XML annotations were so.... It's great that you found a way to parse it into a CoNLL-like format :+1:
I just looked at the test dataset size (in sentences) and I think it is too large (normally, I would just use 10% of the total sentences as test data).
I think more time should be spent on preprocessing the dataset (proper sentence splitting, a valid dataset) than on hyperparameter search for now :) It would be great if you could integrate the corpus + preprocessing into this repository then :)
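For the sentence splitting part, one option could be segtok, which flair itself uses under the hood; a minimal sketch (the example text is made up):

from flair.data import Sentence
from segtok.segmenter import split_single

text = "Effi sah ihn an. Dann sagte sie etwas."

# split the raw text into sentences and wrap them as flair Sentence objects
sentences = [Sentence(part, use_tokenizer=True) for part in split_single(text)]
print([s.to_plain_string() for s in sentences])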
Thanks a lot! I'm going to start working on this again in the next few weeks and will implement your feedback and experiment with the Flair embeddings again to release a new, improved model.
Hi @severinsimmler,
thanks for creating this very interesting project!
I just browsed through the repository and looked at the paper, and I have some questions. Did you train the model on the data that is mentioned in the Jannidis paper, or did you manually label some texts?
Is the training data officially available and where can I obtain it?
I think a really good experiment would be the following:
a) use the large amount of literary text data and "resume" training of the Flair embeddings
b) use the BERT multilingual model + fine-tune it on the literary text data (a rough sketch below)
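For (b), a first step (without any domain fine-tuning yet) could simply be to add the multilingual BERT model to the embedding stack; a rough sketch:

from flair.embeddings import BertEmbeddings, FlairEmbeddings, StackedEmbeddings, WordEmbeddings

# stack word, multilingual BERT and Flair embeddings as tagger input;
# fine-tuning BERT itself on the literary texts would be a separate step
embedding_types = [
    WordEmbeddings('de'),
    BertEmbeddings('bert-base-multilingual-cased'),
    FlairEmbeddings('german-forward'),
    FlairEmbeddings('german-backward'),
]
embeddings = StackedEmbeddings(embeddings=embedding_types)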
From the Jannidis paper I did not quite get what this exactly means:
I'm really looking forward to seeing a training example 🤗
Thanks in advance,
Stefan