severinsimmler / figur

Figurenerkennung (character recognition) for German literary texts

Training data #2

Open · stefan-it opened this issue 5 years ago

stefan-it commented 5 years ago

Hi @severinsimmler,

thanks for creating this very interesting project!

I just browsed through the repository and looked at the paper, and I have some questions. Did you train the model on the data that is mentioned in the Jannidis paper, or did you manually label some texts?

Is the training data officially available and where can I obtain it?

I think a really good experiment would be the following:

a) use the large amount of literary text data and "resume" training of the Flair embeddings (see the sketch after this list)
b) use the multilingual BERT model and fine-tune it on the literary text data
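
For (a), here is a rough sketch of what resuming the Flair embedding training could look like, assuming flair 0.4.x and a hypothetical corpus folder laid out as in flair's language model tutorial (a train/ directory with split files, plus valid.txt and test.txt):

from flair.embeddings import FlairEmbeddings
from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus

# start from the pre-trained German forward LM instead of training from scratch
language_model = FlairEmbeddings('german-forward').lm

# hypothetical folder: expects train/ (split files), valid.txt and test.txt
corpus = TextCorpus('corpora/literary-texts',
                    language_model.dictionary,
                    language_model.is_forward_lm,
                    character_level=True)

# continue training the existing LM on the literary text data
trainer = LanguageModelTrainer(language_model, corpus)
trainer.train('resources/language-models/literary-forward',
              sequence_length=100,
              mini_batch_size=100,
              learning_rate=20,
              patience=10)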

From the Jannidis paper, I did not quite get what exactly this means:

"Koreferenz per Identität (ID)" [coreference by identity], i.e. all references to the same character receive the same graphically displayed ID.

I'm really looking forward to seeing a training example 🤗

Thanks in advance,

Stefan

severinsimmler commented 5 years ago

Hi Stefan,

thank you very much!

I trained a biLSTM as implemented in flair's SequenceTagger model. I did not manually label anything, but used only the data mentioned in the Jannidis paper, split 70:20:10 into train, test and dev.

The training data are indeed officially available here in two different formats.

The repository figur-dev might also be of interest to you, but it's currently very much a work in progress: the code probably doesn't work completely out of the box. I've been planning to clean it up for quite some time.

In this repository, the corpus is in a third, tab-separated format, which can be loaded smoothly into flair. More details about the corpus itself can be found in this paper.
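
To give an idea of that format (the tokens here are made up; the tag names, including word for non-entity tokens, are the ones used in the corpus):

1	Effi	Core
2	saß	word
3	am	word
4	Fenster	word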

Here are some stats (the whole project was for a term paper, by the way): [image: corpus statistics]

The experiments you suggested are indeed very interesting, especially the fine-tuning, which I haven't even experimented with yet. So far I have tried the following:

[screenshot from 2019-04-30: results of the embedding setups tried so far]

But I did not investigate further why, for example, the Flair embeddings (FlairEmbeddings("german-forward")) scored so poorly. Maybe resuming training as you suggested would lead to better results?

However, my results with this approach were unfortunately not as successful as I had hoped (10-fold cross-validation with a mean of only ~73% F1). The baseline of Jannidis et al., an F1 score of ~90%, was already very high, but I'm sure there's more to be gained; I imagine fine-tuning could push it further.

And regarding your question: besides named entities, the corpus contains further annotations, namely coreferences (for training coreference resolution) and (in)direct speech (for training a model that automatically recognizes speech; if you are interested, there is a DFG-funded project on this called Redewiedergabe). Now consider the following sentence:

<per id="1">Bob</per> enters the room, <per id="1">he</per> says something to <per id="2">Alice</per>.

Bob receives the "Koreferenz per Identität" ID 1, and so does the pronoun he, because it refers to the same character, Bob. I'm not quite sure, but this might have been a feature for their CRF model.
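
Purely to illustrate how these IDs group mentions into coreference chains (this is not part of the actual pipeline), a tiny sketch:

import re
from collections import defaultdict

sentence = ('<per id="1">Bob</per> enters the room, '
            '<per id="1">he</per> says something to <per id="2">Alice</per>.')

# group all mentions by their coreference ID
chains = defaultdict(list)
for mention in re.finditer(r'<per id="(\d+)">(.*?)</per>', sentence):
    chains[mention.group(1)].append(mention.group(2))

print(dict(chains))  # {'1': ['Bob', 'he'], '2': ['Alice']}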

By the way, the pipeline that belongs to the Jannidis et al. paper can be found here.

stefan-it commented 5 years ago

Thanks for your detailed explanation :+1:

Here's what I found out in the last couple of days:

The training data has some flaws:

Most problematic is that the IOB scheme is not used. Instead of a dedicated class for "word", non-entity tokens should be labeled "O". Here's a small script that generates an IOB-formatted dataset:

import sys

filename = sys.argv[1]

with open(filename, 'rt') as f_p:
    prev_ner = ""
    for line in f_p:
        line = line.rstrip()
        columns = line.split("\t")
        if line and len(columns) == 3:
            # columns = {0: 'id', 1: 'text', 2: 'ner'}
            id_, text, ner = columns
            if ner != "word":
                # same tag as the previous token: inside an entity span
                if prev_ner == ner:
                    new_ner = "I-" + ner
                else:
                    new_ner = "B-" + ner
            else:
                # "word" is the non-entity class, map it to "O"
                new_ner = "O"
            prev_ner = ner

            print(f"{id_}\t{text}\t{new_ner}")
        else:
            # blank line = sentence boundary: reset the previous tag
            # so a new sentence never starts with an I- label
            prev_ner = ""
            print("")

The script also takes care of the B- and I- prefixes. Just call it like python3 preprocess.py input-file > iob-labeled-file.
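
For example, with made-up tokens, the script turns

1	Effi	Core
2	Briest	Core
3	lächelte	word

into

1	Effi	B-Core
2	Briest	I-Core
3	lächelte	O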

After generating a fully IOB-compatible dataset, I trained a model with:

from flair.data import TaggedCorpus
from flair.data_fetcher import NLPTaskDataFetcher
from flair.embeddings import WordEmbeddings, FlairEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# define columns
columns = {0: 'id', 1: 'text', 2: 'ner'}

# retrieve corpus using column format, data folder and the names of the train, dev and test files
corpus: TaggedCorpus = NLPTaskDataFetcher.load_column_corpus(".", columns,
                                                             train_file='train.csv',
                                                             test_file='test.csv',
                                                             dev_file='dev.csv')

batch_size = 32
tag_type = 'ner'
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)

# stack classic word embeddings with forward and backward Flair embeddings
embedding_types = [
    WordEmbeddings('de'),
    FlairEmbeddings('german-forward'),
    FlairEmbeddings('german-backward')
]

embeddings: StackedEmbeddings = StackedEmbeddings(embeddings=embedding_types)

tagger: SequenceTagger = SequenceTagger(hidden_size=256,
                                        embeddings=embeddings,
                                        tag_dictionary=tag_dictionary,
                                        tag_type=tag_type,
                                        use_crf=True)

trainer: ModelTrainer = ModelTrainer(tagger, corpus)

trainer.train('resources/taggers/figur-ner',
              learning_rate=0.1,
              mini_batch_size=batch_size,
              max_epochs=500)
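
For reference, a quick sanity check of the trained model could look like this (a sketch; load_from_file is the flair 0.4.x API, newer versions use SequenceTagger.load):

from flair.data import Sentence
from flair.models import SequenceTagger

# load the best model written by the trainer above
tagger = SequenceTagger.load_from_file('resources/taggers/figur-ner/best-model.pt')

sentence = Sentence('Effi saß am Fenster .')
tagger.predict(sentence)
print(sentence.to_tagged_string())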

I could achieve an F1-score of 82.16% then :) Final output:

2019-05-01 01:14:05,755 loading file resources/taggers/figur-ner/best-model.pt
2019-05-01 01:14:23,381 MICRO_AVG: acc 0.6972 - f1-score 0.8216
2019-05-01 01:14:23,381 MACRO_AVG: acc 0.621 - f1-score 0.7427999999999999
2019-05-01 01:14:23,381 AppA       tp: 100 - fp: 86 - fn: 172 - tn: 100 - precision: 0.5376 - recall: 0.3676 - accuracy: 0.2793 - f1-score: 0.4366
2019-05-01 01:14:23,381 AppTdfW    tp: 615 - fp: 161 - fn: 178 - tn: 615 - precision: 0.7925 - recall: 0.7755 - accuracy: 0.6447 - f1-score: 0.7839
2019-05-01 01:14:23,382 Core       tp: 645 - fp: 66 - fn: 62 - tn: 645 - precision: 0.9072 - recall: 0.9123 - accuracy: 0.8344 - f1-score: 0.9097
2019-05-01 01:14:23,382 pron       tp: 2390 - fp: 482 - fn: 422 - tn: 2390 - precision: 0.8322 - recall: 0.8499 - accuracy: 0.7256 - f1-score: 0.8410
stefan-it commented 5 years ago

Btw. my colleague and I also recently found the DROC corpus, but the XML annotations were so... well, it's great that you found a way to parse them into a CoNLL-like format :+1:

I just looked at the test dataset size (in sentences) and I think it is too large (normally, I would use just 10% of the total sentences as test data).

I think more time should now be spent on preprocessing the dataset (proper sentence splitting, a valid dataset) than on hyperparameter search :) It would be great if you could then integrate the corpus + preprocessing into this repository :) A possible starting point for the re-split is sketched below.
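
A possible starting point for such a re-split, assuming blank lines separate sentences (the 80/10/10 ratio and the file names are just placeholders):

import random

def resplit(filename, seed=42):
    # read the CoNLL-style file and collect blank-line separated sentence blocks
    with open(filename, encoding="utf-8") as f:
        sentences = [block for block in f.read().split("\n\n") if block.strip()]

    # shuffle reproducibly, then cut into 80/10/10 train/dev/test
    random.seed(seed)
    random.shuffle(sentences)
    n = len(sentences)
    splits = {
        "train.csv": sentences[: int(0.8 * n)],
        "dev.csv": sentences[int(0.8 * n): int(0.9 * n)],
        "test.csv": sentences[int(0.9 * n):],
    }
    for name, blocks in splits.items():
        with open(name, "w", encoding="utf-8") as out:
            out.write("\n\n".join(blocks) + "\n")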

severinsimmler commented 5 years ago

Thanks a lot! I'm going to start working on this again in the next few weeks; I'll incorporate your feedback and experiment with the Flair embeddings again to release a new, improved model.