kulukimak / dkpro-core-asl

Automatically exported from code.google.com/p/dkpro-core-asl

Integrate Bart Coreference version 2.0 #258

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
I got this mail via the corpora list and it sounds interesting.

---

From: Olga Uryupina <uryupina@gmail.com>
Subject: [Corpora-List] BART coreference resolver: v2.0 released
To: "corpora@uib.no" <corpora@uib.no>

Dear CorporaList members,

BART (Beautiful Anaphora Resolution Toolkit) version 2.0 has been
released. This new version of the BART toolkit for developing anaphora
resolution systems is much improved over the previous semi-official
version released in 2008. Please visit the BART website to get the
latest version:

www.bart-anaphora.org

Please note the new name of our website! Unfortunately, the old
website (bart-coref.org) has been taken over by a third party, so
please remove it from your bookmarks.

More on BART:

BART, the Beautiful Anaphora Resolution Toolkit, was initially
developed during the project Exploiting Lexical and Encyclopedic
Resources For Entity Disambiguation at the Johns Hopkins Summer
Workshop 2007, but has since been constantly revised. BART performs
automatic coreference resolution (multilingual), including all
necessary preprocessing steps to operate on raw text input (for
English only). BART incorporates a variety of machine learning
approaches and can use several machine learning toolkits, including
WEKA and an included MaxEnt implementation. BART has shown
state-of-the-art performance in recent evaluation campaigns
(SemEval-2010 Task 1, CoNLL-2011, CoNLL-2012).

BART is an open-source Java toolkit and runs on Linux, Mac OS, and Windows.

What's new in the second release:

- 2 models for the out-of-the-box coreference for English (trained on OntoNotes)
- Generic Language Plugin supports coreference resolution in your
favourite language
- Italian and German Language Plugins
- More features
- More Preprocessing Pipelines
- Tabular import/export format
- Presets for straightforward testing
- Bugs fixed

BART development team

Original issue reported on code.google.com by nico.erbs@gmail.com on 1 Oct 2013 at 12:36

GoogleCodeExporter commented 9 years ago
I downloaded it.
For English, it does not seem to make sense to integrate it in DKPro Core, 
since breaking up the BART end-to-end system for English will lead to a 
decrease in performance.

For other languages (e.g. German), quite some effort seems to be required to 
ensure the minimum level of preprocessing needed to run BART.

Original comment by eckle.kohler on 1 Oct 2013 at 1:10

GoogleCodeExporter commented 9 years ago
I don't understand why it does not make sense to integrate it.

Original comment by torsten....@gmail.com on 1 Oct 2013 at 1:12

GoogleCodeExporter commented 9 years ago
*SEEM*

Original comment by eckle.kohler on 1 Oct 2013 at 1:15

GoogleCodeExporter commented 9 years ago
Ok, let me rephrase:
As you downloaded the tool and had a look: why do you think it does not _seem_
to make sense to integrate it?

Original comment by torsten....@gmail.com on 1 Oct 2013 at 1:20

GoogleCodeExporter commented 9 years ago
I can imagine that it would be quite some effort to integrate it. But I'm
rather curious why a decrease in performance is to be expected. I assume
you mean "performance" in the sense of quality, not speed. The effort aside,
I think it would be nice in any case to integrate it, because we simply
don't have much for coreference yet. I think (hope) that most of the
preprocessing required should already be present in DKPro Core.

Original comment by richard.eckart on 1 Oct 2013 at 1:37

GoogleCodeExporter commented 9 years ago
Torsten wrote:
>> why do you think it does not _seem_ to make sense to integrate it?
My comment relates to the effort associated with integrating it. For English,
we also have the Stanford Coreference Resolver, which is state-of-the-art.
So the question is: is it worth the effort for English?

Richard wrote:
>> I assume you mean "performance" in the sense of quality, not speed.
right

>> I think (hope) that most of the preprocessing required should already be 
present in DKPro Core.

For German, the morphosyntactic preprocessing is not well covered right now.
BART makes heavy use of the morphosyntactic properties gender, number and case.

I think we would have to create an appropriate type for this kind of
information first.

Original comment by eckle.kohler on 3 Oct 2013 at 8:02

GoogleCodeExporter commented 9 years ago
Some background on the performance on other languages:

"a language-agnostic system (designed primarily for English) can achieve a
performance level in the high forties (MUC F-score) when retrained and tested
on a new language, at least on gold mention boundaries. Though this number
might appear low, note that it is a baseline requiring no extra engineering."
see http://www.lrec-conf.org/proceedings/lrec2010/pdf/755_Paper.pdf

This performance is really low. I would not want to use such a component.

Original comment by eckle.kohler on 3 Oct 2013 at 8:06

GoogleCodeExporter commented 9 years ago
I believe the Stanford coreferencer also uses such information (gender, etc.),
but it brings its own resources for these things. I'd tend to try feeding Bart
with anything that we can already produce (token, pos, lemma, named entity,
etc.) and let Bart handle all the things we cannot yet produce (e.g. gender and
so on). Getting anything to work would already be quite nice; factoring out
additional steps could happen afterwards.

Btw. I also downloaded Bart 2.0 and had a brief look. I still have no idea
where to hook in :( It seems the code does not contain only a single component
or pipeline, but is rather a full coreference construction kit, with many things
that one actually wouldn't need just to do the "default" coref resolution.

Original comment by richard.eckart on 3 Oct 2013 at 8:13

GoogleCodeExporter commented 9 years ago
As far as I understand, regarding German, BART does not bring the preprocessing 
resources:

README:
"We do not support preprocessing for languages other than English. So, to run 
BART on another language, you first have to preprocess your data yourself, 
generating all the necessary markable levels, including the "markable" level 
that contains info on the mentions. In sample/generic-min, we show the minimal 
amount of information to be provided to BART to run any experiment. In 
sample/generic-max, we show the same documents, but with much more information 
encoded both via MMAX levels and via attributes on the "markable" level."

...
"Prepare your dataset in the MMAX format, making sure that you include at least 
all the information shown in the sample/generic-min example (that is: tokens in 
Basedata/*words.xml, coreference levels, pos levels, markable levels specifying 
markable_id and span for each markable). "
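To make the required input format more concrete: a minimal MMAX2 Basedata file (the `Basedata/*words.xml` mentioned above) would look roughly like the sketch below, following the standard MMAX2 format. The word IDs are what the markable spans on the other levels refer to. This is an illustrative sketch, not copied from the BART samples; the exact DTD reference and encoding declaration may differ there.

```xml
<?xml version="1.0" encoding="US-ASCII"?>
<!DOCTYPE words SYSTEM "words.dtd">
<words>
  <!-- one element per token; markable spans reference these IDs -->
  <word id="word_1">This</word>
  <word id="word_2">is</word>
  <word id="word_3">a</word>
  <word id="word_4">picture</word>
</words>
```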

So if you have a look at the coreference levels, you find a very rich
annotation (in generic-min!) that would require morphosyntactic annotation as
well as our new SemanticFieldAnnotator (category="concrete"), e.g.:

sample/generic-min/train/markables/wsjarrau_1128_coref_level.xml
<markable id="markable_77" span="word_1..word_4" generic="generic-no" 
person="per3" related_object="no" gram_fnc="subj" number="sing" reference="new" 
category="concrete" mmax_level="coref" gender="neut" min_words="picture" 
min_ids="word_4" coref_set="set_48"/>

Original comment by eckle.kohler on 3 Oct 2013 at 8:27

GoogleCodeExporter commented 9 years ago
Writing out data in any particular MMAX2 dialect to have it processed by Bart 
isn't something I would consider particularly desirable. I mean, there must be 
some way to construct a model in-memory and pass that to whatever parts of Bart 
perform the actual processing.

I'm a bit confused about that file 
(sample/generic-min/train/markables/wsjarrau_1128_coref_level.xml). I would 
expect that the tool comes with pretrained models and that the "coref" layer 
would be the output. 

So I gather, there is not only no pre-processing included for German, but there 
are also no models included for German. In which case, we would need to train 
our own models... ok, even if we wanted to do that, on which data? 
Preprocessing for German producing these features mentioned above is one thing, 
but in addition, we would need gold coreference annotations, right?

One of the main points of integrating many of the tools we integrate is that
we don't have to train models, because these tools already come with models. If
Bart does not work out of the box, then I actually do wonder if it's worth
bothering with it. I guess for English it works at least, doesn't it?

Original comment by richard.eckart on 3 Oct 2013 at 8:42

GoogleCodeExporter commented 9 years ago
>>So I gather, there is not only no pre-processing included for German, but 
there are also no models included for German. In which case, we would need to 
train our own models... ok, even if we wanted to do that, on which data? 
Preprocessing for German producing these features mentioned above is one thing, 
but in addition, we would need gold coreference annotations, right?

I had a look in the GermanLanguagePlugin. This is very basic and partly hacky, 
but could be extended.

There is a dataset for German: http://stel.ub.edu/semeval2010-coref/node/7

- task: Detection of full coreference chains, composed of named entities, 
pronouns, and full noun phrases.
- data: "The data set comes from the TüBa-D/Z Treebank (Hinrichs et al. 2005), 
a German newspaper corpus based on data taken from the daily issues of "die 
tageszeitung" (taz).
Hand-annotated with inflectional morphology, constituent structure, grammatical 
functions, and anaphoric and coreference relations.
Training: 415k words."

>> I guess for English it works at least, doesn't it? 

Right, I just tried it - out of the box, as a web demo.

Original comment by eckle.kohler on 3 Oct 2013 at 9:00

GoogleCodeExporter commented 9 years ago
I'm not sure if I made my doubts regarding the mmax coref layer you previously
referred to sufficiently explicit. I'll try again (though I haven't done any
further investigation yet).

The min example contains three layers: markable, pos, and coref.

The max example contains more layers: markable, chunk, enamex, lemma, morph,
parse, phrase, unit, and coref.

I expect that coref is the output of BART while the other layers are input. So 
I would expect that minimally, BART can work with pos information. If there is 
morphological information in the coref layer, I would expect that to be ignored 
or be generated by BART as part of the processing - but not prior to the 
processing.

The "morph" layer in the max example appears to contain only lemma information 
- in fact, it appears to be the same as the "lemma" layer.

There is a layer "markable" with additional semantic information in the max 
example, but this information is not present in the min example (the layer is 
there however).

So, I suppose that at least for English, it should be possible to get quite far
with the pre-processing components that we have, possibly including the
SemanticFieldAnnotator or something equivalent that may be included directly
with BART (based on WordNet). Since English works out-of-the-box, there may be
some kind of morphological analysis included with BART as well (also based on
WordNet?).

Original comment by richard.eckart on 4 Oct 2013 at 12:33

GoogleCodeExporter commented 9 years ago
I did some code reading and think I have largely understood how BART works. 
This can be discussed in one of the upcoming meetings.

Regarding http://code.google.com/p/dkpro-core-asl/issues/detail?id=258#c12

- you are right regarding the mmax layers
- for English, we will get worse results than BART end-to-end, if we use just 
our preprocessing
- for German, we should employ POS tagging and parsing; but will probably get 
much worse results than for English, because of the German language plugin 
currently provided:

BART is a kind of knowledge-based system, and the German language plugin is
still rather knowledge-poor compared to the English language plugin.

Original comment by eckle.kohler on 4 Oct 2013 at 8:04

GoogleCodeExporter commented 9 years ago

Original comment by richard.eckart on 14 Aug 2014 at 10:05

GoogleCodeExporter commented 9 years ago

Original comment by richard.eckart on 22 Jan 2015 at 10:42