GoogleCodeExporter opened 9 years ago
I downloaded it.
For English, it does not seem to make sense to integrate it in DKPro Core,
since breaking up the BART end-to-end system for English will lead to a
decrease in performance.
For other languages (e.g. German), quite some effort seems to be required to
ensure the minimum level of preprocessing needed to run BART.
Original comment by eckle.kohler
on 1 Oct 2013 at 1:10
I don't understand why it does not make sense to integrate it.
Original comment by torsten....@gmail.com
on 1 Oct 2013 at 1:12
*SEEM*
Original comment by eckle.kohler
on 1 Oct 2013 at 1:15
Ok, let me rephrase:
Since you downloaded the tool and had a look, why do you think it does not _seem_
to make sense to integrate it?
Original comment by torsten....@gmail.com
on 1 Oct 2013 at 1:20
I can imagine that it would be quite some effort to integrate it. But I'm
rather curious why a decrease in performance is to be expected. I assume
you mean "performance" in the sense of quality, not speed. Effort aside,
I think it would be nice in any case to integrate it, because we simply
don't have much for coreference yet. I think (hope) that most of the
preprocessing required should already be present in DKPro Core.
Original comment by richard.eckart
on 1 Oct 2013 at 1:37
Torsten wrote:
>> why do you think it does not _seem_ to make sense to integrate it?
My comment relates to the effort associated with integrating it. For English,
we also have the Stanford Coreference Resolver, which is state-of-the-art.
So the question is: is it worth the effort for English?
Richard wrote:
>> I assume you mean "performance" in the sense of quality, not speed.
right
>> I think (hope) that most of the preprocessing required should already be
present in DKPro Core.
For German, the morphosyntactic preprocessing is not well covered right now.
BART makes heavy use of the morphosyntactic properties gender, number and case.
I think we would have to create an appropriate type for this kind of
information first.
Original comment by eckle.kohler
on 3 Oct 2013 at 8:02
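For illustration, here is a minimal sketch of the kind of annotation type such
morphosyntactic information might require; the class and field names are
hypothetical and not part of any existing DKPro Core type system.

// Hypothetical annotation type holding the morphosyntactic properties
// BART relies on (gender, number, case) for a token span.
// Names are illustrative only, not an existing DKPro Core type.
public class MorphosyntacticFeatures {
    private final int begin;          // character offset where the token starts
    private final int end;            // character offset where the token ends
    private final String gender;      // e.g. "masc", "fem", "neut"
    private final String number;      // e.g. "sing", "plur"
    private final String grammCase;   // e.g. "nom", "gen", "dat", "acc"

    public MorphosyntacticFeatures(int begin, int end, String gender,
            String number, String grammCase) {
        this.begin = begin;
        this.end = end;
        this.gender = gender;
        this.number = number;
        this.grammCase = grammCase;
    }

    public int getBegin() { return begin; }
    public int getEnd() { return end; }
    public String getGender() { return gender; }
    public String getNumber() { return number; }
    public String getCase() { return grammCase; }
}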
Some background on the performance on other languages:
"a language-agnostic
system (designed primarily for English) can achieve a per-
formance level in high forties (MUC F-score) when re-
trained and tested on a new language, at least on gold
mention boundaries. Though this number might appear
low, note that it is a baseline requiring no extra engineer-
ing." see http://www.lrec-conf.org/proceedings/lrec2010/pdf/755_Paper.pdf
This performance is really low. I would not want to use such a component.
Original comment by eckle.kohler
on 3 Oct 2013 at 8:06
I believe the Stanford coreferencer also uses such information (gender, etc.),
but it brings its own resources for these things. I'd tend to try feeding BART
with anything that we can already produce (token, POS, lemma, named entity,
etc.) and let BART handle all the things we cannot yet produce (e.g. gender and
so on). Getting anything to work would already be quite nice; factoring out
additional steps could happen afterwards.
Btw. I also downloaded BART 2.0 and had a brief look. I still have no idea
where to hook in :( It seems the code does not contain just a single component
or pipeline, but is rather a full coreference construction kit with many things
that one wouldn't actually need just to do the "default" coref resolution.
Original comment by richard.eckart
on 3 Oct 2013 at 8:13
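For illustration, a rough uimaFIT-style sketch of the pipeline shape described
above: DKPro Core components provide token, POS, lemma and named entity
annotations, and a hypothetical BartCoreferenceResolver wrapper would consume
them. The reader and component class names below are placeholders, not existing
components.

import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;
import static org.apache.uima.fit.factory.CollectionReaderFactory.createReaderDescription;

import org.apache.uima.fit.pipeline.SimplePipeline;

public class BartPipelineSketch {
    public static void main(String[] args) throws Exception {
        // The reader and preprocessing classes below stand in for whatever
        // DKPro Core components are actually used; BartCoreferenceResolver
        // is a hypothetical wrapper that would hand the CAS contents to BART.
        SimplePipeline.runPipeline(
            createReaderDescription(TextReader.class,
                TextReader.PARAM_SOURCE_LOCATION, "input/*.txt",
                TextReader.PARAM_LANGUAGE, "en"),
            createEngineDescription(Segmenter.class),              // tokens, sentences
            createEngineDescription(PosTagger.class),              // POS tags
            createEngineDescription(Lemmatizer.class),             // lemmas
            createEngineDescription(NamedEntityRecognizer.class),  // named entities
            createEngineDescription(BartCoreferenceResolver.class)); // coreference (hypothetical)
    }
}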
As far as I understand, BART does not bring along the preprocessing resources
for German:
README:
"We do not support preprocessing for languages other than English. So, to run
BART on another language, you first have to preprocess your data yourself,
generating all the necessary markable levels, including the "markable" level
that contains info on the mentions. In sample/generic-min, we show the minimal
amount of information to be provided to BART to run any experiment. In
sample/generic-max, we show the same documents, but with much more information
encoded both via MMAX levels and via attributes on the "markable" level."
...
"Prepare your dataset in the MMAX format, making sure that you include at least
all the information shown in the sample/generic-min example (that is: tokens in
Basedata/*words.xml, coreference levels, pos levels, markable levels specifying
markable_id and span for each markable). "
So if you have a look at the coreference levels, you find a very rich
annotation (in generic-min!) that would require morphosyntactic annotation as
well as our new SemanticFieldAnnotator (category="concrete")
e.g.
sample/generic-min/train/markables/wsjarrau_1128_coref_level.xml
<markable id="markable_77" span="word_1..word_4" generic="generic-no"
person="per3" related_object="no" gram_fnc="subj" number="sing" reference="new"
category="concrete" mmax_level="coref" gender="neut" min_words="picture"
min_ids="word_4" coref_set="set_48"/>
Original comment by eckle.kohler
on 3 Oct 2013 at 8:27
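For illustration, a minimal sketch of how such a markable element could be
generated programmatically (e.g. when converting existing annotations into the
MMAX format), using the standard StAX API; the helper and its parameters are
assumptions, not part of any existing converter.

import java.io.Writer;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;

public class MarkableWriterSketch {
    // Writes a single MMAX "markable" element with some of the attributes
    // seen in the generic-min sample (span, number, gender, category, coref_set).
    public static void writeMarkable(Writer out, String id, String span,
            String number, String gender, String category, String corefSet)
            throws Exception {
        XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        xml.writeStartDocument();
        xml.writeEmptyElement("markable");
        xml.writeAttribute("id", id);
        xml.writeAttribute("span", span);           // e.g. "word_1..word_4"
        xml.writeAttribute("mmax_level", "coref");
        xml.writeAttribute("number", number);       // e.g. "sing"
        xml.writeAttribute("gender", gender);       // e.g. "neut"
        xml.writeAttribute("category", category);   // e.g. "concrete"
        xml.writeAttribute("coref_set", corefSet);  // e.g. "set_48"
        xml.writeEndDocument();
        xml.close();
    }
}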
Writing out data in any particular MMAX2 dialect to have it processed by Bart
isn't something I would consider particularly desirable. I mean, there must be
some way to construct a model in-memory and pass that to whatever parts of Bart
perform the actual processing.
I'm a bit confused about that file
(sample/generic-min/train/markables/wsjarrau_1128_coref_level.xml). I would
expect that the tool comes with pretrained models and that the "coref" layer
would be the output.
So I gather, there is not only no pre-processing included for German, but there
are also no models included for German. In which case, we would need to train
our own models... ok, even if we wanted to do that, on which data?
Preprocessing for German producing these features mentioned above is one thing,
but in addition, we would need gold coreference annotations, right?
One of the main points of integrating many of the tools we integrate is that
we don't have to train models, because these tools already come with models. If
BART does not work out-of-the-box, then I actually do wonder if it's worth
bothering with it. I guess for English it works at least, doesn't it?
Original comment by richard.eckart
on 3 Oct 2013 at 8:42
>>So I gather, there is not only no pre-processing included for German, but
there are also no models included for German. In which case, we would need to
train our own models... ok, even if we wanted to do that, on which data?
Preprocessing for German producing these features mentioned above is one thing,
but in addition, we would need gold coreference annotations, right?
I had a look at the GermanLanguagePlugin. It is very basic and partly hacky,
but could be extended.
There is a dataset for German: http://stel.ub.edu/semeval2010-coref/node/7
- task: Detection of full coreference chains, composed of named entities,
pronouns, and full noun phrases.
- data: "The data set comes from the TüBa-D/Z Treebank (Hinrichs et al. 2005),
a German newspaper corpus based on data taken from the daily issues of "die
tageszeitung" (taz).
Hand-annotated with inflectional morphology, constituent structure, grammatical
functions, and anaphoric and coreference relations.
Training: 415k words."
>> I guess for English it works at least, doesn't it?
Right, I just tried it - it works out of the box as a web demo.
Original comment by eckle.kohler
on 3 Oct 2013 at 9:00
I'm not sure if I made my doubts regarding the MMAX coref layer you previously
referred to sufficiently explicit. I'll try again (but I haven't done any further
investigation yet).
The min example contains three layers: markable, pos, and coref.
The max example contains more layers: markable, chunk, enamex, lemma, morph,
parse, phrase, unit, and coref.
I expect that coref is the output of BART while the other layers are input. So
I would expect that minimally, BART can work with pos information. If there is
morphological information in the coref layer, I would expect that to be ignored
or be generated by BART as part of the processing - but not prior to the
processing.
The "morph" layer in the max example appears to contain only lemma information
- in fact, it appears to be the same as the "lemma" layer.
There is a layer "markable" with additional semantic information in the max
example, but this information is not present in the min example (the layer is
there however).
So, I suppose that at least for English, it should be possible to get quite far
with the pre-processing components that we have, possibly including the
SemanticFieldAnnotator or something equivalent which may be included directly
with BART (based on WordNet). Since English works out-of-the-box, there may be
some kind of morphological analysis included with BART as well (also based on
WordNet?).
Original comment by richard.eckart
on 4 Oct 2013 at 12:33
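For illustration, a rough sketch of how the MMAX input levels from the max
example might map onto annotation layers DKPro Core can already produce; which
levels BART strictly requires as input is an assumption based on the min/max
samples discussed above.

import java.util.LinkedHashMap;
import java.util.Map;

public class MmaxLevelMappingSketch {
    // Candidate mapping from MMAX levels (as seen in sample/generic-max)
    // to the annotation layers that could fill them.
    public static Map<String, String> candidateMapping() {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("words (Basedata)", "Token");
        m.put("pos", "POS");
        m.put("lemma", "Lemma");
        m.put("chunk", "Chunk");
        m.put("enamex", "NamedEntity");
        m.put("parse / phrase", "Constituent (parser output)");
        m.put("morph", "appears to duplicate the lemma level (see above)");
        m.put("coref", "expected BART output, not input");
        return m;
    }
}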
I did some code reading and think I have largely understood how BART works.
This can be discussed in one of the upcoming meetings.
Regarding http://code.google.com/p/dkpro-core-asl/issues/detail?id=258#c12
- you are right regarding the MMAX layers
- for English, we will get worse results than BART end-to-end if we use just
our preprocessing
- for German, we should employ POS tagging and parsing, but we will probably get
much worse results than for English because of the German language plugin
currently provided:
BART is something of a knowledge-based system, and the German language plugin is
still rather knowledge-poor compared to the English language plugin.
Original comment by eckle.kohler
on 4 Oct 2013 at 8:04
Original comment by richard.eckart
on 14 Aug 2014 at 10:05
Original comment by richard.eckart
on 22 Jan 2015 at 10:42
Original issue reported on code.google.com by
nico.erbs@gmail.com
on 1 Oct 2013 at 12:36