Closed GoogleCodeExporter closed 9 years ago
Currently the type "Compound" supports only two splits (part1 and part2) and
cannot even annotate their begin and end positions.
I propose a new type system for compounds with three types:
- Compound extends Annotation
-- splits: List<Split>
- Split extends Annotation
-- linkingMorpheme : LinkingMorpheme
- LinkingMorpheme extends Annotation
This assumes that each Split has at most one following linking morpheme. I'm
not sure if this can be assumed.
Additionally, any compound splitter should support two modes:
- Annotate a token with the types defined above
- Split a token into new tokens that represent the splits either with trailing
linking morphemes or without (controllable via an option)
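A minimal sketch of the second mode, assuming plain Java classes instead of the actual UIMA types. The class name SplitModeSketch, the linkEnd field, the keepLinkingMorphemes option and the example word "Arbeitsamt" (Arbeit + s + Amt) are illustrative assumptions, not part of the proposal:

```java
import java.util.ArrayList;
import java.util.List;

public class SplitModeSketch {
    // Hypothetical stand-in for the proposed Split annotation: begin/end offsets
    // into the token text, plus the end offset of an optional trailing linking morpheme.
    static final class Split {
        final int begin;
        final int end;
        final int linkEnd; // equals end when no linking morpheme follows

        Split(int begin, int end, int linkEnd) {
            this.begin = begin;
            this.end = end;
            this.linkEnd = linkEnd;
        }
    }

    // Turns the splits of one token into new token strings, either with or
    // without the trailing linking morpheme (the assumed option).
    static List<String> toTokens(String token, List<Split> splits, boolean keepLinkingMorphemes) {
        List<String> result = new ArrayList<>();
        for (Split s : splits) {
            int end = keepLinkingMorphemes ? s.linkEnd : s.end;
            result.add(token.substring(s.begin, end));
        }
        return result;
    }

    public static void main(String[] args) {
        String token = "Arbeitsamt";
        List<Split> splits = new ArrayList<>();
        splits.add(new Split(0, 6, 7));   // "Arbeit" followed by linking "s"
        splits.add(new Split(7, 10, 10)); // "amt", no linking morpheme
        System.out.println(toTokens(token, splits, true));  // [Arbeits, amt]
        System.out.println(toTokens(token, splits, false)); // [Arbeit, amt]
    }
}
```

The first mode would keep the original token and only attach the offsets as annotations, so it needs no string manipulation at all.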
Original comment by richard.eckart
on 27 Jun 2012 at 7:58
This looks good.
>>This assumes that each Split has at most one following linking morpheme. I'm
not sure if this can be assumed.
I think it is safe to assume this; it has also been assumed in this ACL paper on
language-independent compound splitting:
http://www.aclweb.org/anthology/P/P11/P11-1140.pdf
Judith
Original comment by eckle.kohler
on 29 Jun 2012 at 8:12
The above suggestion of the three types does not allow bracketing to be modeled.
Maybe the following system would be better:
type Compound extends Annotation {
List<Split> splits;
}
type Split extends Annotation {
List<Split> splits;
String type [morpheme, linking-morpheme]
}
Depending on the configuration of the decompounding engine, this system can
model:
1) a flat splitting scheme: with one Compound and several splits inside. Each
split would not have additional children
2) a bracketed scheme: with one Compound at the root which has one child
covering the complete token which in turn recursively has 2-5 children (S S - S
L S - S S L S - S L S S - S L S L S) - this could be extended if the language
analyzed requires it. The type feature indicates whether a split is a linking
morpheme or a regular morpheme.
With the system above, there is also no restriction to a single linking
morpheme per split.
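To illustrate the recursive structure, here is a rough sketch using plain Java objects rather than the real UIMA types. The class RecursiveSplitSketch, the render helper, the "+" marker for linking morphemes and the example word "Arbeitsamt" are assumptions made only for this illustration:

```java
import java.util.Arrays;
import java.util.List;

public class RecursiveSplitSketch {
    // Plain-Java stand-in for "type Split extends Annotation".
    static final class Split {
        final String text;
        final String type; // "morpheme" or "linking-morpheme"
        final List<Split> splits;

        Split(String text, String type, Split... children) {
            this.text = text;
            this.type = type;
            this.splits = Arrays.asList(children);
        }
    }

    // Renders the nesting with brackets; linking morphemes are prefixed with '+'.
    static String render(Split s) {
        if (s.splits.isEmpty()) {
            return ("linking-morpheme".equals(s.type) ? "+" : "") + s.text;
        }
        StringBuilder sb = new StringBuilder("(");
        for (int i = 0; i < s.splits.size(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(render(s.splits.get(i)));
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) {
        // Bracketed scheme, "S L S" pattern: one root split covering the whole
        // token with three children.
        Split root = new Split("Arbeitsamt", "morpheme",
                new Split("Arbeit", "morpheme"),
                new Split("s", "linking-morpheme"),
                new Split("amt", "morpheme"));
        System.out.println(render(root)); // (Arbeit +s amt)
    }
}
```

In the flat scheme the children would hang directly off the Compound and none of them would have children of their own.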
Original comment by richard.eckart
on 4 Aug 2012 at 4:47
I have committed the types as described above for now.
Original comment by richard.eckart
on 4 Aug 2012 at 6:56
Original comment by richard.eckart
on 13 Oct 2012 at 6:31
Original comment by richard.eckart
on 13 Oct 2012 at 6:33
Original comment by richard.eckart
on 21 Feb 2013 at 9:49
Issue 138 has been merged into this issue.
Original comment by richard.eckart
on 8 May 2013 at 1:44
Pedro: Currently DKPro-Core has a Compound type and a Split type. However,
there is no type representing a linking morpheme in DKPro-Core.
Original comment by richard.eckart
on 8 May 2013 at 1:45
In fact, after looking at the current types again (as explained earlier in this
issue), we have a Split with type "linking-morpheme" to represent that.
One problem, though, may be that it's not easy to select(jcas,
LinkingMorpheme.class) or select(jcas, Morpheme.class). If we want that, we may
want to remove the "type" feature and instead add subclasses.
However, we still have some time before the release and can change that bit
later too. Better get going with the integration of the new decompounding
infrastructure ;)
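A rough sketch of the subclass variant, with a hand-written select over a plain list standing in for the uimaFIT call. The class SelectByTypeSketch and its helper are hypothetical; only the names Split, Morpheme and LinkingMorpheme come from the discussion above:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SelectByTypeSketch {
    static class Split {
        final String text;

        Split(String text) {
            this.text = text;
        }
    }

    // Subclasses replace the string-valued "type" feature.
    static class Morpheme extends Split {
        Morpheme(String text) {
            super(text);
        }
    }

    static class LinkingMorpheme extends Split {
        LinkingMorpheme(String text) {
            super(text);
        }
    }

    // Stand-in for a class-based select: keeps only instances of the requested class.
    static <T extends Split> List<T> select(List<Split> all, Class<T> type) {
        List<T> out = new ArrayList<>();
        for (Split s : all) {
            if (type.isInstance(s)) {
                out.add(type.cast(s));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Split> all = Arrays.asList(
                new Morpheme("Arbeit"), new LinkingMorpheme("s"), new Morpheme("amt"));
        System.out.println(select(all, LinkingMorpheme.class).size()); // 1
        System.out.println(select(all, Morpheme.class).size());        // 2
    }
}
```

With the "type" feature instead, every caller would have to select all Split instances and compare the feature value by hand.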
Original comment by richard.eckart
on 8 May 2013 at 2:09
Besides that, it is also complicated to select the splits if the compound is
composed of more than 2 words. For example, for the compound
"Doppelprozessormaschine", select(jcas, Split.class) will return 4 instances:
"Doppel","prozessormaschine", "prozessor","maschine". So I really don't know if
this Split type is the best way to represent a split.
Original comment by pedrobss...@gmail.com
on 10 May 2013 at 3:03
That depends on the strategy of mapping the split results to the CAS.
* flat strategy: the annotator just produces a Compound with 4 Split
elements in it. select(jcas, Split.class) will return 4 splits for the compound.
* tree strategy: the annotator produces a fully nested split tree. In that
case, you wouldn't want to use select(jcas, Split.class) of course. You'd do a
select(jcas, Compound.class) and then follow the split tree manually.
There may either be two different annotators, one for the "flat" annotation and
one for the "tree" annotation style, or it may be a configuration parameter
controlling the style.
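Following the split tree manually could look roughly like this. It is a sketch over plain Java objects, not the actual CAS API; the class SplitTreeWalkSketch and both helper methods are made up for illustration, and the example is the "Doppelprozessormaschine" compound mentioned above:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SplitTreeWalkSketch {
    // Plain-Java stand-in for a Split annotation with nested child splits.
    static final class Split {
        final String text;
        final List<Split> splits;

        Split(String text, Split... children) {
            this.text = text;
            this.splits = Arrays.asList(children);
        }
    }

    // Every node of the tree in pre-order, comparable to selecting all Split
    // annotations at once.
    static List<String> allSplits(List<Split> roots) {
        List<String> out = new ArrayList<>();
        for (Split s : roots) {
            out.add(s.text);
            out.addAll(allSplits(s.splits));
        }
        return out;
    }

    // Only the leaves, i.e. the smallest parts reached by walking the tree.
    static List<String> leaves(List<Split> roots) {
        List<String> out = new ArrayList<>();
        for (Split s : roots) {
            if (s.splits.isEmpty()) {
                out.add(s.text);
            } else {
                out.addAll(leaves(s.splits));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Children of the Compound in the tree strategy.
        List<Split> compoundSplits = Arrays.asList(
                new Split("Doppel"),
                new Split("prozessormaschine",
                        new Split("prozessor"),
                        new Split("maschine")));
        System.out.println(allSplits(compoundSplits)); // [Doppel, prozessormaschine, prozessor, maschine]
        System.out.println(leaves(compoundSplits));    // [Doppel, prozessor, maschine]
    }
}
```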
Original comment by richard.eckart
on 10 May 2013 at 3:07
Yes, now it seems simpler. I haven't thought about the tree strategy. Good
point.
Original comment by pedrobss...@gmail.com
on 13 May 2013 at 10:27
So, is this issue fixed? Is the morpheme selection problem still an issue?
Original comment by pedrobss...@gmail.com
on 13 May 2013 at 10:37
We have a different problem now: two types called "Morpheme". We should avoid
having two types with the same name in different packages. We should try to get
some linguistically motivated suggestion on what the type names should be.
The "Morpheme" type for the splits is probably badly named (linking morpheme
should be ok). The other "Morpheme" type we have is likewise badly named. I
think both should be renamed.
Regarding renaming the "Morpheme" type we now have in the decompounding
system:
[1] calls these parts "heads"
[2] calls them "lexemes"
[3] calls them just "compound parts".
[1] http://www.aclweb.org/anthology/P/P11/P11-1140.pdf
[2] http://diotavelli.net/files/tmarek-linkmorphemes.pdf
[3] http://www.aclweb.org/anthology-new/P/P08/P08-2064.pdf
Original comment by richard.eckart
on 13 May 2013 at 6:24
I think CompoundPart would be a good one.
Original comment by pedrobss...@gmail.com
on 13 May 2013 at 9:17
Original comment by pedrobss...@gmail.com
on 16 May 2013 at 2:03
Original issue reported on code.google.com by
richard.eckart
on 27 Jun 2012 at 7:51