Open marcverhagen opened 6 years ago
I prefer the first option as that is consistent with what we do for pos and lemma. Then maybe what we need is a way to describe the contents of complex feature values. For example, maybe the type could be the URL to a JSON schema.
Binding morph to token as a single map structure might be problematic for those non-English examples where a single token decomposes into multiple morphemes.
Having layered annotations for morphemes and morphology (or morphAnalysis) can solve the issue of anchoring morphemes to their surface forms (token, sentence, or any region). In this solution, morphAnalysis can hold an ordered list of morphemes and a feature structure object inside.
An example could be:
{
  "annotations": [
    {
      "@type": "Token",
      "id": "t1",
      "start": 0,
      "end": 2,
      "features": {
        "word": "im"
      }
    },
    {
      "@type": "Morph",
      "id": "m1",
      "target": "t1",
      "features": {
        "list_of_morphemes": ["mor1", "mor2"],
        "person": "m",
        "case": "dative",
        ...
      }
    },
    {
      "@type": "Morpheme",
      "id": "mor1",
      "features": {
        "lemma": "in"
      }
    },
    {
      "@type": "Morpheme",
      "id": "mor2",
      "features": {
        "lemma": "dem"
      }
    }
  ]
}
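To make the indirection concrete, here is a minimal Python sketch of how a consumer might resolve the morphemes for a token from such a flat annotation list. It assumes the second morpheme carries the id "mor2" (so that list_of_morphemes resolves to two distinct entries); the function name morpheme_lemmas is hypothetical and not part of any LAPPS library.

```python
# Flat list of annotations, mirroring the JSON example (trimmed to the
# features needed here). The id "mor2" is an assumption.
annotations = [
    {"@type": "Token", "id": "t1", "start": 0, "end": 2,
     "features": {"word": "im"}},
    {"@type": "Morph", "id": "m1", "target": "t1",
     "features": {"list_of_morphemes": ["mor1", "mor2"], "case": "dative"}},
    {"@type": "Morpheme", "id": "mor1", "features": {"lemma": "in"}},
    {"@type": "Morpheme", "id": "mor2", "features": {"lemma": "dem"}},
]


def morpheme_lemmas(annotations, token_id):
    """Return the ordered lemmas of all morphemes attached to a token.

    Hypothetical helper: follows Morph.target to the token and then
    Morph.features.list_of_morphemes to the Morpheme annotations.
    """
    by_id = {ann["id"]: ann for ann in annotations}
    lemmas = []
    for ann in annotations:
        if ann["@type"] == "Morph" and ann.get("target") == token_id:
            for morpheme_id in ann["features"]["list_of_morphemes"]:
                lemmas.append(by_id[morpheme_id]["features"]["lemma"])
    return lemmas


print(morpheme_lemmas(annotations, "t1"))
```

Note that the Morpheme annotations carry no offsets of their own here; the only anchor to the text is the Morph annotation's target.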
Some elaboration on the previous comment...
Having a Morphology annotation type allows us to associate morphological features with any other annotation or set of annotations (because target can refer to more than one element). And a morphemes feature on Morphology allows levels of morphological analysis, since you can have Morphology annotations with all their features pointing at morphemes.
As noted above, having those levels can deal with the "im == in dem" problem (the multi-word tokens from UD), as long as we are willing to live with calling "in" and "dem" morphemes, which I find moderately disturbing.
Having two annotation types agrees with WebLicht's TCF morphology tag which has an analysis part and a segmentation part (although it is not quite clear to me how the segmentation is used) and the DKPro type system (although there Morpheme has no description).
The mockup for the next vocabulary at http://vocab.lappsgrid.org/1.3.0-SNAPSHOT/ has Morphology as a subtype of Region. This is consistent in that we want a morphology to point to annotations via targets, but we vaguely hint that the targets should be a contiguous sequence. I see two problems there: (1) we may want a morphology to point at discontinuous annotations, and (2) some of the current uses of targets do have gaps between the individual annotations (for example the spaces between tokens in a sentence). Two potential solutions: (1) allow a Region to be discontinuous, (2) make Morphology a subtype of Annotation (like we do with Coreference, PhraseStructure and DependencyStructure, although issue #64 is suggesting to move those last two to Region).
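The contiguity question can be checked mechanically. The sketch below, in Python, reports the character ranges that a set of targets leaves uncovered; the function name gaps_between and the start/end keys on the target dictionaries are assumptions made for illustration, not vocabulary definitions.

```python
def gaps_between(targets):
    """Return (start, end) offset pairs left uncovered between targets.

    Hypothetical helper: each target is a dict with integer "start" and
    "end" offsets. An empty result means the targets form one contiguous
    (possibly overlapping) span.
    """
    spans = sorted((t["start"], t["end"]) for t in targets)
    if not spans:
        return []
    gaps = []
    covered_to = spans[0][1]
    for start, end in spans[1:]:
        if start > covered_to:
            gaps.append((covered_to, start))
        covered_to = max(covered_to, end)
    return gaps


# Two tokens separated by a single space, as in a tokenized sentence.
tokens = [
    {"id": "t1", "start": 0, "end": 2},
    {"id": "t2", "start": 3, "end": 6},
]
print(gaps_between(tokens))
```

Under a strict reading of "contiguous", even an ordinary sentence whose targets are tokens would fail this check because of the whitespace between tokens, which is the second problem mentioned above.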
Also in http://vocab.lappsgrid.org/1.3.0-SNAPSHOT/ we have a morph attribute on Token and the value of that attribute is a Morphology annotation. This may be a problem for two reasons:
(1) A Morphology can be associated with any category (or at least with more than one) so do we want to define morph as a feature for all those categories?
(2) Having Morphology embedded in another annotation breaks the unwritten rule that all annotations are put in a flat list. As remarked by Keith this rule may have been unintended. What we did consciously decide is that we would not have any tree structure (for example for phrase structures we just have a flat list of constituents with children attributes).
Finally, where in the hierarchy do we want to put Morpheme? Region seems to make some sense, but in some cases there are no clear offsets that we can associate the Morpheme with, and all we can do is have a target which will point out a wider region. Maybe we can have Morpheme point back to the Morphology that it is part of?
If I remember correctly, the Morpheme type in DKPro Core is deprecated and no longer used. It should be the predecessor of the MorphologicalAnalysis type that we have now.
In order to address things such as im = in dem, we have the following relevant issues in DKPro Core:
The idea is that we create two tokens on im, one with the form=in;order=1 and the other with the form=dem;order=2, and then we attach a MorphologicalFeatures annotation to each of these Token annotations.
Mind that this is so far only a concept and has not been (fully) implemented yet. We do have the form feature already, but not the order.
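As a rough illustration of that concept, the Python sketch below models two tokens that share the offsets of im and differ in form and order. This is plain dictionary data, not the DKPro Core API; the key names begin, end, form and order, the sample text and the function expanded_forms are all assumptions.

```python
text = "im Haus"

# Two tokens cover the same surface span "im"; a third token has no
# separate form and falls back to its covered text.
tokens = [
    {"begin": 0, "end": 2, "form": "dem", "order": 2},
    {"begin": 0, "end": 2, "form": "in", "order": 1},
    {"begin": 3, "end": 7, "form": None, "order": None},
]


def expanded_forms(text, tokens):
    """Return token forms in reading order.

    Tokens are sorted by begin offset first and by the hypothetical
    "order" feature second, so tokens sharing a span keep their
    intended sequence.
    """
    ordered = sorted(tokens, key=lambda t: (t["begin"], t["order"] or 0))
    return [t["form"] or text[t["begin"]:t["end"]] for t in ordered]


print(expanded_forms(text, tokens))
```

Without something like order, two tokens with identical offsets have no defined sequence, which is presumably why the feature is proposed.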
Yeah, WebLicht had a similar proposed solution and we have pondered several additions to Token which are also similar to yours. We had a few minor misgivings with that which I will try to remember.
We have no place for morphology except by putting arbitrary attributes in the features dictionary.
Two options:
(1) Add morphology as a property to Token with a map as its value. The disadvantage is that we have no way of specifying what features we want to use. The advantage is that this is the simplest way and that it does not increase the number of types in the vocabulary. In addition, the token seems to be a natural place to express this.
(2) Add a Morphology annotation type which has an identifier that points at a Token or any other annotation type. This will allow us to specify what the morphological features are, but at the cost of added complexity in the vocabulary. This is the approach that many others wanted us to use for parts-of-speech.
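The two options can be put side by side in a small Python sketch. The placement of morphology inside the token's features dictionary, the target key on Morphology, and the helper morphology_of are assumptions for illustration only; neither layout is taken from the published vocabulary.

```python
# Option 1: morphology is a map stored directly on the Token.
option_one = [
    {"@type": "Token", "id": "t1",
     "features": {"word": "dem", "morphology": {"case": "dative"}}},
]

# Option 2: a separate Morphology annotation points at the Token.
option_two = [
    {"@type": "Token", "id": "t1", "features": {"word": "dem"}},
    {"@type": "Morphology", "id": "m1", "target": "t1",
     "features": {"case": "dative"}},
]


def morphology_of(annotations, token_id):
    """Return the morphological features for a token under either layout.

    Hypothetical helper: checks the embedded map first, then looks for
    a standalone Morphology annotation targeting the token.
    """
    for ann in annotations:
        if ann["@type"] == "Token" and ann["id"] == token_id:
            embedded = ann["features"].get("morphology")
            if embedded is not None:
                return embedded
    for ann in annotations:
        if ann["@type"] == "Morphology" and ann.get("target") == token_id:
            return ann["features"]
    return None


print(morphology_of(option_one, "t1"))
print(morphology_of(option_two, "t1"))
```

Both layouts yield the same features for the consumer; the difference is that the second adds a type to the vocabulary, where its features can be specified, and requires a lookup by target.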