Adding morphology - Githubissues

marcverhagen commented 6 years ago

We have no place for morphology except by putting arbitrary attributes in the features dictionary.

Two options:

1. Add morphology as a property to Token, with a map as its value. The disadvantage is that we have no way of specifying which features we want to use. The advantage is that this is the simplest approach and it does not increase the number of types in the vocabulary. In addition, the token seems to be a natural place to express this.
2. Add a Morphology annotation type which has an identifier that points at a Token or any other annotation type. This will allow us to specify what the morphological features are, but at the cost of added complexity in the vocabulary. This is the approach that many others wanted us to use for parts-of-speech.
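
To make the difference between the two options concrete, here is a minimal Python sketch using plain dictionaries. The feature key `morph`, the `targets` key, the `number` feature, and all ids are assumptions chosen for illustration; none of them are settled vocabulary.

```python
# Option 1: morphology as a map-valued property on the Token itself.
# The key name "morph" and the feature names are assumptions.
token_option1 = {
    "@type": "Token",
    "id": "t1",
    "start": 0,
    "end": 6,
    "features": {"word": "houses", "morph": {"number": "plural"}},
}

# Option 2: a separate Morphology annotation that points at the Token.
token_option2 = {
    "@type": "Token",
    "id": "t1",
    "start": 0,
    "end": 6,
    "features": {"word": "houses"},
}
morphology_option2 = {
    "@type": "Morphology",
    "id": "m1",
    "features": {"targets": ["t1"], "number": "plural"},
}


def morph_features_option1(token):
    """Read the morphology map directly off the token."""
    return dict(token["features"].get("morph", {}))


def morph_features_option2(token, annotations):
    """Collect features from every Morphology annotation targeting the token."""
    merged = {}
    for ann in annotations:
        feats = ann["features"]
        if ann["@type"] == "Morphology" and token["id"] in feats.get("targets", []):
            merged.update({k: v for k, v in feats.items() if k != "targets"})
    return merged
```

With option 1 the lookup is a single dictionary access; with option 2 a consumer has to scan (or index) the annotation list, which is the added complexity mentioned above.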

ksuderman commented 5 years ago

I prefer the first option as that is consistent with what we do for pos and lemma. Then maybe what we need is a way to describe the contents of complex feature values. For example, maybe the type could be the URL to a JSON schema.
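
A rough sketch of what "the type is a URL to a JSON schema" could mean in practice. The schema URL, the feature names, and the allowed values are all hypothetical, and the checker below handles only the tiny subset of JSON Schema used here (`properties`, `enum`, `additionalProperties`); it is not a real validator.

```python
# Hypothetical schema for a morphology feature map. The URL, the feature
# names, and the allowed values are assumptions for illustration only.
MORPH_SCHEMA_URL = "https://example.org/schemas/morphology.json"
MORPH_SCHEMA = {
    "type": "object",
    "properties": {
        "case": {"enum": ["nominative", "accusative", "dative", "genitive"]},
        "number": {"enum": ["singular", "plural"]},
    },
    "additionalProperties": False,
}


def check_features(value, schema):
    """Return a list of problems; supports only the schema subset used above."""
    if not isinstance(value, dict):
        return ["value is not an object"]
    problems = []
    allowed = schema.get("properties", {})
    for key, val in value.items():
        if key not in allowed:
            # Unknown keys are only an error when the schema forbids them.
            if schema.get("additionalProperties", True) is False:
                problems.append(f"unexpected feature: {key}")
            continue
        enum = allowed[key].get("enum")
        if enum is not None and val not in enum:
            problems.append(f"bad value for {key}: {val}")
    return problems
```

A real implementation would presumably fetch the schema from the URL and hand it to an existing JSON Schema validator rather than hand-rolling the checks.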

keighrim commented 5 years ago

Binding morph to token as a single map structure might be problematic for non-English examples where a single token decomposes into multiple morphemes.

keighrim commented 5 years ago

Having layered annotations for morphemes and morphology (or morphAnalysis) can solve the issue of anchoring morphemes to their surface forms (token, sentence, or any region). In this solution, morphAnalysis can hold an ordered list of morphemes and a feature structure object inside.

An example could be:

{
    "annotations": [
        { 
            "@type": "Token", 
            "id": "t1",
            "start": 0,
            "end": 2,
            "features": {
                "word": "im"
            }
        },
        { 
            "@type": "Morph", 
            "id": "m1",
            "target": "t1", 
            "features": { 
                "list_of_morphemes": ["mor1", "mor2"], 
                "person": "m", 
                "case": "dative",
                ...
            }
        },
        { 
            "@type": "Morpheme", 
            "id": "mor1",
            "features": {
                "lemma": "in"
            }
        },
        { 
            "@type": "Morpheme", 
            "id": "mor2",
            "features": {
                "lemma": "dem"
            }
        }
    ]
}
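
For illustration, a consumer could resolve the morpheme references along these lines. This is only a sketch: it assumes annotation ids are unique, that the second Morpheme carries the id `mor2` that `list_of_morphemes` refers to, and it leaves out the elided features. The function name `morpheme_lemmas` is made up.

```python
annotations = [
    {"@type": "Token", "id": "t1", "start": 0, "end": 2,
     "features": {"word": "im"}},
    {"@type": "Morph", "id": "m1", "target": "t1",
     "features": {"list_of_morphemes": ["mor1", "mor2"], "case": "dative"}},
    {"@type": "Morpheme", "id": "mor1", "features": {"lemma": "in"}},
    {"@type": "Morpheme", "id": "mor2", "features": {"lemma": "dem"}},
]


def morpheme_lemmas(annotations, token_id):
    """Return the lemmas of the morphemes attached to a token, in list order."""
    by_id = {ann["id"]: ann for ann in annotations}  # assumes unique ids
    lemmas = []
    for ann in annotations:
        if ann["@type"] == "Morph" and ann.get("target") == token_id:
            for morpheme_id in ann["features"].get("list_of_morphemes", []):
                lemmas.append(by_id[morpheme_id]["features"]["lemma"])
    return lemmas
```

The order of the morphemes comes from `list_of_morphemes`, not from the position of the Morpheme annotations in the flat list.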

marcverhagen commented 5 years ago

Some elaboration on the previous comment...

Having a Morphology annotation type allows us to associate morphological features with any other annotation or set of annotations (because target can refer to more than one element). And a morphemes feature on Morphology allows levels of morphological analysis, since you can have Morphology annotations with all their features that in turn point at morphemes.

As noted above, having those levels can deal with the "im == in dem" problem (the multi-word tokens from UD), as long as we are willing to live with calling "in" and "dem" morphemes, which I find moderately disturbing.

Having two annotation types agrees with WebLicht's TCF morphology tag which has an analysis part and a segmentation part (although it is not quite clear to me how the segmentation is used) and the DKPro type system (although there Morpheme has no description).

The mockup for the next vocabulary at http://vocab.lappsgrid.org/1.3.0-SNAPSHOT/ has Morphology as a subtype of Region. This is consistent in that we want a morphology to point to annotations via targets, but we vaguely hint that the targets should be a contiguous sequence. I see two problems there: (1) we may want a morphology to point at discontinuous annotations, and (2) some of the current uses of targets do have gaps between the individual annotations (for example the spaces between tokens in a sentence). Two potential solutions: (1) allow a Region to be discontinuous, (2) make Morphology a subtype of Annotation (like we do with Coreference, PhraseStructure and DependencyStructure, although issue #64 is suggesting to move those last two to Region).
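
To illustrate the second problem, here is a small Python sketch of what a contiguity check over targets might look like. It assumes every target has `start` and `end` character offsets into the text; the function names and the `ignore_whitespace` switch are made up for this example.

```python
def gaps(regions):
    """Return (end, start) offset pairs for every gap between consecutive regions."""
    ordered = sorted(regions, key=lambda r: r["start"])
    return [
        (left["end"], right["start"])
        for left, right in zip(ordered, ordered[1:])
        if right["start"] > left["end"]
    ]


def is_contiguous(regions, text, ignore_whitespace=True):
    """True if the regions leave no gaps, optionally tolerating whitespace gaps."""
    for gap_start, gap_end in gaps(regions):
        between = text[gap_start:gap_end]
        if ignore_whitespace and between.strip() == "":
            continue
        return False
    return True


text = "im Haus am See"
tok_im = {"id": "t1", "start": 0, "end": 2}
tok_haus = {"id": "t2", "start": 3, "end": 7}
tok_am = {"id": "t3", "start": 8, "end": 10}
```

Under a strict reading (`ignore_whitespace=False`) even adjacent tokens are not contiguous because of the space between them, while targets such as `tok_im` and `tok_am` are discontinuous under either reading.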

Also in http://vocab.lappsgrid.org/1.3.0-SNAPSHOT/ we have a morph attribute on Token and the value of that attribute is a Morphology annotation. This may be a problem for two reasons:

1. A Morphology can be associated with any category (or at least with more than one) so do we want to define morph as a feature for all those categories?
2. Having Morphology embedded in another annotation breaks the unwritten rule that all annotations are put in a flat list. As remarked by Keith this rule may have been unintended. What we did consciously decide is that we would not have any tree structure (for example for phrase structures we just have a flat list of constituents with children attributes).
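
The flat-list point can be sketched in Python as a function that lifts an embedded Morphology out of a Token. The `morph` feature name follows the snapshot described above; the placement of `targets` inside `features`, the fallback id suffix `-morph`, and the function name are assumptions.

```python
def flatten(token):
    """Split a Token with an embedded Morphology into two flat annotations."""
    features = dict(token["features"])        # copy so the input is not mutated
    embedded = features.pop("morph", None)
    flat_token = {**token, "features": features}
    if embedded is None:
        return [flat_token]
    morphology = {
        "@type": "Morphology",
        "id": embedded.get("id", token["id"] + "-morph"),
        # The link direction is reversed: the Morphology now points at the Token.
        "features": {**embedded.get("features", {}), "targets": [token["id"]]},
    }
    return [flat_token, morphology]


nested = {
    "@type": "Token",
    "id": "t1",
    "start": 0,
    "end": 2,
    "features": {
        "word": "im",
        "morph": {"@type": "Morphology", "id": "m1", "features": {"case": "dative"}},
    },
}
flat = flatten(nested)
```

After flattening, both annotations sit side by side in one list and the relation is carried by `targets` instead of by nesting.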

Finally, where in the hierarchy do we want to put Morpheme? Region seems to make some sense, but in some cases there are no clear offsets that we can associate the Morpheme with, and all we can do is have a target which points to a wider region. Maybe we can have Morpheme point back to the Morphology that it is part of?

reckart commented 5 years ago

If I remember correctly, the Morpheme type in DKPro Core is deprecated and no longer used. It should be the predecessor of the MorphologicalAnalysis type that we have now.

In order to address things such as im = in dem, we have the following relevant issues in DKPro Core:

The idea is that we create two tokens on im, one with the form=in;order=1 and the other with the form=dem;order=2 - and then we attach a MorphologicalFeatures annotation to each of these Token annotations.

Mind that this is so far only a concept and has not been (fully) implemented yet. We do have the form feature already, but not the order.
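
As a rough illustration of the concept (plain Python dictionaries, not the DKPro Core API), two tokens sharing the span of im could be grouped and ordered like this. The dictionary layout and the function name are assumptions; only `form` and `order` come from the description above.

```python
from collections import defaultdict

# Two tokens covering the same span "im"; deliberately listed out of order.
tokens = [
    {"id": "t2", "start": 0, "end": 2, "form": "dem", "order": 2},
    {"id": "t1", "start": 0, "end": 2, "form": "in", "order": 1},
]


def forms_by_span(tokens):
    """Group tokens by (start, end) and list their forms sorted by 'order'."""
    grouped = defaultdict(list)
    for token in tokens:
        grouped[(token["start"], token["end"])].append(token)
    return {
        span: [t["form"] for t in sorted(members, key=lambda t: t["order"])]
        for span, members in grouped.items()
    }
```

Without the `order` feature there would be no reliable way to recover the sequence in, dem from two tokens with identical offsets.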

marcverhagen commented 5 years ago

Yeah, WebLicht had a similar proposed solution and we have pondered several additions to Token which are also similar to yours. We had a few minor misgivings with that which I will try to remember.

lapps / vocabulary-pages

Adding morphology #65