proycon / folia

FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange. Note that the actual Python library for processing FoLiA is implemented as part of PyNLPl; this repository contains higher-level tools that use the library as well as the full documentation, validation schemas, and set definitions.
http://proycon.github.io/folia/
GNU General Public License v3.0

FoLiA set definitions currently can't express constraints #50

Closed proycon closed 5 years ago

proycon commented 6 years ago

Continuation of INL/nederlab-linguistic-enrichment#17, @JessedeDoes wrote:

I see no obvious way in SKOS of declaring that feature f can be combined with PoS p, etc. (You could express that "Masculine" is a narrower class than "having a gender feature", etc., but how would one express, e.g., that "having a number feature" is a subset of the union of TW, WW, N, VNW? It would be possible in OWL.)

The old, more ad-hoc XML format had facilities for this, but we need modern RDF ones now and don't have any yet. Parsing these constraints would be needed for e.g. Frog (see LanguageMachines/frog#51), preferably without needing a complete OWL logic parser in Frog, I'd say.

luutuntin commented 5 years ago

Hi Maarten,

I wonder how we can express the constraints using the old more ad-hoc XML format as you mentioned above. For example, in our current set definition of POS annotation, there is no explicit connection between a part-of-speech class (e.g. NUM) and its morphological feature subsets (e.g. NUM_падеж and NUM_род) or between a group of part-of-speech classes (e.g. A, ANUM and APRO) and their morphological feature subsets (e.g. A_падеж, A_число, A_форма, A_род, A_степень-сравнения and A_у-л). Could you show me how to make this kind of connection explicit in our set definition?

Best, Alex

proycon commented 5 years ago

Hi Alex,

Whilst we did define some ways to define constraints in the legacy format, this was never really used nor implemented in the FoLiA libraries. So that's probably not the way to go. I see your problem, and I guess you want something that resonates in the FLAT interface? We should then work on this issue.

luutuntin commented 5 years ago

Thank you very much! You read my mind. We should definitely work on this.

proycon commented 5 years ago

Here's a proposal for a new constraint mechanism in the FoLiA set definitions (it replaces the old mechanism that was once documented but never implemented nor used anywhere). It allows for defining constraints on which subsets can be used with which classes and which classes within subsets can be combined. Possibly relevant for @JesseDeDoes and @menzowindhouwer as well, as we had some previous discussions on this.

FoLiA Set Definitions are currently defined in RDF using SKOS. SKOS has no mechanism to express such constraints. Whilst it could probably be done in OWL or SHACL, set definitions merely define the vocabulary, and it is the class instances in the FoLiA documents (which are not RDF) that use the vocabulary and need to be validated against the constraints. So I want to opt for a simple yet flexible solution without needing a full semantic OWL parser.

Technical description

I want to implement a new fsd:Constraint class that holds one or more fsd:constrain relations that refer to the ID of 1) a subset (skos:Collection), 2) a class (skos:Concept) or 3) a constraint (fsd:Constraint). The latter allows for nesting and complex constraints. The Constraint class would have a type relation that can be set to 1) any (match any, i.e. a disjunction), 2) all (conjunction) or 3) none (exclusion).

Individual sets/subsets (skos:Collection) and classes (skos:Concept) may in turn carry a constrain relation of their own.
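To make the proposed semantics concrete, here is a minimal Python sketch of how a validator might evaluate such constraints. The `Constraint` dataclass, the `satisfied` function and the idea of passing in the set of class/subset IDs present on an annotation are all assumptions made for illustration, not the FoLiA library's actual API. The `c.not_verb` and `c.nested` constraints are hypothetical as well.

```python
from dataclasses import dataclass, field


@dataclass
class Constraint:
    """Hypothetical in-memory form of an fsd:Constraint."""
    id: str
    type: str  # "any", "all" or "none"
    constrain: list = field(default_factory=list)  # IDs of classes, subsets or constraints


def satisfied(ref, present, constraints):
    """Return True if `ref` holds given the set of IDs `present` on an annotation.

    `ref` is either the ID of a constraint (evaluated recursively, which is
    what allows nesting) or the ID of a class/subset (checked for presence).
    """
    if ref not in constraints:
        return ref in present
    c = constraints[ref]
    results = [satisfied(r, present, constraints) for r in c.constrain]
    if c.type == "any":   # disjunction
        return any(results)
    if c.type == "all":   # conjunction
        return all(results)
    if c.type == "none":  # exclusion
        return not any(results)
    raise ValueError(f"unknown constraint type: {c.type}")


constraints = {
    "c.gender": Constraint("c.gender", "any", ["N", "A"]),
    "c.not_verb": Constraint("c.not_verb", "none", ["V"]),
    "c.nested": Constraint("c.nested", "all", ["c.gender", "c.not_verb"]),
}

print(satisfied("c.gender", {"N"}, constraints))
print(satisfied("c.nested", {"A", "V"}, constraints))
```

The recursion is what keeps this far simpler than a general OWL reasoner: a reference is either a leaf ID or another constraint, and nothing else needs to be inferred.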

Examples

Although the legacy format is deprecated, I'll implement it for that as well (since it's more accessible than RDF/SKOS for most users and I have a tool that converts it automatically anyway). I'll start with a Part-of-Speech example from @luutuntin there:

<set xml:id="birch_pos" type="closed" xmlns="http://ilk.uvt.nl/folia">
    <class xml:id="N" label="N (существительное)" />

    <constraint xml:id="c.N" type="any">
        <constrain id="N" />
    </constraint>

    <subset xml:id="N_род" type="closed">
      <class xml:id="m" label="муж" />
      <class xml:id="f" label="жен" />
      <class xml:id="mf" label="мж" />
      <class xml:id="n" label="сред" />
      <constrain id="c.N" />
    </subset>

    <subset xml:id="N_одушевленность" type="closed">
      <class xml:id="anim" label="од" />
      <class xml:id="inan" label="неод" />
      <constrain id="c.N" />
    </subset>
</set>

Both subsets here are constrained to be used only when the main class is set to N. The constraint is defined once and used twice so we can avoid duplication, which makes sense in case of more complicated constraints (i.e. multiple constrain elements forming a real conjunction/disjunction). In this case, however, the whole thing can be simplified by pointing to the class directly:

<set xml:id="birch_pos" type="closed" xmlns="http://ilk.uvt.nl/folia">
    <class xml:id="N" label="N (существительное)" />

    <subset xml:id="N_род" type="closed">
      <class xml:id="m" label="муж" />
      <class xml:id="f" label="жен" />
      <class xml:id="mf" label="мж" />
      <class xml:id="n" label="сред" />
      <constrain id="N" />
    </subset>

    <subset xml:id="N_одушевленность" type="closed">
      <class xml:id="anim" label="од" />
      <class xml:id="inan" label="неод" />
      <constrain id="N" />
    </subset>
</set>

An example of a slightly more complex constraint, a subset constrained to be used with either nouns or adjectives:

    <class xml:id="N" label="Noun (существительное)" />
    <class xml:id="A" label="Adjective (прилагательное)" />

    <constraint xml:id="c.gender" type="any">
        <constrain id="N" />
        <constrain id="A" />
    </constraint>

    <subset xml:id="gender" type="closed">
      <class xml:id="m" label="муж" />
      <class xml:id="f" label="жен" />
      <class xml:id="n" label="сред" />
      <constrain id="c.gender" />
    </subset>

Now the same thing in proper RDF/SKOS (with the custom extension in the fsd (FoLiA Set Definition) namespace):

example:N a skos:Concept ;
    skos:notation "N" ;
    skos:prefLabel "Noun (существительное)" .

example:A a skos:Concept ;
    skos:notation "A" ;
    skos:prefLabel "Adjective (прилагательное)" .

example:c.gender a fsd:Constraint ;
    fsd:type fsd:any ;
    fsd:constrain example:N ;
    fsd:constrain example:A .

example:gender a skos:Collection ;
    skos:member example:m, example:f, example:n ;
    fsd:constrain example:c.gender .
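For the legacy XML form of this noun/adjective example, reading the constraints out of a set definition could look roughly like the sketch below. It uses only Python's standard `xml.etree.ElementTree`; the function name `load_constraints`, the returned data shapes and the wrapping `<set>` with ID `example_pos` are assumptions for illustration and do not reflect how the FoLiA library does it.

```python
import xml.etree.ElementTree as ET

FOLIA = "{http://ilk.uvt.nl/folia}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# The fragment from above, wrapped in a hypothetical <set> so it parses on its own.
SETDEF = """<set xml:id="example_pos" type="closed" xmlns="http://ilk.uvt.nl/folia">
    <class xml:id="N" label="Noun (существительное)" />
    <class xml:id="A" label="Adjective (прилагательное)" />
    <constraint xml:id="c.gender" type="any">
        <constrain id="N" />
        <constrain id="A" />
    </constraint>
    <subset xml:id="gender" type="closed">
        <class xml:id="m" label="муж" />
        <class xml:id="f" label="жен" />
        <class xml:id="n" label="сред" />
        <constrain id="c.gender" />
    </subset>
</set>"""


def load_constraints(xml_text):
    """Return (constraints, targets) parsed from a legacy XML set definition.

    constraints: constraint ID -> (type, [referenced IDs])
    targets:     subset/class ID -> [IDs referenced by its <constrain> children]
    """
    root = ET.fromstring(xml_text)
    constraints = {}
    for c in root.iter(FOLIA + "constraint"):
        refs = [r.get("id") for r in c.findall(FOLIA + "constrain")]
        constraints[c.get(XML_ID)] = (c.get("type"), refs)
    targets = {}
    for el in root.iter():
        if el.tag in (FOLIA + "subset", FOLIA + "class"):
            refs = [r.get("id") for r in el.findall(FOLIA + "constrain")]
            if refs:
                targets[el.get(XML_ID)] = refs
    return constraints, targets


constraints, targets = load_constraints(SETDEF)
print(constraints)
print(targets)
```

Keeping the constraint definitions and the places that refer to them in two separate mappings mirrors the split between fsd:Constraint entities and fsd:constrain relations.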

Constraints may also be specified on individual classes (of either subsets or sets), which is only useful in case they can't be specified on the subset/set level as a whole:

    <subset xml:id="gender" type="closed">
      <class xml:id="m" label="муж">
          <constrain id="c.gender" />
      </class>
      ...
    </subset>

example:m a skos:Concept ;
    skos:notation "m" ;
    skos:prefLabel "муж" ;
    fsd:constrain example:c.gender .

It will be up to the FoLiA library to implement the necessary deep validation behaviour. The caveat is that it's up to the set definition provider to make sure the constraints provided make sense, as it is possible to define contradictory constraints.
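A set definition provider could run a rough sanity check for such contradictions before publishing. The Python sketch below is one possible approach under stated assumptions: constraints are held as a plain dict of `id -> (type, [refs])`, and a constraint counts as contradictory when no combination of the class/subset IDs it mentions can satisfy it. The names `holds`, `leaves` and `satisfiable`, and the `c.N`, `c.notN` and `c.bad` constraints, are hypothetical. The check is brute force, so it only suits small constraints, and it knows nothing about FoLiA-specific rules.

```python
from itertools import combinations


def holds(ref, present, constraints):
    """Evaluate a class/subset ID or a (possibly nested) constraint ID."""
    if ref not in constraints:
        return ref in present
    ctype, refs = constraints[ref]
    results = [holds(r, present, constraints) for r in refs]
    if ctype == "any":
        return any(results)
    if ctype == "all":
        return all(results)
    if ctype == "none":
        return not any(results)
    raise ValueError(f"unknown constraint type: {ctype}")


def leaves(ref, constraints, path=frozenset()):
    """Collect the class/subset IDs a constraint ultimately refers to."""
    if ref not in constraints:
        return {ref}
    if ref in path:
        raise ValueError(f"cyclic constraint reference: {ref}")
    found = set()
    for r in constraints[ref][1]:
        found |= leaves(r, constraints, path | {ref})
    return found


def satisfiable(ref, constraints):
    """True if at least one combination of leaf IDs satisfies `ref`."""
    ids = sorted(leaves(ref, constraints))
    for n in range(len(ids) + 1):
        for combo in combinations(ids, n):
            if holds(ref, set(combo), constraints):
                return True
    return False


example = {
    "c.N": ("all", ["N"]),
    "c.notN": ("none", ["N"]),
    "c.bad": ("all", ["c.N", "c.notN"]),  # requires N and forbids N at once
}
print(satisfiable("c.N", example))
print(satisfiable("c.bad", example))
```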

Any thoughts or comments?

luutuntin commented 5 years ago

Does this mean that fsd:constrain relations don't have their own IDs? And if that is the case, how can we refer to an fsd:constrain relation in an fsd:Constraint that holds multiple fsd:constrain relations? I'm considering the following example:

...
    <class xml:id="A" label="Adjective (прилагательное)" />
    <class xml:id="N" label="Noun (существительное)" />
    <class xml:id="NUM" label="NUM (числительное)" />
    <class xml:id="V" label="V (глагол)" />
...
    <constraint xml:id="c.gender" type="any">
        <constrain id="A" />
        <constrain id="N" />
        <constrain id="NUM" />
        <constrain id="V" />
    </constraint>
...
    <subset xml:id="gender" type="closed">
        <class xml:id="m" label="муж" /> <!-- only available for A, N, V --> 
            <!-- constrain ??? -->
        <class xml:id="f" label="жен" />
            <constrain id="c.gender" />
        <class xml:id="n" label="сред" /> <!-- only available for A, N, V -->
            <!-- constrain ??? -->
        <class xml:id="mf" label="мж" /> <!-- only available for N, V -->
            <!-- constrain ??? -->
        <class xml:id="mn" label="мс" /> <!-- only available for NUM -->
            <!-- constrain ??? -->
    </subset>
...

proycon commented 5 years ago

No, the constrain properties don't have IDs; they refer to IDs instead. A Constraint is the entity that holds one or more constrain relations and has an ID.

So elaborating on your example, constraining individually on all of the classes of the subset, you'd get something like:

    <constraint xml:id="c.ANV" type="any">
        <constrain id="A" />
        <constrain id="N" />
        <constrain id="V" />
    </constraint>

    <constraint xml:id="c.NV" type="any">
        <constrain id="N" />
        <constrain id="V" />
    </constraint>

    <subset xml:id="gender" type="closed">
        <class xml:id="m" label="муж">
            <constrain id="c.ANV" /><!-- only available for A, N, V -->
        </class>

        <class xml:id="f" label="жен">
            <constrain id="c.ANV" /><!-- only available for A, N, V -->
        </class>

        <class xml:id="n" label="сред">
            <constrain id="c.ANV" /><!-- only available for A, N, V -->
        </class>

        <class xml:id="mf" label="мж">
            <constrain id="c.NV" /><!-- only available for N, V -->
        </class>

        <class xml:id="mn" label="мс">
            <constrain id="NUM" /><!-- only available for NUM (since it's just one mention and it is not reused, we point directly to the class instead of going via a constraint) -->
        </class>
    </subset>
...

luutuntin commented 5 years ago

Thank you.

proycon commented 5 years ago

This has been implemented and released now. Constraints will be validated as part of deep validation when running foliavalidator.

luutuntin commented 5 years ago

Today, I set up a new instance of FLAT with the most up-to-date configuration (i.e. FoLiA-Linguistic-Annotation-Tool-0.8.0, FoLiA-tools-2.1.1, folia-2.1.0, foliadocserve-0.7.0), and encountered the following error when trying to upload this file, whose set definition for pos-annotation contains several constraints: [screenshot: folia_2 1 0]. I encountered the same error when trying to upload an older file, whose set definition for pos-annotation does not contain any constraints. I managed to upload this older file after downgrading FoLiA-tools to 2.0.7 and folia to 2.0.8.

proycon commented 5 years ago

The library stumbled over the '+' in "Mystem+" when trying to create an ID from it (XML NCName IDs don't allow plus signs). This was a bug in foliapy, which should now be fixed (v2.1.1). The issue is unrelated to the constraints.
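For anyone generating IDs from free-form names themselves, a check along the following lines can catch characters such as '+' early. This is a simplified Python sketch and not how foliapy fixed the bug: the pattern only covers the ASCII part of the NCName production (the real one also admits many non-ASCII letters, so IDs like N_род would be wrongly rejected here), and the helper names `is_ncname_ascii` and `sanitize_id` are made up for illustration.

```python
import re

# ASCII-only approximation of an XML NCName: a letter or underscore,
# followed by letters, digits, '.', '-' or '_'. No colons, no '+'.
NCNAME_ASCII = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")


def is_ncname_ascii(value):
    """True if `value` is a valid NCName under the ASCII-only approximation."""
    return NCNAME_ASCII.fullmatch(value) is not None


def sanitize_id(value):
    """Replace disallowed characters with '_' and fix an invalid first character."""
    cleaned = re.sub(r"[^A-Za-z0-9._-]", "_", value)
    if not cleaned or not re.match(r"[A-Za-z_]", cleaned):
        cleaned = "_" + cleaned
    return cleaned


print(is_ncname_ascii("Mystem+"))
print(sanitize_id("Mystem+"))
```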

Two things I noticed btw:

And last: