Closed proycon closed 5 years ago
Hi Maarten,
I wonder how we can express the constraints using the old more ad-hoc XML format as you mentioned above.
For example, in our current set definition of POS annotation, there is no explicit connection between a part-of-speech class (e.g. NUM) and its morphological feature subsets (e.g. NUM_падеж and NUM_род) or between a group of part-of-speech classes (e.g. A, ANUM and APRO) and their morphological feature subsets (e.g. A_падеж, A_число, A_форма, A_род, A_степень-сравнения and A_у-л). Could you show me how to make this kind of connection explicit in our set definition?
Best, Alex
Hi Alex,
Whilst we did define some ways to define constraints in the legacy format, this was never really used nor implemented in the FoLiA libraries. So that's probably not the way to go. I see your problem, and I guess you want something that resonates in the FLAT interface? We should then work on this issue.
Thank you very much! You read my mind. We should definitely work on this.
Here's a proposal for a new constraint mechanism in the FoLiA set definitions (it replaces the old mechanism that was once documented but never implemented nor used anywhere). It allows for defining constraints on which subsets can be used with which classes and which classes within subsets can be combined. Possibly relevant for @JesseDeDoes and @menzowindhouwer as well, as we had some previous discussions on this.
FoLiA Set Definitions are currently defined in RDF using SKOS. SKOS has no mechanism to express such constraints. Whilst it could probably be done in OWL or SHACL, set definitions merely define the vocabulary, and it is the class instances in the FoLiA documents (which are not RDF) that use the vocabulary and need to be validated against the constraints. So I want to opt for a simple yet flexible solution without needing a full semantic OWL parser.
I want to implement a new fsd:Constraint class that holds one or more fsd:constrain relations that refer to the ID of 1) a subset (skos:Collection), 2) a class (skos:Concept) or 3) a constraint (fsd:Constraint). The latter allows for nesting and complex constraints. The Constraint class would have a type relation that can be set to 1) any (match any, i.e. a disjunction), 2) all (conjunction) or 3) none (exclusion).
From individual sets/subsets (skos:Collection) and classes (skos:Concept), a constrain relation may be made.
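To make the intended semantics of any/all/none and of nesting concrete, here is a minimal Python sketch. It is not the FoLiA library's implementation: the Constraint dataclass, the satisfied function, and the extra constraints c.not_verb and c.nested are hypothetical names made up for illustration, and the sketch assumes constraint references contain no cycles.

```python
from dataclasses import dataclass, field


@dataclass
class Constraint:
    """Hypothetical in-memory form of an fsd:Constraint (illustration only)."""
    id: str
    type: str  # "any", "all" or "none"
    targets: list = field(default_factory=list)  # IDs of classes, subsets or constraints


def satisfied(ref, constraints, active):
    """Return True if `ref` holds, given the set of class/subset IDs in use.

    `ref` is either the ID of a constraint (evaluated recursively) or the ID
    of a class/subset (true when it is in `active`). Assumes no reference cycles.
    """
    if ref in constraints:
        constraint = constraints[ref]
        results = [satisfied(t, constraints, active) for t in constraint.targets]
        if constraint.type == "any":   # disjunction
            return any(results)
        if constraint.type == "all":   # conjunction
            return all(results)
        if constraint.type == "none":  # exclusion
            return not any(results)
        raise ValueError(f"unknown constraint type: {constraint.type}")
    return ref in active


# Hypothetical constraints; only the shape of c.gender comes from this thread.
constraints = {
    "c.gender": Constraint("c.gender", "any", ["N", "A"]),
    "c.not_verb": Constraint("c.not_verb", "none", ["V"]),
    "c.nested": Constraint("c.nested", "all", ["c.gender", "c.not_verb"]),
}
```

With these definitions, satisfied("c.nested", constraints, {"A"}) holds, while adding "V" to the active set makes it fail because of the nested exclusion.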
Although the legacy format is deprecated, I'll implement it for that as well (since it's more accessible than RDF/SKOS for most users and I have a tool that converts it automatically anyway). I'll start with a Part-of-Speech example from @luutuntin there:
<set xml:id="birch_pos" type="closed" xmlns="http://ilk.uvt.nl/folia">
  <class xml:id="N" label="N (существительное)" />
  <constraint xml:id="c.N" type="any">
    <constrain id="N" />
  </constraint>
  <subset xml:id="N_род" type="closed">
    <class xml:id="m" label="муж" />
    <class xml:id="f" label="жен" />
    <class xml:id="mf" label="мж" />
    <class xml:id="n" label="сред" />
    <constrain id="c.N" />
  </subset>
  <subset xml:id="N_одушевленность" type="closed">
    <class xml:id="anim" label="од" />
    <class xml:id="inan" label="неод" />
    <constrain id="c.N" />
  </subset>
</set>
Both subsets here are constrained to be used when the main class is set to N. The constraint is defined once and used twice so we can avoid duplication, which makes sense in case of more complicated constraints (i.e. multiple constrain elements forming a real conjunction/disjunction). In this case, however, the whole case can be simplified:
<set xml:id="birch_pos" type="closed" xmlns="http://ilk.uvt.nl/folia">
  <class xml:id="N" label="N (существительное)" />
  <subset xml:id="N_род" type="closed">
    <class xml:id="m" label="муж" />
    <class xml:id="f" label="жен" />
    <class xml:id="mf" label="мж" />
    <class xml:id="n" label="сред" />
    <constrain id="N" />
  </subset>
  <subset xml:id="N_одушевленность" type="closed">
    <class xml:id="anim" label="од" />
    <class xml:id="inan" label="неод" />
    <constrain id="N" />
  </subset>
</set>
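For anyone who wants to inspect such a definition programmatically, the following sketch reads the simplified legacy XML above with Python's standard xml.etree.ElementTree and lists, per subset, the IDs its constrain elements point to. The function name subset_constraints is hypothetical and this is not how the FoLiA libraries load set definitions; it only illustrates the structure of the format.

```python
import xml.etree.ElementTree as ET

NS = "{http://ilk.uvt.nl/folia}"                          # namespace of the set definition
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"       # how ElementTree exposes xml:id

SET_DEFINITION = """<set xml:id="birch_pos" type="closed" xmlns="http://ilk.uvt.nl/folia">
  <class xml:id="N" label="N (существительное)" />
  <subset xml:id="N_род" type="closed">
    <class xml:id="m" label="муж" />
    <class xml:id="f" label="жен" />
    <constrain id="N" />
  </subset>
  <subset xml:id="N_одушевленность" type="closed">
    <class xml:id="anim" label="од" />
    <class xml:id="inan" label="неод" />
    <constrain id="N" />
  </subset>
</set>"""


def subset_constraints(xml_text):
    """Map each subset ID to the list of IDs referenced by its direct constrain children."""
    root = ET.fromstring(xml_text)
    result = {}
    for subset in root.iter(NS + "subset"):
        refs = [c.get("id") for c in subset.findall(NS + "constrain")]
        result[subset.get(XML_ID)] = refs
    return result


mapping = subset_constraints(SET_DEFINITION)
```

Here mapping associates both N_род and N_одушевленность with the single reference N, mirroring the statement that both subsets may only be used when the main class is N.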
An example of a slightly more complex constraint, a subset constrained to be used with either nouns or adjectives:
<class xml:id="N" label="Noun (существительное)" />
<class xml:id="A" label="Adjective (прилагательное)" />
<constraint xml:id="c.gender" type="any">
  <constrain id="N" />
  <constrain id="A" />
</constraint>
<subset xml:id="gender" type="closed">
  <class xml:id="m" label="муж" />
  <class xml:id="f" label="жен" />
  <class xml:id="n" label="сред" />
  <constrain id="c.gender" />
</subset>
Now the same thing in proper RDF/SKOS (with the custom extension in the fsd (FoLiA Set Definition) namespace):
example:N a skos:Concept ;
    skos:notation "N" ;
    skos:prefLabel "Noun (существительное)" .

example:A a skos:Concept ;
    skos:notation "A" ;
    skos:prefLabel "Adjective (прилагательное)" .

example:c.gender a fsd:Constraint ;
    fsd:type fsd:any ;
    fsd:constrain example:N ;
    fsd:constrain example:A .

example:gender a skos:Collection ;
    skos:member example:m, example:f, example:n ;
    fsd:constrain example:c.gender .
Constraints may also be specified on individual classes (of either subsets or sets), which is only useful in case they can't be specified on the subset/set level as a whole:
<subset xml:id="gender" type="closed">
  <class xml:id="m" label="муж">
    <constrain id="c.gender" />
  </class>
  ...
</subset>

example:m a skos:Concept ;
    skos:notation "m" ;
    skos:prefLabel "муж" ;
    fsd:constrain example:c.gender .
It will be up to the FoLiA library to implement the necessary deep validation behaviour. The caveat is that it's up to the set definition provider to make sure the constraints provided make sense, as it is possible to define contradictory constraints.
Any thoughts or comments?
Does this mean that fsd:constrain relations don't have their own IDs? And if that is the case, how can we refer to an fsd:constrain relation in an fsd:Constraint that holds multiple fsd:constrain relations?
I'm considering the following example:
...
<class xml:id="A" label="Adjective (прилагательное)" />
<class xml:id="N" label="Noun (существительное)" />
<class xml:id="NUM" label="NUM (числительное)" />
<class xml:id="V" label="V (глагол)" />
...
<constraint xml:id="c.gender" type="any">
  <constrain id="A" />
  <constrain id="N" />
  <constrain id="NUM" />
  <constrain id="V" />
</constraint>
...
<subset xml:id="gender" type="closed">
  <class xml:id="m" label="муж" /> <!-- only available for A, N, V -->
  <!-- constrain ??? -->
  <class xml:id="f" label="жен" />
  <constrain id="c.gender" />
  <class xml:id="n" label="сред" /> <!-- only available for A, N, V -->
  <!-- constrain ??? -->
  <class xml:id="mf" label="мж" /> <!-- only available for N, V -->
  <!-- constrain ??? -->
  <class xml:id="mn" label="мс" /> <!-- only available for NUM -->
  <!-- constrain ??? -->
</subset>
...
The constrain properties don't have IDs, no; they refer to IDs instead. A Constraint is the entity that holds one or more constrain relations and has an ID.
So elaborating on your example, constraining individually on all of the classes of the subset, you'd get something like:
<constraint xml:id="c.ANV" type="any">
  <constrain id="A" />
  <constrain id="N" />
  <constrain id="V" />
</constraint>
<constraint xml:id="c.NV" type="any">
  <constrain id="N" />
  <constrain id="V" />
</constraint>
<subset xml:id="gender" type="closed">
  <class xml:id="m" label="муж">
    <constrain id="c.ANV" /> <!-- only available for A, N, V -->
  </class>
  <class xml:id="f" label="жен">
    <constrain id="c.ANV" /> <!-- only available for A, N, V -->
  </class>
  <class xml:id="n" label="сред">
    <constrain id="c.ANV" /> <!-- only available for A, N, V -->
  </class>
  <class xml:id="mf" label="мж">
    <constrain id="c.NV" /> <!-- only available for N, V -->
  </class>
  <class xml:id="mn" label="мс">
    <constrain id="NUM" /> <!-- only available for NUM (since it's just one mention and it is not reused, we point directly instead of via a constraint) -->
  </class>
</subset>
...
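As a rough illustration of what these per-class constraints mean for an annotation, the sketch below transcribes c.ANV, c.NV and the direct NUM reference from the example above into plain Python tables and checks whether a gender class may be combined with a given part-of-speech class. The names CONSTRAINTS, GENDER_CLASS_REFS and gender_allowed are hypothetical, and only type="any" constraints are modelled; the actual deep validation in the FoLiA library may work differently.

```python
# Hypothetical tables transcribed by hand from the XML example above.
# Each constraint has type="any", so it is modelled as a set of allowed POS classes.
CONSTRAINTS = {
    "c.ANV": {"A", "N", "V"},
    "c.NV": {"N", "V"},
}

# Which ID each gender class points to via its constrain element.
GENDER_CLASS_REFS = {
    "m": "c.ANV",
    "f": "c.ANV",
    "n": "c.ANV",
    "mf": "c.NV",
    "mn": "NUM",  # direct reference to a class instead of a constraint
}


def gender_allowed(pos_class, gender_class):
    """Return True if `gender_class` may be used when the main class is `pos_class`."""
    ref = GENDER_CLASS_REFS[gender_class]
    # A reference to a constraint expands to its members; a direct class
    # reference behaves like a one-element "any" constraint.
    allowed = CONSTRAINTS.get(ref, {ref})
    return pos_class in allowed
```

For example, gender_allowed("A", "m") holds, whereas gender_allowed("A", "mf") does not, since мж is restricted to N and V.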
Thank you.
This has been implemented and released now; constraints will be validated as part of deep validation when running foliavalidator.
Today, I set up a new instance of FLAT with the most recent configuration (i.e. FoLiA-Linguistic-Annotation-Tool-0.8.0, FoLiA-tools-2.1.1, folia-2.1.0, foliadocserve-0.7.0), and encountered the following error when trying to upload this file, whose set definition for pos-annotation contains several constraints:
I encountered the same error when trying to upload an older file, whose set definition for pos-annotation does not contain any constraints. I managed to upload this older file after downgrading FoLiA-tools to 2.0.7 and folia to 2.0.8.
The library stumbled over the '+' in "Mystem+" when trying to create an ID from it (XML NCName IDs don't allow plus signs). This was a bug in foliapy, which should now be fixed (v2.1.1). The issue is unrelated to the constraints.
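To illustrate the kind of problem involved, here is a small Python sketch that checks whether a string looks like a valid NCName and replaces offending characters such as '+'. This is not the fix that went into foliapy: the names is_ncname_like and sanitize_id are hypothetical, and the regular expression is only an approximation of the NCName production in the XML Namespaces specification.

```python
import re

# Approximation of an NCName: starts with a letter or underscore, followed by
# letters, digits, underscores, dots or hyphens. No colons, no plus signs.
NCNAME_APPROX = re.compile(r"[^\W\d][\w.\-]*")


def is_ncname_like(text):
    """Return True if `text` matches the approximate NCName pattern."""
    return NCNAME_APPROX.fullmatch(text) is not None


def sanitize_id(text):
    """Replace characters outside the approximate NCName alphabet with '_'.

    Hypothetical helper: prefixes '_' when the result is empty or starts with
    a character that may not begin an NCName (e.g. a digit).
    """
    cleaned = re.sub(r"[^\w.\-]", "_", text)
    if not cleaned or not is_ncname_like(cleaned[0]):
        cleaned = "_" + cleaned
    return cleaned
```

Under this approximation, is_ncname_like("Mystem+") is False and sanitize_id("Mystem+") yields "Mystem_".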
Two things I noticed btw:
foliaupgrade), there were some missing declarations. This was probably caused by your script loading an older FoLiA v1 document and then modifying it and saving it as v2? I'd suggest running foliaupgrade on any FoLiA v1 documents first.
And last:
Continuation of INL/nederlab-linguistic-enrichment#17, @JessedeDoes wrote:
The old more ad-hoc XML format had facilities for this but we need modern RDF ones now and don't have any yet. Parsing these constraints would be needed for e.g. Frog (see LanguageMachines/frog#51), preferably without needing a complete OWL logic parser in Frog I'd say..