proycon / folia

FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange. Note that the actual Python library for processing FoLiA is implemented as part of PyNLPl; this repository contains higher-level tools that use the library, as well as the full documentation, validation schemas, and set definitions.
http://proycon.github.io/folia/
GNU General Public License v3.0

Allow setless and set-holding annotation types to coexist. #74

Closed: proycon closed this 4 years ago

proycon commented 5 years ago

Continuation of LanguageMachines/ucto#70: Certain annotation types can be setless or carry a set, but the two variants can't co-exist. It seems we do have a use case for this now, though, and the best solution is probably to make sure FoLiA can handle it.

I do think it should be possible; we can allow setless and set-holding declarations at the same time:

<chunk-annotation set="something" />
<chunk-annotation />

As we have multiple sets, there is no default. In the current implementation, I think a <chunk> annotation without an explicit set would produce a "set required" error (I'll have to check this). But it is not ambiguous (it would pertain to the 2nd, setless declaration), so it could be made valid.
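The resolution rule proposed above can be sketched in a few lines of Python. This is only an illustration using the standard library, not the actual FoLiA library code: the function names `declared_sets` and `resolve_set` are hypothetical, the `<annotations>` wrapper is there only so the two declarations parse as one fragment, and the fallback to a single declared set as the default is an assumption based on the remark that multiple sets mean there is no default.

```python
import xml.etree.ElementTree as ET

SETLESS = None  # marker for a declaration without a set attribute


def declared_sets(declarations_xml, tag="chunk-annotation"):
    """Return the set of each declaration of `tag`; SETLESS if it has none."""
    root = ET.fromstring(declarations_xml)
    return [el.get("set") for el in root.iter(tag)]


def resolve_set(declared, explicit_set=None):
    """Decide which declaration an annotation pertains to (hypothetical rule)."""
    if explicit_set is not None:
        # An explicit set must match one of the set-holding declarations.
        if explicit_set not in declared:
            raise ValueError(f"set {explicit_set!r} is not declared")
        return explicit_set
    if SETLESS in declared:
        # No set on the annotation: it pertains to the setless declaration.
        return SETLESS
    if len(declared) == 1:
        # Assumption: a single set-holding declaration acts as the default.
        return declared[0]
    raise ValueError("set required: no setless declaration and no default")


declared = declared_sets(
    '<annotations>'
    '<chunk-annotation set="something" />'
    '<chunk-annotation />'
    '</annotations>'
)
```

With both declarations present, `resolve_set(declared)` returns `SETLESS` and `resolve_set(declared, "something")` returns `"something"`, so neither case is ambiguous. Only two or more set-holding declarations without a setless one leave an annotation without a set unresolved.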

proycon commented 5 years ago

I added some test documents: a and c are fairly trivial; b tests the situation with both a setless and a set-holding type. All documents already validate with foliavalidator (but I'll have to verify that example b does the right thing internally):

https://download.anaproy.nl/issue_folia_74_a.folia.xml
https://download.anaproy.nl/issue_folia_74_b.folia.xml
https://download.anaproy.nl/issue_folia_74_c.folia.xml

proycon commented 5 years ago

(somewhat related: proycon/flat#150)

kosloot commented 5 years ago

Ok, this raised a few questions concerning default processors and set names. I constructed 4 new examples:

A: set_and_setless_explicit_a.2.1.0.folia.xml.txt
B: set_and_setless_explicit_b.2.1.0.folia.xml.txt
C: set_and_setless_explicit_c.2.1.0.folia.xml.txt
D: set_and_setless_explicit_d.2.1.0.folia.xml.txt

foliavalidator accepts only C; folialint accepts all but B.

So which is right? I assume folialint.

Arguments:

A is correct, as processor P1 is defined and related to the empty set.
B is wrong, as processor P1 is NOT related to the set 'chunkset'.
C is correct, as processor P2 IS related to 'chunkset'.
D is correct, as P1 is related to the empty set AND P2 to 'chunkset'.
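The arguments above boil down to one check: an annotation is acceptable only if its processor is related to the set it uses, where the set may be empty. A minimal Python sketch of that check follows. The declarations and annotations here are hypothetical reconstructions from the four arguments, not the contents of the attached example files, and `processor_matches_set` is an illustrative name, not a function from foliavalidator or folialint.

```python
SETLESS = None  # stands for the empty set (a setless declaration)


def processor_matches_set(declarations, annotation_set, processor):
    """True if `processor` is related to `annotation_set` in the declarations."""
    return processor in declarations.get(annotation_set, set())


# Hypothetical cases: (declarations, [(set used, processor used), ...])
cases = {
    "A": ({SETLESS: {"P1"}}, [(SETLESS, "P1")]),
    "B": ({SETLESS: {"P1"}}, [("chunkset", "P1")]),
    "C": ({"chunkset": {"P2"}}, [("chunkset", "P2")]),
    "D": (
        {SETLESS: {"P1"}, "chunkset": {"P2"}},
        [(SETLESS, "P1"), ("chunkset", "P2")],
    ),
}

# A document is accepted only if every annotation passes the check.
results = {
    name: all(processor_matches_set(decl, s, p) for s, p in annotations)
    for name, (decl, annotations) in cases.items()
}
```

Under these assumed inputs the check accepts A, C and D and rejects B, which matches the verdicts argued for above.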

proycon commented 5 years ago

A is correct, as processor P1 is defined and related to the empty set

Agreed

B is wrong, as processor P1 is NOT related to the set 'chunkset'

Agreed

C is correct, as processor P2 IS related to 'chunkset'

Agreed

D is correct, as P1 is related to the empty set AND P2 to 'chunkset'

Agreed