proycon / folia

FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange. Note that the actual Python library for processing FoLiA is implemented as part of PyNLPl; this repository contains higher-level tools that use the library, as well as the full documentation, validation schemas, and set definitions.
http://proycon.github.io/folia/
GNU General Public License v3.0

Tagging mechanism to aid processors #93

Closed proycon closed 3 years ago

proycon commented 3 years ago

I propose we introduce a generic tag attribute that lets people tag any FoLiA element. Its value is a space-delimited list of tags drawn from an open, tool-specific vocabulary. FoLiA tools can use these tags to help their processing. We're essentially encoding an extra 'cue' in the FoLiA to help another tool do its job. Such a cue may be needed because the information is not yet present in the FoLiA, or because it is encoded in too complex a way for the other tool to unravel.

A use case emerged from #88 where we need cues in untokenised FoLiA text to help the tokeniser determine where to force a token boundary:

<t>
  <t-str>item1<t-style tag="token"><feat class="superscript" subset="font_typeface"/>2</t-style></t-str><t-str>something</t-str>
</t>
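To make the use case concrete, here is a minimal sketch of how a tokeniser might read such a cue. It uses only Python's standard `xml.etree.ElementTree` rather than the FoLiA library, and the `segments` helper is hypothetical. It also assumes that a `token` tag forces a boundary on both sides of the tagged element, which the proposal does not pin down.

```python
import xml.etree.ElementTree as ET

# The snippet from this issue, without the FoLiA namespace (a simplification).
SNIPPET = """<t>
  <t-str>item1<t-style tag="token"><feat class="superscript" subset="font_typeface"/>2</t-style></t-str><t-str>something</t-str>
</t>"""


def segments(tstr):
    """Split the text of one <t-str> at children whose tag list contains 'token'.

    Hypothetical helper: assumes a tagged child forms a segment of its own.
    """
    parts = [tstr.text or ""]
    for child in tstr:
        text = "".join(child.itertext())
        if "token" in child.get("tag", "").split():
            parts.append(text)              # forced boundary before the child
            parts.append(child.tail or "")  # and after it
        else:
            parts[-1] += text + (child.tail or "")
    # Drop empty or whitespace-only segments.
    return [p for p in parts if p.strip()]


root = ET.fromstring(SNIPPET)
result = [segments(tstr) for tstr in root.iter("t-str")]
print(result)
```

A real tokeniser would still apply its own rules inside each segment; the cue only guarantees that `item1` and `2` are not merged into one token.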

We can also imagine a tool A that 'tags' specific elements according to some complex search criteria, and a tool B that then operates on all elements carrying a particular tag. Here tags would help keep the two tools separate and specialised (Unix philosophy).
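The tool A / tool B split could be sketched roughly as follows. Everything here is an assumption for illustration: the simplified namespace-free document, the `review` tag, the selection criterion, and the helper names `get_tags`, `add_tag`, `tool_a` and `tool_b`. None of it is part of the FoLiA library.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for a FoLiA fragment (no namespace, plain id attributes).
DOC = """<s>
  <w id="w1" class="WORD">item</w>
  <w id="w2" class="NUMBER">1</w>
  <w id="w3" class="NUMBER" tag="seen">2</w>
</s>"""


def get_tags(elem):
    """Return the space-delimited tag list of an element (empty if absent)."""
    return elem.get("tag", "").split()


def add_tag(elem, tag):
    """Add a tag without disturbing tags set by other tools."""
    tags = get_tags(elem)
    if tag not in tags:
        tags.append(tag)
    elem.set("tag", " ".join(tags))


def tool_a(root):
    """Tool A: mark elements matching some search criterion (here: class NUMBER)."""
    for w in root.iter("w"):
        if w.get("class") == "NUMBER":
            add_tag(w, "review")


def tool_b(root):
    """Tool B: act on tagged elements only, knowing nothing of A's criteria."""
    return [w.get("id") for w in root.iter() if "review" in get_tags(w)]


root = ET.fromstring(DOC)
tool_a(root)
selected = tool_b(root)
print(selected)
```

Because the value is a list, tool A appends its tag next to the existing `seen` tag on `w3` instead of overwriting it.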

The tags carry no intrinsic meaning for the FoLiA representation whatsoever (we have class for that already); they are merely signals to further tools in the processing chain.

We could then use a value like token or separate for the tokenisation cues, as in the tokenisation example above.

proycon commented 3 years ago

Small addition: I think we should encourage processors to clean up the tags they 'consume', leaving the resulting FoLiA as clean as possible.
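A minimal sketch of what 'consuming' a tag could look like, again with plain `xml.etree.ElementTree`; the `consume_tag` helper is hypothetical and not an existing library function. It removes only the tag the processor handled and drops the attribute entirely once the list is empty.

```python
import xml.etree.ElementTree as ET


def consume_tag(elem, tag):
    """Remove one tag from an element's tag list; return True if it was present.

    Other tools' tags are left untouched, and the attribute is deleted
    when no tags remain so the output carries no empty tag attribute.
    """
    tags = elem.get("tag", "").split()
    if tag not in tags:
        return False
    remaining = [t for t in tags if t != tag]
    if remaining:
        elem.set("tag", " ".join(remaining))
    else:
        del elem.attrib["tag"]
    return True


elem = ET.fromstring('<t-style tag="token review">2</t-style>')
first = consume_tag(elem, "token")     # the tokeniser consumes its own cue
after_first = elem.get("tag")
second = consume_tag(elem, "review")   # a later tool consumes the last tag
print(ET.tostring(elem, encoding="unicode"))
```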