proycon / folia

FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange. Note that the actual Python library for processing FoLiA is implemented as part of PyNLPl; this repository contains higher-level tools that use the library, as well as the full documentation, validation schemas, and set definitions.
http://proycon.github.io/folia/
GNU General Public License v3.0

Add facilities for metadata on sub-parts of a document #30

Closed proycon closed 7 years ago

proycon commented 7 years ago

Allow metadata on structural elements (e.g. a part of the document rather than the whole document).

This is not explicitly possible yet (although it could be handled with a foreign-data block); a more explicit solution is desired.

Two proposed approaches: 1) Keep everything in the main metadata section, with references to elements (favoured by Katrien); this prevents duplication and keeps all metadata in one place, but requires a referencing mechanism. 2) Allow <metadata> blocks in the structural elements themselves (e.g. in paragraphs, sentences, or whatever structural unit is desired).

FoLiA's native metadata scheme is deliberately very simple; approach 1 would add some complexity to it. The idea behind keeping things simple is that we focus on annotations rather than metadata, as there are other schemes already in existence which handle metadata (e.g. CMDI, Dublin Core) and we didn't want to reinvent that wheel. However, referring to sub-parts is a valid FoLiA matter, and existing metadata schemes won't have facilities for it (as they're independent of FoLiA). Such a solution would have to be usable in combination with whatever metadata scheme is used.

Approach 2 would perhaps be the most straightforward and would tie in easily with other metadata schemes (as the metadata block may simply contain foreign-data and use CMDI, DC, or whatever). The more inline solution fits the FoLiA paradigm better at first glance, but raises the important issue of unnecessary duplication in case a block of metadata applies to various sub-parts.

The two approaches need not even be mutually exclusive: both could be implemented and the choice deferred to the user. But this would introduce extra complexity for tools that don't know what to expect, and FoLiA aims to be rather specific to prevent that.

The boundary between metadata and annotation is not always a clear-cut one; whatever mechanism we introduce should not be used for linguistic annotation.

JessedeDoes commented 7 years ago

Unfortunately, things may get even messier. Metadata on structural elements is not always a possible solution: metadata may cross the boundaries of structural elements.

This is why, in TEI, we assign metadata to arbitrary text ranges.

Example:

<p> We were attacked by a giant <milestone xml:id="m0"/>dog </p> <p>with enormous <milestone xml:id="m1"/> teeth</p>

Here, the text from milestone m0 to milestone m1 may for instance be supplied by another author.

Katrien may be able to come up with a more realistic example.

kosloot commented 7 years ago

I would suggest a third approach: Allow references in structure elements that refer back to meta-data either internal or external to the document. Document-internal meta-data is stored once in the meta-data section (which might need some extensions). External data can be anywhere, using a URL. FoLiA itself shouldn't have knowledge about what is in the meta-data; it just provides you with the link or the XML fragment. Several links can of course refer to the same meta-data fragment.

Maybe it is better to handle external links indirectly: store those in the meta-data section too, and use internal links to refer to these links. This keeps all meta-data in one place.

One obvious drawback: finding all structure nodes that 'belong' to the same meta-data. But that is easily solved by creating an index on the fly when parsing the document. A simple API extension could then deliver all nodes connected to a certain meta-data part.
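A minimal sketch of such an on-the-fly index, using only Python's standard library rather than the actual FoLiA library. It assumes the reference is carried by an attribute named `metadata` (the name used in the proposal further down this thread) and uses a simplified, namespace-free excerpt; `build_metadata_index` is a hypothetical helper, not an existing API.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Simplified, namespace-free excerpt (assumption: real FoLiA documents
# declare a namespace, which is left out here for brevity).
DOC = """<FoLiA>
  <text>
    <p metadata="metadata.1"><t>Het volgende vers komt uit Hamlet:</t></p>
    <s metadata="metadata.2"><t>To be, or not to be, that is the question:</t></s>
    <s metadata="metadata.2"><t>Whether 'tis nobler in the mind to suffer</t></s>
  </text>
</FoLiA>"""


def build_metadata_index(root):
    """Map each referenced metadata id to the elements that point at it."""
    index = defaultdict(list)
    for element in root.iter():  # one pass over the tree, in document order
        ref = element.get("metadata")
        if ref is not None:
            index[ref].append(element)
    return index


index = build_metadata_index(ET.fromstring(DOC))
# Reduce the index to tag names so it is easy to inspect.
tags_by_metadata = {ref: [el.tag for el in els] for ref, els in index.items()}
```

Because the index is built in a single traversal, the lookup "all nodes connected to a certain meta-data part" becomes a dictionary access afterwards.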

JanOdijk commented 7 years ago

Maarten/Ko, Can we also discuss what the CHILDES CHAT format allows in this respect and check that FoLiA covers these (or explicitly decide that it does not cover all of these or even none of them). There are some (I think) simple things such as speaker changing with every utterance (hence also the associated speaker characteristics) but there are also annotations on separate tiers and inside the transcription tier (speaking errors, phonetic representation of the pronunciation, hesitations, false starts and retracing, etc etc). I consider these mostly annotations, not metadata, but this distinction is not sharp and certainly not generally accepted.

For the short term I would like to know how to deal with the speaker (and speaker characteristics) changing with every new utterance in FoLiA.

Jan

proycon commented 7 years ago

@JanOdijk: Speaker information I indeed also consider annotation and not metadata. It is already catered for in FoLiA (there is a generic speaker attribute which can be used on structural elements in a speech context), so it should not pose a problem, although it has not been extensively tested in practice yet. Hesitations, false starts, and retracing might need a new FoLiA element; there already is a distortion element in a speech context, but that might not be appropriate for those. We should discuss it in a new issue when the need arises; I don't consider it metadata in the sense of this issue.

hennie commented 7 years ago

I like Ko's suggestion. One of the (Nederlab) cases that gave rise to this issue is the need for author identification at the level of text segments. Nederlab authors are complex entities with their own metadata and associations with titles, external to the FoLiA texts. And they have a unique identifier that can be used to refer to them.

kdepuydt commented 7 years ago

The starting point for a diachronic corpus is that every word in the text gets the correct metadata: correct author, correct dating... We try to do this as neatly as possible for the texts we deliver for Nederlab. However, the facilities for this are still lacking, both in FoLiA and in the database.

Maarten has worded my question well, and of course I would like a mechanism in which I maintain metadata in one place and refer to the parts of the text to which that "deviating" metadata applies. However, those "deviations" can manifest themselves in different ways.

Easiest variant: Pieter van Dam writes a history and, in the appendices to the chapters, includes documents (texts) that are not by his hand. These appendices have their own metadata. This is a simple case, because there are corresponding structural elements.

Variant 2: In the small subproject (15th- and 16th-century corpus) it already gets more complex. A manorial register (hofboek) with notes can simply give a different date for the next sentence, which then holds for a while until a new date comes along. You do not want to turn these into separate texts, but to add metadata to the text in such a way that you can say: from here up to and including there, this metadata applies. Here it is a matter of date, but for other documents you also have an indication of the hand (i.e. a change of author).

pag. 108- Ahoff Aº XX (=1520) Wessel then Horne betalt peper xvi pond und lersen Aº XXI op dach Divisionis Aplorum It Herbert to Holthuijs maegt to Hijginck oir hoffrecht nijet verwaert gewijset Bernt Mijrt up genaden des herrn It Egbert ten Kreijll wonende up dessen guijt sal in XIIII daeghen komen und betalen sijn gerechticheijt beij de Joncher und Drosten und sich dan ingeliveren lathen nae haves rechte hort in den hoff to Mijste. lijse sijn huijsfrouwe. It Nale ten culve gehijlickt bijnnen Bocholt staet tot bewijsnisse und vragen off sije in echte staat sijnt. Hoffgerichte Tegeder Tegeders hoff It Gebele wijll in XIIII scheijden van den Drosten, sijn wijff peper ende wass. Solvit peper und wass It Aº XXII Nale ten Wijnckell gehillickt ongescheiden. It Aº XXIII. herbert to Holthusen betalt 1 £ pepers Henricus Vaget, Henricus, Willem Portener, Kerstgen Wijbbolt, Bernt van Mijste, Dexx ten hurne, Egbert Elkijnck, Tebbe Smorckens, Gerbelt Smijt, Johann Gelijnck, Hermen Weert, Herman Stoteler, Bernt Bolijnck, Dirck Meerden, Schulte ten Ahave, Roert, Schulte Buckel, Essel Snaben Smedijnck It Roerdinck den schadeloiss brieff 't maeken It dije Gijldemesters uthn Wolde hebben benompt It Kaeten benompt It Raetman benompt It Huppel Henxsell und Raetman

Variant 3: an additional problem. In a text edition, some words or sentences are additions by an editor. TEI has encoding for that (resp=editor, attached to a structural element; or add resp=editor, del resp=editor...). Think also of the footnote section of texts, which usually belongs to the editor.

In the conversion to FoLiA this information currently disappears, also in the files that have already been delivered. Strictly speaking, that is not right.

When we at the INT started working with a selection of the DBNL a number of years ago (and we use TEI), we chose a single mechanism for variants 1 and 2, namely placing milestones (tags that can occur anywhere in the structure, with which we can mark the beginning and end of a passage); in the header with metadata, that looked as follows:

The example comes from Bredero's Liedboek, which has poems by other authors at the front: first comes the metadata that applies to the whole, and after that the metadata specific to passages of text.

</p>
<listBibl id="inlMetadata">
 <bibl id="dbnl-bred001groo01_01">
  <interpGrp type="title.level1"><interp value="Groot lied-boeck" type="main"/></interpGrp>
  <interpGrp type="author.level1"><interp value="G.A. Bredero"/></interpGrp>
  <interpGrp type="editor"><interp value="editie G. Stuiveling e.a."/></interpGrp>
  <interpGrp type="date.publication"><interp value="1975"/><interp value="1983"/><interp value="1979"/></interpGrp>
  <interpGrp type="dbnl-datumcontrole"><interp value="G.A. Bredero, Groot lied-boeck, 3 delen, editie G. Stuiveling, A. Keersmaekers, C.F.P. Stutterheim, F. Veenstra en C.A. Zaalberg (deel I); G. Stuiveling, A. Keersmaekers, C.F.P. Stutterheim, F. Veenstra, C.A. Zaalberg en P.J.J. van Thiel (deel II) en F.H. Matter (deel III). Tjeenk Willink-Noorduijn, Culemborg 1975 (deel I) / Martinus Nijhoff, Leiden 1983 (deel II) / Tjeenk Willink-Noorduijn, Den Haag 1979 (deel III)  "/></interpGrp>
  <interpGrp type="idno"><interp value="dbnl-bred001groo01_01"/></interpGrp>
 </bibl>
</listBibl>
<listBibl id="dbnl-specific-metadata" default="NO">
 <bibl id="interp_bred001groo01_1" default="NO">
  <interpGrp type="textYear_from"><interp value="1616"/></interpGrp>
  <interpGrp type="textYear_to"><interp value="1616"/></interpGrp>
  <interpGrp type="witnessYear_from"><interp value="1622"/></interpGrp>
  <interpGrp type="witnessYear_to"><interp value="1622"/></interpGrp>
  <interpGrp type="authors"><interp value="G.A. Bredero"/></interpGrp>
  <biblScope>
   <xref from="milestone_bred001groo01_bo_1" to="milestone_bred001groo01_eo_1" targOrder="U"/>
  </biblScope>
 </bibl>
 <bibl id="interp_bred001groo01_2" default="NO">
  <interpGrp type="textYear_from"><interp value="1616"/></interpGrp>
  <interpGrp type="textYear_to"><interp value="1616"/></interpGrp>
  <interpGrp type="witnessYear_from"><interp value="1622"/></interpGrp>
  <interpGrp type="witnessYear_to"><interp value="1622"/></interpGrp>
  <interpGrp type="authors"><interp value="C. Aerssens"/></interpGrp>
  <biblScope>
   <xref from="milestone_bred001groo01_bo_2" to="milestone_bred001groo01_eo_2" targOrder="U"/>
  </biblScope>

We did not want to work with a mix of ids on structural elements in one case and milestones in the case where the metadata did not coincide with a structural element.

I do understand the metadata / annotation discussion: in principle, any information that says something about a word in the text is an annotation. However, in terms of content I think a distinction can certainly be made between the types of information that are supplied as enrichment, and I would therefore still separate metadata from other types of annotation.

proycon commented 7 years ago

Proposal for submetadata

Thanks for all the feedback. Here is a proposal mostly in line with Ko's suggestion, and hopefully accommodating everybody's needs:

So this implies that all metadata is together in the document header; if there are references to external metadata sources, then these are also explicit in the header. The references, however, flow from the document to the header rather than vice versa. This is in line with the FoLiA principle to keep things as local as possible, allowing people to readily identify whether the particular section they are looking at has particular submetadata associated with it. It also facilitates the job of simple parsers, which can quickly obtain all elements a submetadata block applies to with an XPath expression.

In anticipation of certain questions: the milestone approach is interesting but would have some problems for FoLiA. Milestones would occur either INSIDE the text (inside <t>) or between structural elements. The latter renders milestones unnecessary, as it would imply there are structural elements which cover the content anyway. The former is problematic because there can be multiple text layers (think of e.g. historical layer vs. modernized), no text layers at all (think of speech), or redundancy in text layers (expressed at multiple levels). Moreover, the current proposal allows (sub)metadata to be associated with anything, not just text, hopefully preventing any future situation where we find that we can't sufficiently express metadata.

An example excerpt (details omitted) of how this would look:

<FoLiA>
<metadata>
 <annotations>...</annotations>
 <submetadata xml:id="metadata.1" type="native">
   <meta id="author">proycon</meta>
   <meta id="language">nld</meta>
 </submetadata>
 <submetadata xml:id="metadata.2" type="native">
   <meta id="author">Shakespeare</meta>
   <meta id="language">eng</meta>
 </submetadata>
</metadata>
<text>
 <p metadata="metadata.1">
   <t>Het volgende vers komt uit Hamlet:</t>
 </p>
 <p metadata="metadata.2">
  <s><t>To be, or not to be, that is the question:</t></s>
  <s><t>Whether 'tis nobler in the mind to suffer<br/>The slings and arrows of outrageous fortune,<br/>Or to take Arms against a Sea of troubles,<br/> And by opposing end them:</t></s>
 </p>
</text>
</FoLiA>
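The lookup a simple parser would do on such a document can be sketched as follows, using Python's standard `xml.etree.ElementTree` (which supports a limited XPath subset) rather than the actual FoLiA library. The document is a shortened, namespace-free variant of the excerpt above, and `elements_for` and `fields_for` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# Shortened, namespace-free variant of the excerpt (assumption: a real
# FoLiA document would declare the FoLiA namespace on the root element).
DOC = """<FoLiA>
  <metadata>
    <submetadata xml:id="metadata.1" type="native">
      <meta id="author">proycon</meta>
      <meta id="language">nld</meta>
    </submetadata>
    <submetadata xml:id="metadata.2" type="native">
      <meta id="author">Shakespeare</meta>
      <meta id="language">eng</meta>
    </submetadata>
  </metadata>
  <text>
    <p metadata="metadata.1"><t>Het volgende vers komt uit Hamlet:</t></p>
    <p metadata="metadata.2">
      <s><t>To be, or not to be, that is the question:</t></s>
    </p>
  </text>
</FoLiA>"""

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

root = ET.fromstring(DOC)


def elements_for(root, metadata_id):
    """All elements that reference the given submetadata block."""
    return root.findall(f".//*[@metadata='{metadata_id}']")


def fields_for(root, metadata_id):
    """The native key/value fields of one submetadata block."""
    for block in root.iter("submetadata"):
        if block.get(XML_ID) == metadata_id:
            return {meta.get("id"): meta.text for meta in block.findall("meta")}
    return {}
```

Here `elements_for(root, "metadata.2")` yields the second paragraph, and `fields_for(root, "metadata.2")` resolves the reference back to the author and language stored once in the header.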

Since metadata can be associated with anything, arbitrary sub-parts of untokenised text can be selected with the existing <str> or <t-str> facilities and associated with metadata. Some redundancy takes place only when structural boundaries are crossed (the metadata reference might have to be repeated on multiple structural elements if there is no catch-all structure).

What do you think of this proposal? Does this cover all use-cases?

kdepuydt commented 7 years ago

Dear Maarten, I think we are almost there. Could you please explain how we should add metadata in variant 2 of my previous comment? There, there is no structural element I could attach the reference to.

proycon commented 7 years ago

I don't know what kind of structural elements you have in that particular example, but all cases can be made to work. In case you have something like sentence structure but no overarching paragraph, division or whatever, then the sentence would be the most appropriate level to associate the metadata with; you can simply refer to the metadata from each sentence. So building on my previous example, instead of:

 <p metadata="metadata.2">
  <s><t>To be, or not to be, that is the question:</t></s>
  <s><t>Whether 'tis nobler in the mind to suffer<br/>The slings and arrows of outrageous fortune,<br/>Or to take Arms against a Sea of troubles,<br/> And by opposing end them:</t></s>
 </p>

You can also do, for instance if there's no <p> or other structure to attach it to:

  <s metadata="metadata.2"><t>To be, or not to be, that is the question:</t></s>
  <s metadata="metadata.2"><t>Whether 'tis nobler in the mind to suffer<br/>The slings and arrows of outrageous fortune,<br/>Or to take Arms against a Sea of troubles,<br/> And by opposing end them:</t></s>

If you don't have sentences but words/tokens, then you can do it at that level. But it's most efficient to group things in bigger yet sensible structural units of course, whatever they may be.

If the whole text is part of a big untokenised chunk of text for which any further structure has not yet been determined, then you can use the <str> or <t-str> elements to mark any arbitrary parts of it (see section 2.10.13 of the FoLiA documentation). But the use of proper structure elements is always preferred if possible, and is a requirement for deeper linguistic annotation! An example of this scenario:

<text>
   <t><t-str metadata="metadata.1">Het volgende vers komt uit Hamlet</t-str><br/>
   <t-str metadata="metadata.2">To be, or not to be, that is the question:<br/>Whether 'tis nobler in the mind to suffer<br/>The slings and arrows of outrageous fortune,<br/>Or to take Arms against a Sea of troubles,<br/> And by opposing end them:</t-str></t>
</text>

Does this answer your question?

hennie commented 7 years ago

"If you don't have sentences but words/tokens, then you can do it at that level. But it's most efficient to group things in bigger yet sensible structural units of course, whatever they may be."

If I understand Katrien correctly, she wants to associate metadata with segments that go across boundaries of the XML structure and that are potentially very large. Milestones allow you to identify such segments. To annotate a long sequence of tokens using your proposal would require one to replicate the metadata attribute for each token of the sequence. Is it feasible and compliant with FoLiA design principles to identify a sequence of tokens using the ids of the begin and end token, and in that way associate a metadata attribute with the sequence only once?

proycon commented 7 years ago

I see the issue yes, but I think it can be remedied in other more FoLiA-like ways, although a small amount of duplication may indeed occur in certain cases where structural boundaries are crossed.

Taking your example of a large amount of word tokens, of which a large subset gets different metadata: The range can be marked by simply introducing a new structural element, if there is no proper semantic choice such as paragraph, sentence, division (e.g. chapter/section/subsection), event.... then one can always fall back to the <part> structural element, which is a kind of a catch-all solution. Assume we start with text that has mere tokens, possibly some linebreaks, but no further structure at all:

<text>
  <w><t>Het</t></w>
  <w><t>volgende</t></w>
  <w><t>vers</t></w>
  <w><t>komt</t></w>
  <w><t>uit</t></w>
  <w><t>Hamlet:</t></w>
  <br />
  <w><t>To</t></w>
  <w><t>be</t></w>
  <w><t>or</t></w>
  <w><t>not</t></w>
  <w><t>to</t></w>
  <w><t>be</t></w>
</text>

We can then simply introduce a <part> structure element (but preferably a semantically more informed choice if possible!) to group structure and assign it metadata once:

<text>
  <part metadata="metadata.1">
   <w><t>Het</t></w>
   <w><t>volgende</t></w>
   <w><t>vers</t></w>
   <w><t>komt</t></w>
   <w><t>uit</t></w>
   <w><t>Hamlet:</t></w>
  </part>
  <br />
  <part metadata="metadata.2">
   <w><t>To</t></w>
   <w><t>be</t></w>
   <w><t>or</t></w>
   <w><t>not</t></w>
   <w><t>to</t></w>
   <w><t>be</t></w>
  </part>
</text>
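For a converter that has to introduce such grouping elements automatically, the wrapping step could look roughly like this. It is a sketch with Python's standard library on a shortened, namespace-free token list; `wrap_in_part` is a hypothetical helper, not part of the FoLiA library.

```python
import xml.etree.ElementTree as ET

# Shortened token list without namespace or further structure (assumption).
DOC = (
    "<text>"
    "<w><t>Het</t></w><w><t>vers</t></w>"
    "<br/>"
    "<w><t>To</t></w><w><t>be</t></w>"
    "</text>"
)


def wrap_in_part(parent, start, end, metadata_id):
    """Wrap the children parent[start:end] in a new <part> with a metadata reference."""
    part = ET.Element("part", {"metadata": metadata_id})
    for child in list(parent)[start:end]:
        parent.remove(child)   # removal is by element identity
        part.append(child)
    parent.insert(start, part)
    return part


text = ET.fromstring(DOC)
# Wrap the later range first so the indices of the earlier range stay valid.
wrap_in_part(text, 3, 5, "metadata.2")
wrap_in_part(text, 0, 2, "metadata.1")

result = ET.tostring(text, encoding="unicode")
```

The tokens themselves are moved, not copied, so any ids or annotations they carry stay attached to the same elements.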

FoLiA does not currently use any referencing system that refers to a begin and end, but opts for explicit references to the entire range (also in e.g. span annotation; consider that we also support discontinuous spans). FoLiA tries to make use of the hierarchy of XML wherever possible. Hence my reluctance to opt for ranges, where the burden is shifted to the client to resolve the reference and then iterate over the range (which is not even as trivial as it might seem at first), complicating retrieval.
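To illustrate why the hierarchy keeps retrieval simple, here is a sketch that resolves the metadata reference in effect for each token by a plain tree walk. It assumes that a reference on an element also covers its descendants, which the <part> examples suggest but which this thread does not spell out as a rule; `effective_metadata` is a hypothetical helper using only Python's standard library.

```python
import xml.etree.ElementTree as ET

# Namespace-free toy document (assumption); the last token has no
# enclosing element with a metadata reference.
DOC = """<text>
  <part metadata="metadata.1"><w><t>Hamlet:</t></w></part>
  <part metadata="metadata.2"><s><w><t>To</t></w><w><t>be</t></w></s></part>
  <w><t>loose</t></w>
</text>"""


def effective_metadata(root):
    """Return (token text, nearest enclosing metadata reference) pairs."""
    pairs = []

    def walk(element, inherited):
        # A reference on the element itself overrides the inherited one.
        current = element.get("metadata", inherited)
        if element.tag == "w":
            pairs.append((element.findtext("t"), current))
        for child in element:
            walk(child, current)

    walk(root, None)
    return pairs


pairs = effective_metadata(ET.fromstring(DOC))
```

No begin/end resolution or document-order bookkeeping is needed; a range-based scheme would instead require the client to locate both endpoints and collect everything in between, across element boundaries.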

kosloot commented 7 years ago

I think Maarten's proposal covers the cases best (or completely; hard to tell), keeping in mind the design choices in FoLiA. The <part> element should be avoided when possible, but used when nothing else is available.

Ko


kdepuydt commented 7 years ago

Hi Maarten, I think I understand the reason for the solutions you suggest. From a data-producing perspective, though, this is quite terrible. You have to know that, for the processing of the texts for Nederlab, the XML encoding is corrected manually. The reason why we chose milestones in TEI was to avoid having to introduce a structural element like seg that had to be repeated within different structural elements to indicate a specific section with metadata that does not properly nest with the rest of the XML. Now, we are using TEI for the processing of the texts, and the FoLiA format is reached by automatic conversion, so this would only bother Jesse, who does the TEI to FoLiA conversion. But having this solution for people who would like to start with FoLiA straight away is not ideal :-(.

There are several layers of information one wants to give to a text, and these are indicated separately in FoLiA. Why was a solution not chosen in which layer 1 is the original text (complete), i.e. the core text containing structural encoding and milestones (metadata) (or layer 1b with structural encoding and layer 1c with metadata annotations), layer 2 is Ticcl, layer 3 is linguistic annotation, etc.?

proycon commented 7 years ago

(Proposal accepted after Skype call)

proycon commented 7 years ago

Implementation in Python library and FoLiA tools is ready and available in master branch (pending release).