proycon / foliapy

An extensive Python library for dealing with FoLiA (Format for Linguistic Annotation) documents, a rich XML-based format for linguistic annotation finding application in Natural Language Processing (NLP). This library was formerly part of PyNLPl.
https://proycon.github.io/folia
GNU General Public License v3.0

Should span annotations list wrefs in order? #5

Closed asharkinasuit closed 3 years ago

asharkinasuit commented 6 years ago

It seems the documentation is silent on this, but since the word ids are included, the order technically doesn't matter. Right now, for instance, an Entity prints its words in the order in which they occur in the XML tree as children of the <entity> tag, not in the order in which they occur in the Sentence.

If the order does not matter, it is easier to modify a given entity: just add the missing words. If it matters, you have to be careful about where each word goes in the tree, and I'm not sure it is possible to control that using just the add function.

Edit: I see there's also an insert function that should do nicely for my last point 😃

Edit 2: ... except that one is actually inherited from the base class AbstractElement, which appears to be stricter about what kind of children it allows in. That is, the "automagical" acceptance that is provided in the AbstractSpanAnnotation.add method is not granted in the AbstractElement.insert. It would be nice if insert were also overridden in AbstractSpanAnnotation to support this.

Edit 3: Apparently WordReference objects are allowed to be inserted. Doing that seems to work, except that the entity's text method doesn't seem to take into account bare WordReference objects in its data list. Maybe this is because you're not supposed to construct WordReference objects yourself, maybe this is a separate issue...

A related question would be: since WordReference only seems to offer the id of the word, how does one generally get the word for an id? I would have thought XPath should do, but the tree needed for that is unloaded by default.

proycon commented 6 years ago

Good question! I should probably make this more explicit. Although this is not checked currently, the word references should be listed in the order in which the words occur in the text; otherwise some unexpected behaviour might occur.

It's indeed best not to create WordReferences yourself; the API will do it for you when you pass Word instances. I don't fully understand what's happening in your edit 3 situation yet.

An overridden insert method, as you suggest, sounds like a good way to solve the problem.

As a nastier, low-level workaround, you could also clear all data in the entity (entity.data = []) and then re-add all the words.
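The clear-and-re-add workaround can be sketched roughly as follows. This is a library-independent illustration, not the FoLiA API: `Word` and `Span` are hypothetical stand-ins for the real classes, and `rebuild_in_sentence_order` is a made-up helper name. It only assumes that a span keeps its members in a `data` list and has an `add` method that appends to it.

```python
from dataclasses import dataclass, field


@dataclass
class Word:
    """Hypothetical stand-in for a word element with an id and a text."""
    id: str
    text: str


@dataclass
class Span:
    """Hypothetical stand-in for a span annotation such as an entity."""
    data: list = field(default_factory=list)

    def add(self, word):
        # Always appends; there is no way to choose the position here.
        self.data.append(word)


def rebuild_in_sentence_order(span, sentence_words, extra_words=()):
    """Clear the span and re-add its words plus any missing ones,
    sorted by their position in the sentence."""
    position = {w.id: i for i, w in enumerate(sentence_words)}
    # Key by id so a word that is already in the span is not added twice.
    wanted = {w.id: w for w in list(span.data) + list(extra_words)}
    ordered = sorted(wanted.values(), key=lambda w: position[w.id])
    span.data = []
    for w in ordered:
        span.add(w)
    return span


sentence = [
    Word("s.1.w.1", "The"),
    Word("s.1.w.2", "Hague"),
    Word("s.1.w.3", "is"),
    Word("s.1.w.4", "nice"),
]
entity = Span(data=[sentence[1]])          # entity only covers "Hague"
rebuild_in_sentence_order(entity, sentence, extra_words=[sentence[0]])
print([w.text for w in entity.data])       # ['The', 'Hague']
```

Because the span is rebuilt from scratch, the order no longer depends on where each word was originally added, which sidesteps the lack of a position argument on add.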

asharkinasuit commented 6 years ago

The situation with edit 3 is this: some Entities lacked words that occur in my gold annotations, so I completed them using WordReferences. If you then call text() on the entity, it only prints the members that are actual Words, not the WordReferences. The problem with using add and letting it do the conversion to WordReference is that it doesn't allow you to specify where the word should be added, the way insert does.

My hack right now is to include WordReferences anyway but to manually add a txt property that I later read out. I guess resetting the data property would be slightly cleaner...
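An alternative to attaching the text by hand is to resolve each reference through an id index when the text is needed. The sketch below is library-independent: `Word`, `WordRef`, and `span_text` are hypothetical names, and the index is a plain dict built from the sentence. The library's document object may already offer a lookup by element id (something like `doc[word_id]`), but that is an assumption to check against its documentation.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical stand-in for a word element."""
    id: str
    text: str


@dataclass
class WordRef:
    """Hypothetical stand-in for a bare reference that only knows an id."""
    id: str


def span_text(items, index):
    """Join the text of a span whose members are a mix of words and
    bare references, resolving references through an id -> word index."""
    words = [index[item.id] if isinstance(item, WordRef) else item
             for item in items]
    return " ".join(w.text for w in words)


sentence = [Word("s.1.w.1", "The"), Word("s.1.w.2", "Hague")]
index = {w.id: w for w in sentence}        # id -> word lookup table

mixed = [WordRef("s.1.w.1"), sentence[1]]  # one reference, one real word
print(span_text(mixed, index))             # The Hague
```

This keeps the references free of extra properties, at the cost of needing the index wherever the text is computed.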