Add parent property to Token?

marcverhagen commented 8 years ago

This was first brought up in #26, but I made it a separate issue so it does not get buried.

Constituents have a parent feature, one that may be made mandatory. But the constituents list on PhraseStructure can also contain Tokens, which do not have a parent attribute. I see only ugly solutions: (1) add a parent feature to Token, (2) live with the fact that some elements in a tree have no parent, (3) let Token inherit from an element that has a parent feature, (4) use Constituents as leaf nodes, which would then have no label and an empty children list; we would also have to add something like targets to refer to the Token, and that means making Constituent a Region rather than a Relation.

The current situation is that we are following option 2.
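To make option 2 concrete, here is a minimal sketch, not actual LIF: the dictionaries and field names below are simplified assumptions for illustration. Tokens carry no parent feature, and a consumer that needs one derives it from the children lists of the Constituents.

```python
# Hypothetical, simplified annotations keyed by id; field names are assumptions.
annotations = {
    "c0": {"type": "Constituent", "label": "S", "parent": None, "children": ["c1", "c2"]},
    "c1": {"type": "Constituent", "label": "NP", "parent": "c0", "children": ["t0"]},
    "c2": {"type": "Constituent", "label": "VP", "parent": "c0", "children": ["t1"]},
    "t0": {"type": "Token", "word": "Dogs"},   # no parent feature (option 2)
    "t1": {"type": "Token", "word": "bark"},   # no parent feature (option 2)
}


def build_parent_index(annotations):
    """Map each child id to the id of the annotation that lists it as a child."""
    index = {}
    for ann_id, ann in annotations.items():
        for child_id in ann.get("children", []):
            index[child_id] = ann_id
    return index


parent_of = build_parent_index(annotations)
```

The index is built per tree, so a Token that appears in a second tree would simply get a different entry in that tree's index.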

reckart commented 8 years ago

This works if you assume that every sentence has only a single phrase structure, or at least that there is one canonical phrase structure which tokens refer to. If you think of having multiple phrase structures per sentence... but then I guess you don't.

In DKPro Core, we support only a single phrase structure per sentence (more than one would confuse our components), so the tokens can link up to that. But I noticed that Stanford CoreNLP, for example, allows multiple constituent trees per sentence for different reasons, e.g. binarized and regular trees or k-best trees.

marcverhagen commented 8 years ago

It indeed seems to assume that each Token always has the same parent. But note that the Tokens could have been copied from the tokens view, and that if the same Token occurs in two trees then we could have two copies. You get issues with having to change the ID, but that is doable. (Note that the full identifier of a Token is the concatenation of the view identifier and the Token identifier.)
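A rough sketch of the copying idea, under stated assumptions: the view ids, the `parent` field on the copied tokens, and the `:` separator are all hypothetical. Here the local token id is left unchanged and only the view prefix distinguishes the two copies.

```python
def full_id(view_id, annotation_id, sep=":"):
    """Concatenate view id and annotation id; the separator is an assumption."""
    return f"{view_id}{sep}{annotation_id}"


# Two hypothetical views, each holding its own copy of the same token,
# so that each copy can point at a different parent constituent.
views = {
    "v2": {"tk_0": {"word": "Dogs", "parent": "c1"}},  # e.g. regular tree
    "v3": {"tk_0": {"word": "Dogs", "parent": "c7"}},  # e.g. binarized tree
}

# Index the parents by full identifier so the two copies do not collide.
parents = {
    full_id(view_id, tok_id): tok["parent"]
    for view_id, tokens in views.items()
    for tok_id, tok in tokens.items()
}
```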

reckart commented 8 years ago

I don't know... at the moment, I also only see ugly solutions, at least when it comes to supporting multiple phrase structures.

Views can be used as a mechanism to contain phrase structures and associated tokens, but it feels like it introduces redundancy. Maybe a treacherous feeling, so let's put that aside for a moment.

The intended semantics of views in LIF aren't clear to me. One version I have heard is that it might be considered a best practice for every new analysis component in a pipeline to add a new view, and that existing views are never updated. But then again, this also seems to introduce redundancies. Should it really be necessary to copy all the tokens that a tokenizer produced in order to add POS information? Now you appear to suggest another view semantics, namely to isolate different variations of a single annotation structure (phrase structure) from each other.

For the moment, what seems most perplexing to me is how an analysis component would choose which data to use as input. How does it determine which phrase structure variation to choose or which view to choose?

The variation to choose could be obtained e.g. by looking for a specific value in a "type" attribute (no such thing for phrase structure in LIF now, but dependency structures have a "dependencyType") - ok. But how to choose the view?

marcverhagen commented 8 years ago

Yes, the redundancy issue. At some point we allowed components to change existing views as they pleased; we then restricted that a little bit by only allowing components to add information, but I am not wild about that idea either. There is a trade-off between redundancy and knowing exactly what was put where by whom. I may open a separate issue on this to collect the pros and cons again. By the way, people have argued for making POS its own first-class citizen and not an attribute of Token. There is something to that, but then the logical endpoint is for each attribute to be independent, and that seemed too extreme for us and we did not know where to draw the line.

There is no one-to-one correspondence between views and annotation structures: one structure could be distributed over many views, and one view can have many different kinds of annotations. In the metadata, the views are connected to producers and to the kinds of annotations that can be found in the view. I really see the views as connected to their producers and not to levels of annotation.

I don't think I am suggesting a different semantics for views and I don't think that views should isolate different annotations, but, yes, they can be used for that.

Finally, the issue of how an analysis component would choose which data to use as input. This is not yet implemented, although some bits and pieces are. But the idea is that each component specifies what it produces and what input it requires. Each service is required to provide a getMetadata() method which returns a JSON string that contains the requirements. The requirements are expressed with discriminators. So the service might say that it requires its input to have Token annotations, or perhaps even Token annotations created by some producer. Then the service wrapper's task is to find the view with those annotations, basically by checking the metadata for each view. This is trivial in simple pipelines, but it gets tricky when you have a LIF object with many views and many competing annotations of the same kind.
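Since this is not implemented yet, the following is only a sketch of what the wrapper's lookup could look like. The shape of the view metadata (`contains`, `producer`), the function name `find_views`, and the plain strings used as discriminators are assumptions for illustration, not the actual API.

```python
def find_views(lif, required_type, producer=None):
    """Return ids of views whose metadata lists required_type.

    If producer is given, only views whose annotations of that type
    were created by that producer are returned.
    """
    matches = []
    for view in lif.get("views", []):
        contains = view.get("metadata", {}).get("contains", {})
        info = contains.get(required_type)
        if info is None:
            continue  # this view has no annotations of the required type
        if producer is not None and info.get("producer") != producer:
            continue  # right type, wrong producer
        matches.append(view["id"])
    return matches


# Hypothetical LIF-like object with competing Token annotations.
lif = {
    "views": [
        {"id": "v1", "metadata": {"contains": {"Token": {"producer": "tokenizer.A"}}}},
        {"id": "v2", "metadata": {"contains": {"Token": {"producer": "tokenizer.B"}}}},
        {"id": "v3", "metadata": {"contains": {"Sentence": {"producer": "splitter.A"}}}},
    ]
}

token_views = find_views(lif, "Token")
token_views_b = find_views(lif, "Token", producer="tokenizer.B")
```

With only a type requirement, the lookup returns both `v1` and `v2`, which is exactly the tricky case: the wrapper still needs some policy (most recent view, a producer constraint, or something else) to pick one.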

ksuderman commented 7 years ago

Back to the original issue... I don't think we should add a parent property to the Token type. Option 2 (what we do now) isn't that ugly and allows for Tokens to be a constituent of more than one thing.

lapps / vocabulary-pages

Add parent property to Token? #28