Token annotations cannot carry syntactic function

GoogleCodeExporter commented 9 years ago

The Stanford Parser produces syntactic functions for terminals (tokens), e.g. 
NN-DA or NN-DO (noun + function). In the Constituent annotation, we can maintain 
the syntactic function in the CAS, but the Token annotation has no feature for it. 

Probably the "Token" type should have a syntactic function feature as well?

Original issue reported on code.google.com by richard.eckart on 30 Jun 2013 at 5:03

GoogleCodeExporter commented 9 years ago

Is there some documentation about these syntactic functions somewhere? I 
couldn't find anything in a quick search.

Original comment by torsten....@gmail.com on 4 Sep 2013 at 12:38

GoogleCodeExporter commented 9 years ago

Yep, here: 
http://www.coli.uni-saarland.de/projects/sfb378/negra-corpus/kanten.html
The pos tags are here: 
http://www.coli.uni-saarland.de/projects/sfb378/negra-corpus/stts.asc

E.g. NN-DA = normal noun (NN) + dative (DA)
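
A minimal sketch of how such a combined terminal label could be split into its two parts. It is written in Python for brevity and is not taken from any parser wrapper; `TerminalLabel` and `split_terminal_label` are hypothetical names, and the assumption is that the POS tag itself contains no hyphen.

```python
from typing import NamedTuple, Optional


class TerminalLabel(NamedTuple):
    pos: str                  # part-of-speech tag, e.g. "NN"
    function: Optional[str]   # syntactic function, e.g. "DA"; None if absent


def split_terminal_label(label: str) -> TerminalLabel:
    """Split a combined label such as "NN-DA" at the first hyphen.

    Assumption: the POS tag never contains a hyphen, so everything
    after the first hyphen is the syntactic function.
    """
    pos, sep, function = label.partition("-")
    return TerminalLabel(pos, function if sep and function else None)


print(split_terminal_label("NN-DA"))
print(split_terminal_label("NN"))
```

The POS part already has a place on the Token; the function part is what currently has nowhere to go.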

The connection between the token layer and the constituent layer is a bit 
awkward at the moment. E.g. the "parent" feature in the token is of type 
Annotation because the token is in the segmentation API while the type 
Constituent is in the syntax API (and syntax depends on segmentation, so 
segmentation cannot depend on syntax). We might want to consider if the token 
doesn't really belong to the syntax API, or if we can find a way that the 
syntax API doesn't depend on the segmentation API. Btw. the dependency is 
introduced because the Dependency type uses Token as its endpoints. So if we 
move Dependency somewhere else... e.g. to "api.syntax.dependency".
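
The layering described above can be pictured with a simplified model. This is a Python sketch, not the actual UIMA/DKPro Core type system: the class and field names only mirror the types mentioned in this comment, and `parent_constituent` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Optional


# --- "segmentation" layer: must not reference anything from "syntax" ---
@dataclass
class Annotation:
    begin: int
    end: int


@dataclass
class Token(Annotation):
    # Typed as the generic base type, because Constituent is not visible here.
    parent: Optional[Annotation] = None


# --- "syntax" layer: depends on "segmentation" ---
@dataclass
class Constituent(Annotation):
    constituent_type: str = ""
    syntactic_function: Optional[str] = None
    parent: Optional[Annotation] = None


@dataclass
class Dependency(Annotation):
    # The Token endpoints are what make "syntax" depend on "segmentation".
    governor: Optional[Token] = None
    dependent: Optional[Token] = None


def parent_constituent(token: Token) -> Optional[Constituent]:
    """Narrow the loosely typed parent; the declared type alone gives no guarantee."""
    return token.parent if isinstance(token.parent, Constituent) else None


np = Constituent(0, 9, constituent_type="NP")
tok = Token(4, 9, parent=np)
print(parent_constituent(tok).constituent_type)
```

In this model, moving `Dependency` out of the syntax layer would remove the only reference from syntax to `Token`.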

Original comment by richard.eckart on 4 Sep 2013 at 1:01

GoogleCodeExporter commented 9 years ago

I might be missing the point, but I don't see why the token should carry a 
syntactic function at all.
But I guess this is the old discussion about features instead of offset-bound 
retrieval :)
So if you think it makes sense to have it in token, I am fine with it.

Original comment by torsten....@gmail.com on 4 Sep 2013 at 1:12

GoogleCodeExporter commented 9 years ago

Here is my current opinion on the matter of offsets vs. features:

Offsets are a good starting point, in particular if it is not clear how often a 
navigation path is used, if extensibility is an issue, and if one is not 
familiar yet with the details of what is to be annotated.

Features are good if it is known that a navigation path is used often (and 
should be reasonably fast) and if it is known that extensibility is not a 
problem, which entails that there is a good familiarity with what is to be annotated.
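
To make the two navigation styles concrete, here is a small Python sketch. It is an illustration only: `Span`, `covering` and `smallest_covering` are hypothetical names, and a plain list stands in for the annotation index.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    begin: int
    end: int
    label: str
    # Feature-based navigation: an explicit reference stored on the annotation.
    parent: Optional["Span"] = field(default=None, repr=False)


def covering(index: List[Span], target: Span) -> List[Span]:
    """Offset-bound retrieval: scan the index for spans that enclose the target."""
    return [s for s in index
            if s is not target and s.begin <= target.begin and target.end <= s.end]


def smallest_covering(index: List[Span], target: Span) -> Optional[Span]:
    """Pick the tightest enclosing span, or None if nothing covers the target."""
    return min(covering(index, target), key=lambda s: s.end - s.begin, default=None)


sentence = Span(0, 20, "S")
np = Span(0, 9, "NP", parent=sentence)
token = Span(4, 9, "NN", parent=np)
index = [sentence, np, token]

print(smallest_covering(index, token).label)  # via offsets
print(token.parent.label)                     # via feature
```

The offset variant needs no dedicated feature and works for annotation types added later, at the cost of a search; the feature variant is a direct lookup but fixes the path in the type definition.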

In this issue, we have the case that we know there are syntactic function 
labels on edges between constituents. There is a corresponding feature in the 
Constituent type (although, admittedly, afaik we don't use it much). We treat 
Tokens as terminals in the constituency structure, but in fact, in our type 
system, Token does not inherit from Constituent and thus is not a Constituent. 
So we have a conceptual problem here:

- on the one hand, we treat Token as a terminal in the constituency structure, 
which means that there is an edge between the Token and the constituent above. 
Such an edge should allow for a syntactic function label.

- on the other hand, Token *is not* a Constituent. If it were, it would 
automatically inherit the "syntacticFunction" feature from the Constituent type.

So... is the Token a constituent or not?

If it is, then it should probably inherit from Constituent.

If it is not, then we should probably change our parser wrappers so that an 
additional terminal constituent is introduced in the constituency tree which 
can bear the syntactic function that would otherwise be associated directly 
with the token.
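
The inheritance alternative can be pictured roughly like this. It is a simplified Python model, not the DKPro Core type system: the field names are assumptions modeled on the features named in this thread.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Constituent:
    begin: int
    end: int
    constituent_type: str = ""
    syntactic_function: Optional[str] = None
    parent: Optional["Constituent"] = None


@dataclass
class Token(Constituent):
    # Only token-specific data is declared here; the function label
    # and the parent reference come from Constituent.
    pos: str = ""


np = Constituent(0, 9, constituent_type="NP")
tok = Token(4, 9, pos="NN", syntactic_function="DA", parent=np)

print(isinstance(tok, Constituent), tok.syntactic_function, tok.parent.constituent_type)
```

In this model the edge label lives on the token itself and "parent" becomes strongly typed, but Token and Constituent then have to be visible to each other, which touches the module question raised earlier.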

Original comment by richard.eckart on 4 Sep 2013 at 2:17

GoogleCodeExporter commented 9 years ago

Nice summary.

In my world (TM), a Token is not a constituent.
So I would vote for introducing an additional terminal constituent and linking 
the token to that if necessary.

Original comment by torsten....@gmail.com on 4 Sep 2013 at 2:21

GoogleCodeExporter commented 9 years ago

That would entail that we also remove the "parent" feature from the Token.

I slightly tend to adopt the view that a token is a part of the constituency 
tree (a terminal node). The reason being this: 

If we create a kind of "pre-terminal" node in the constituency tree, what 
type/label would that have? Looking at how some parsers are implemented, the 
"pre-terminal" node bears the part-of-speech tag, while the terminal (the 
Token) is just the text. In DKPro Core, however, the part-of-speech tag is 
attached to the Token (yeah... my fault, I know, but - as you too have noticed 
- very convenient). So the "pre-terminal" node would either duplicate the POS 
tag information (not good imho) or just be an empty dummy (likewise not so 
nice). 
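
A rough Python sketch of the pre-terminal alternative and of the labeling dilemma. It is illustrative only: `wrap_in_preterminal` and its `duplicate_pos` switch are hypothetical, and the classes are simplified stand-ins for the real types.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    begin: int
    end: int
    pos: str  # the POS tag is attached to the token itself


@dataclass
class Constituent:
    begin: int
    end: int
    constituent_type: str
    syntactic_function: Optional[str] = None
    parent: Optional["Constituent"] = None
    token: Optional[Token] = None  # set only on pre-terminal nodes


def wrap_in_preterminal(token: Token, function: Optional[str],
                        parent: Constituent, duplicate_pos: bool) -> Constituent:
    """Create a pre-terminal node that carries the function label for a token."""
    # Either copy the POS tag (duplicated information) or leave the label empty (dummy node).
    label = token.pos if duplicate_pos else ""
    return Constituent(token.begin, token.end, label, function, parent, token)


np = Constituent(0, 9, "NP")
tok = Token(4, 9, pos="NN")

copied = wrap_in_preterminal(tok, "DA", np, duplicate_pos=True)
dummy = wrap_in_preterminal(tok, "DA", np, duplicate_pos=False)
print(repr(copied.constituent_type), repr(dummy.constituent_type))
```

Either way the function label has a home, but the node's own label is either a copy of `tok.pos` or carries no information.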

I also think that removing "parent" from the token and introducing a 
pre-terminal may require more extensive changes to the code than deriving 
Token from Constituent.

Original comment by richard.eckart on 4 Sep 2013 at 2:27

GoogleCodeExporter commented 9 years ago

Interesting, I hadn't even noticed the getParent() method so far.

It probably depends on whether you have a parser-centric view or not.
From the other perspective, where tokens are created by a segmentation process, 
it makes little sense to define a "parent" of a token.

I am a bit worried that by making Token a constituent, we adopt this 
parser-centric view, which might have "interesting" consequences later.

Original comment by torsten....@gmail.com on 4 Sep 2013 at 2:35

GoogleCodeExporter commented 9 years ago

Fair point. However, changing the structure would definitely break a significant 
amount of existing code and data, whereas changing the inheritance may break 
little or nothing. That's just intuition at the moment; a test would be 
required. If I am correct, I'd prefer to break little or nothing now rather than 
break a lot now to avoid problems we may or may not run into later... unless we 
have a clear picture of what these problems would be and how we assess them.

Original comment by richard.eckart on 4 Sep 2013 at 4:02

GoogleCodeExporter commented 9 years ago

As I am not an ontologist, I am fine with not breaking things.

Original comment by torsten....@gmail.com on 4 Sep 2013 at 4:43

GoogleCodeExporter commented 9 years ago

Original comment by richard.eckart on 19 Dec 2013 at 1:58

Changed state: New

kulukimak / dkpro-core-asl

Token annotations cannot carry syntactic function #174