Open GoogleCodeExporter opened 9 years ago
Is there some documentation about these syntactic functions somewhere? I
couldn't find anything in a quick search.
Original comment by torsten....@gmail.com
on 4 Sep 2013 at 12:38
Yep, here:
http://www.coli.uni-saarland.de/projects/sfb378/negra-corpus/kanten.html
The pos tags are here:
http://www.coli.uni-saarland.de/projects/sfb378/negra-corpus/stts.asc
E.g. NN-DA = normal noun (NN) + dative (DA)
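To make the decomposition of such a combined label concrete, here is a minimal Java sketch. The class `NegraLabel` and its `parse` method are made up for this illustration and are not part of DKPro Core or the NEGRA tools. It assumes that the first hyphen separates the part-of-speech tag from the syntactic function.

```java
// Hypothetical helper, not part of DKPro Core: splits a combined label such as
// "NN-DA" into the part-of-speech tag and the syntactic function.
final class NegraLabel {
    final String posTag;            // e.g. "NN" (normal noun)
    final String syntacticFunction; // e.g. "DA" (dative), or null if absent

    private NegraLabel(String posTag, String syntacticFunction) {
        this.posTag = posTag;
        this.syntacticFunction = syntacticFunction;
    }

    // Assumption: the first hyphen separates tag and function.
    static NegraLabel parse(String label) {
        int sep = label.indexOf('-');
        if (sep < 0) {
            // No function attached, only a part-of-speech tag.
            return new NegraLabel(label, null);
        }
        return new NegraLabel(label.substring(0, sep), label.substring(sep + 1));
    }

    public static void main(String[] args) {
        NegraLabel label = parse("NN-DA");
        System.out.println(label.posTag + " / " + label.syntacticFunction);
    }
}
```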
The connection between the token layer and the constituent layer is a bit
awkward at the moment. E.g. the "parent" feature in the token is of type
Annotation because the token is in the segmentation API while the type
Constituent is in the syntax API (and syntax depends on segmentation, so
segmentation cannot depend on syntax). We might want to consider if the token
doesn't really belong to the syntax API, or if we can find a way that the
syntax API doesn't depend on the segmentation API. Btw. the dependency is
introduced because the Dependency type uses Token as its endpoints. So if we
move Dependency somewhere else... e.g. to "api.syntax.dependency".
Original comment by richard.eckart
on 4 Sep 2013 at 1:01
I might be missing a point, but I don't see why token should carry syntactic function
at all.
But I guess this is the old discussion about features instead of offset bound
retrieval :)
So if you think it makes sense to have it in token, I am fine with it.
Original comment by torsten....@gmail.com
on 4 Sep 2013 at 1:12
Here is my current opinion on the matter of offsets vs. features:
Offsets are a good starting point, in particular if it is not clear how often a
navigation path is used, if extensibility is an issue, and if one is not
familiar yet with the details of what is to be annotated.
Features are good if it is known that a navigation path is used often (and
should be reasonably fast), if it is known that extensibility is not a problem,
which entails that there is a good familiarity with what is to be annotated.
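A rough Java sketch of the two access styles, just to illustrate the trade-off. The `Span` class and the methods `covering` and `parentOf` are hypothetical stand-ins, not the actual DKPro Core or UIMA types.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical minimal types for illustration; not the DKPro Core/UIMA classes.
final class OffsetsVsFeatures {
    static class Span {
        final String label;
        final int begin;
        final int end;
        Span parent; // feature-based link, may be null

        Span(String label, int begin, int end) {
            this.label = label;
            this.begin = begin;
            this.end = end;
        }
    }

    // Offset-based retrieval: no explicit link is needed, so new annotation
    // types can be added freely, but every lookup scans the candidates.
    static List<Span> covering(List<Span> candidates, Span target) {
        List<Span> result = new ArrayList<>();
        for (Span c : candidates) {
            if (c != target && c.begin <= target.begin && target.end <= c.end) {
                result.add(c);
            }
        }
        return result;
    }

    // Feature-based navigation: a single reference lookup, but the link has
    // to be declared in the type beforehand.
    static Span parentOf(Span span) {
        return span.parent;
    }

    public static void main(String[] args) {
        Span s = new Span("S", 0, 20);
        Span np = new Span("NP", 0, 7);
        Span token = new Span("Token", 4, 7);
        np.parent = s;
        token.parent = np;

        List<Span> all = Arrays.asList(s, np, token);
        for (Span c : covering(all, token)) {
            System.out.println("covers token: " + c.label);
        }
        System.out.println("parent of token: " + parentOf(token).label);
    }
}
```

Note that the offset-based lookup returns every covering span, not only the direct parent, which is part of why an explicit feature is attractive for a frequently used path.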
In this issue, we have the case that we know there are syntactic function
labels on edges between constituents. There is a corresponding feature in the
Constituent type (although, admittedly, afaik we don't use it much). We treat
Tokens as terminals in the constituency structure, but in fact, in our type
system, Token does not inherit from Constituent and thus is not a Constituent.
So we have a conceptual problem here:
- on the one hand, we treat Token as a terminal in the constituency structure,
which means that there is an edge between the Token and the constituent above.
Such an edge should allow for a syntactic function label.
- on the other hand, Token *is not* a Constituent. If it were, it would
automatically inherit the "syntacticFunction" feature from the Constituent type.
So... is the Token a constituent or not?
If it is, then it should probably inherit from Constituent.
If it is not, then we should probably change our parser wrappers so that an
additional terminal constituent is introduced in the constituency tree which
can bear the syntactic function that would otherwise be associated directly
with the token.
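The two alternatives can be sketched roughly as follows. This is plain Java with simplified, hypothetical classes (`TokenA`, `TokenB`, `TerminalConstituent`), not the generated UIMA types; only the names `Constituent`, `syntacticFunction` and `parent` are taken from the discussion above.

```java
// Simplified illustration of the two design options; not the real type system.
final class TokenDesignOptions {
    // A constituent carries the label of the edge to its parent.
    static class Constituent {
        String constituentType;
        String syntacticFunction;
        Constituent parent;
    }

    // Option A: Token is a Constituent and inherits "syntacticFunction"
    // and "parent" automatically.
    static class TokenA extends Constituent {
        final String text;

        TokenA(String text) {
            this.text = text;
        }
    }

    // Option B: Token stays a plain segmentation unit without syntax features.
    static class TokenB {
        final String text;

        TokenB(String text) {
            this.text = text;
        }
    }

    // Option B, continued: an additional terminal constituent wraps the token
    // and bears the syntactic function instead.
    static class TerminalConstituent extends Constituent {
        final TokenB token;

        TerminalConstituent(TokenB token) {
            this.token = token;
        }
    }

    public static void main(String[] args) {
        Constituent np = new Constituent();
        np.constituentType = "NP";

        TokenA a = new TokenA("Haus");
        a.syntacticFunction = "DA";
        a.parent = np;

        TerminalConstituent t = new TerminalConstituent(new TokenB("Haus"));
        t.syntacticFunction = "DA";
        t.parent = np;

        System.out.println("A: " + a.text + " -" + a.syntacticFunction + "-> " + a.parent.constituentType);
        System.out.println("B: " + t.token.text + " -" + t.syntacticFunction + "-> " + t.parent.constituentType);
    }
}
```

In option B the function label lives on the wrapper, so code that starts from a token needs some other way (e.g. offsets) to reach the tree.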
Original comment by richard.eckart
on 4 Sep 2013 at 2:17
Nice summary.
In my world (TM), a Token is not a constituent.
So I would vote for introducing an additional terminal constituent and linking
the token to that if necessary.
Original comment by torsten....@gmail.com
on 4 Sep 2013 at 2:21
That would entail that we also remove the "parent" feature from the Token.
I slightly tend to adopt the view that a token is a part of the constituency
tree (a terminal node). The reason being this:
If we create a kind of "pre-terminal" node in the constituency tree, what
type/label would that have? Looking at how some parsers are implemented, the
"pre-terminal" node bears the part-of-speech tag, while the terminal (the
Token) is just the text. In DKPro Core, however, the part-of-speech is
attached to the Token (yeah... my fault, I know, but - as you too have noticed
- very convenient). So the "pre-terminal" node would either duplicate the POS
tag information (not good imho) or just be an empty dummy (likewise not so
nice).
I also think that removing "parent" from the token and introducing a
pre-terminal may require more extensive changes to the code than deriving
Token from Constituent.
Original comment by richard.eckart
on 4 Sep 2013 at 2:27
Interesting, I hadn't even noticed the getParent() method so far.
It probably depends whether you have a parser-centric view or not.
In the other perspective, where tokens are created by a segmentation process,
it makes little sense to define a "parent" of a token.
I am a bit worried that with making token a constituent, we adopt this
parser-centric view which might have "interesting" consequences later.
Original comment by torsten....@gmail.com
on 4 Sep 2013 at 2:35
Fair point. However, changing the structure would break a significant amount of
existing code and data, whereas changing the inheritance might break little or
nothing. It's just intuition at the moment; a test would be
required. If I am correct, I'd prefer to break nothing/little now than break
much now to avoid problems we may or may not run into later... unless we have a
clear picture what these problems would be and how we assess them.
Original comment by richard.eckart
on 4 Sep 2013 at 4:02
As I am not an ontologist, I am fine with not breaking things.
Original comment by torsten....@gmail.com
on 4 Sep 2013 at 4:43
Original comment by richard.eckart
on 19 Dec 2013 at 1:58
Original issue reported on code.google.com by
richard.eckart
on 30 Jun 2013 at 5:03