lapps / vocabulary-pages

DSL files and templates used to generate the LAPPS WS-EV pages.
Apache License 2.0
LIF: motivate/acknowledge targets on Span #8

Closed marcverhagen closed 7 years ago

marcverhagen commented 8 years ago

Add some notes on this to the Coreference section, but note that it really belongs somewhere in the vocabulary.

reckart commented 8 years ago

What are targets really used for anyway? The documentation says that they are an alternative to offsets. What does that mean? Are offsets and targets mutually exclusive? If both are used, are start/end the min/max of the targets? Or must both be used? If targets are used, how is this information used, e.g., when feeding data to a POS tagger, parser, or other component?

nancyide commented 8 years ago

In some cases you are not referring to a span in the text, but rather to another annotation, e.g., in a coreference. For this you need to be able to address IDs instead of offsets in the text.

reckart commented 8 years ago

Absolutely. I wonder though if having a dedicated property in such a case wouldn't make more sense than re-using "targets" which appears more to be a way of modelling discontinuous spans. For example, in the Constituent type, we have "children" instead of re-using "targets".

nancyide commented 8 years ago

I am not sure I understand what you are suggesting here. Can you explain?

reckart commented 8 years ago

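"targets" is defined on Span which is the base type for many other types:

  • Sentence - targets could point to tokens and define token order, e.g. in oddball cases where there are multiple tokens with the same offset
  • NounChunk - could use targets to model discontinuous chunks
  • VerbChunk - could use targets to model discontinuous chunks
  • NamedEntity - could use targets to model discontinuous NEs
  • Token - not really clear what to use targets for... tokens pointing to tokens to model discontinuous tokens appears slightly odd - but might be a possibility
  • Markable - again not really clear what to use targets for
  • Constituent - could be using targets to point to children, but that is what the "children" property is already used for - use of targets is unclear

I just noticed again that Coreference actually doesn't inherit from Span and it uses the "mentions" property to point to annotations. Actually now I am confused why targets are motivated with coreference.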

marcverhagen commented 8 years ago

This is a thorny issue that we have discussed before but never totally resolved.

The subtypes of spans are those where it generally makes sense to point at a span in the text. But for some of them, and not necessarily for all, it indeed makes sense to have a way to deal with discontinuous tags. And the targets property can be used for that.

We could have added targets to just those subtypes of Span that need them (and Markable is one of them, there is some discussion on this in http://lapps.github.io/interchange/coref.html in section 3), but we decided to put targets on Span, which has the disadvantage of being a bit unintuitive. What we really want is a better name for Span (or maybe even some form of multiple inheritance).

I do agree that the phrase "or to link two or more annotations (e.g., in a coreference annotation)" is confusing, and I do not think we use it for that purpose. If we have a relation to specify we should probably just name it.

reckart commented 8 years ago

Regarding discontinuous annotations:

Discontinuous annotations are a bit tricky - I guess they are reasonably easy to model (the exact manner is a bit controversial), but the semantics when communicating with automatic analysis components remain somewhat unclear to me.

Recently, I have been thinking about modelling discontinuity by adding a "next" property to types to indicate that two annotations of the same type actually form one. This would be an attempt to maintain the split structure expected by most automatic tools and still model the discontinuity for human users. Another approach could be to have some "anchor" type which contains only begin/end and use that instead of having the begin/end embedded in a generic Span/Annotation type - but it would mean major reengineering of many tool wrappers and, e.g. in a UIMA context, it would mean departing from the standard begin/end in the Annotation base type. So for the time being, I prefer staying away from discontinuous annotations. I cannot remember having encountered them for automatic tools, and for manual tools (e.g. WebAnno) we ask people to model them through a relation.
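To illustrate the "next" idea, here is a minimal Python sketch. The record layout, the ids, and the `logical_units` helper are hypothetical and only loosely modelled on LIF-style annotations; the sketch also assumes that "next" links never form a cycle.

```python
TEXT = "She looked the word up."

# Hypothetical records: two VerbChunk parts linked by a "next" property.
CHUNKS = [
    {"id": "v0", "type": "VerbChunk", "start": 4, "end": 10, "next": "v1"},
    {"id": "v1", "type": "VerbChunk", "start": 20, "end": 22},
]


def logical_units(annotations, text):
    """Group annotations chained through "next" into one logical unit each."""
    by_id = {a["id"]: a for a in annotations}
    # Ids that are the continuation of some other annotation.
    continued = {a["next"] for a in annotations if "next" in a}
    units = []
    for ann in annotations:
        if ann["id"] in continued:
            continue  # not the head of a chain
        parts = []
        current = ann
        while current is not None:
            parts.append(text[current["start"]:current["end"]])
            current = by_id.get(current.get("next"))
        units.append(parts)
    return units


# What an automatic tool sees: plain, separate spans.
flat = [TEXT[a["start"]:a["end"]] for a in CHUNKS]
# What a human-facing view could reconstruct from the "next" links.
units = logical_units(CHUNKS, TEXT)
print(flat)
print(units)
```

Every part keeps its own begin/end, so a tool that ignores "next" still gets ordinary contiguous spans.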

Regarding multiple inheritance: How would you model the situation with multiple inheritance?

nancyide commented 8 years ago

On Nov 17, 2015, at 5:14 PM, Richard Eckart de Castilho notifications@github.com wrote:

> "targets" is defined on Span which is the base type for many other types:
>
> • Sentence - targets could point to tokens and to define token order, e.g. in oddball cases where there are multiple tokens with the same offset
> • NounChunk - could use targets to model discontinuous chunks
> • VerbChunk - could use targets to model discontinuous chunks
> • NamedEntity - could use targets to model discontinuous NEs
> • Token - not really clear what to use targets for... tokens pointing to tokens to model discontinuous tokens appears slightly odd - but might be a possibility

Not sure why you think this is odd. We do it this way in GrAF, where the targets are (by default) an ordered list of constituents.

> • Markable - again not really clear what to use targets for

Many coreferencers include coreference chains on Markables that give the ids of all other items in the chain.

> • Constituent - could be using targets to point to children, but that is what the "children" property is already used for - use of targets is unclear
>
> I just noticed again that coreference actually doesn't inherit from Span and it uses the "mentions" property to point to annotations. Actually now I am confused why targets are motivated with coreference.

I think this is a mistake; we should make this consistent.


Nancy Ide Professor of Computer Science

Department of Computer Science Vassar College Poughkeepsie, New York 12604-0520 USA

tel: (+1 845) 437 5988 fax: (+1 845) 437 7498 email: ide@cs.vassar.edu http://www.cs.vassar.edu/~ide


reckart commented 8 years ago

> > Markable - again not really clear what to use targets for
>
> Many coreferencers include coreference chains on Markables that give the ids of all other items in the chain.

Isn't that the purpose of the "mentions" in the Coreference type?

> > Token - not really clear what to use targets for... tokens pointing to tokens to model discontinuous tokens appears slightly odd - but might be a possibility
>
> Not sure why you think this is odd. We do it this way in GrAF, where the targets are (by default) an ordered list of constituents.

A typical thing when talking to an automatic analysis tool is to give it a list of the tokens. I would consider it the general expectation that fetching all Tokens from a text would e.g. yield exactly the input for a POS tagger:

This
is
a
test
.

Now if tokens have targets that point to other tokens, things become odd. The following could be the result of such an attempt if somebody found it would be a good idea to create a Token annotation over "John Meyer" that points to the targets "John" and "Meyer".

This
is
John Meyer
John
Meyer
.   

That's not what the POS tagger would expect of course.
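A minimal Python sketch of the same problem. The record layout is a simplified, hypothetical stand-in for LIF annotations (the real schema differs), and filtering on the absence of "targets" is just one possible consumer-side workaround, not something the vocabulary prescribes.

```python
TEXT = "This is John Meyer."

# Simplified, hypothetical Token records (not the exact LIF schema).
TOKENS = [
    {"id": "t0", "type": "Token", "start": 0, "end": 4},
    {"id": "t1", "type": "Token", "start": 5, "end": 7},
    {"id": "t2", "type": "Token", "start": 8, "end": 12},
    {"id": "t3", "type": "Token", "start": 13, "end": 18},
    {"id": "t4", "type": "Token", "start": 18, "end": 19},
    # A Token over "John Meyer" that points at t2 and t3 instead of offsets.
    {"id": "t5", "type": "Token", "targets": ["t2", "t3"]},
]


def surface(ann, by_id, text):
    """Return the text an annotation covers, following targets if present."""
    if "targets" in ann:
        return " ".join(surface(by_id[t], by_id, text) for t in ann["targets"])
    return text[ann["start"]:ann["end"]]


by_id = {a["id"]: a for a in TOKENS}

# Naively fetching every Token also returns the composite one.
naive = [surface(a, by_id, TEXT) for a in TOKENS]
# Keeping only offset-anchored Tokens restores the usual tagger input.
leaves = [surface(a, by_id, TEXT) for a in TOKENS if "targets" not in a]
print(naive)
print(leaves)
```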

reckart commented 8 years ago

A better example might be one where a phrasal verb is split and a user defines a Token over the whole phrasal verb with targets pointing to Tokens representing the individual parts of the phrasal verb. A POS tagger would want to see only the Tokens for the individual parts, not for the phrasal verb as a whole.

nancyide commented 8 years ago

On Nov 18, 2015, at 7:47 AM, Richard Eckart de Castilho notifications@github.com wrote:

> > > Markable - again not really clear what to use targets for
> >
> > Many coreferencers include coreference chains on Markables that give the ids of all other items in the chain.
>
> Isn't that the purpose of the "mentions" in the Coreference type?

Yes I thought that we had eliminated mentions when we added targets.

> > > Token - not really clear what to use targets for... tokens pointing to tokens to model discontinuous tokens appears slightly odd - but might be a possibility
> >
> > Not sure why you think this is odd. We do it this way in GrAF, where the targets are (by default) an ordered list of constituents.
>
> A typical thing when talking to an automatic analysis tool is to give it a list of the tokens. I would consider it the general expectation that fetching all Tokens from a text would e.g. yield exactly the input for a POS tagger:
>
> This
> is
> a
> test
> .
>
> Now if tokens have targets that point to other tokens, things become odd. The following could be the result of such an attempt if somebody found it would be a good idea to create a Token annotation over "John Meyer" that points to the targets "John" and "Meyer".
>
> This
> is
> John Meyer
> John
> Meyer
> .
>
> That's not what the POS tagger would expect of course.

Obviously. But we cannot try to enforce every possibility for every pair of tools a priori. We have metadata indicating which tool produced the tokens (and even which rules were used), so compatibility is determined based on a match/compatibility between the tokenizer output type and the expected input type for the pos tagger.


reckart commented 8 years ago

Do you know of an example of an automatic processing tool that consumes or produces tokens that are not a sequence of non-overlapping character spans?

nancyide commented 8 years ago

No.

marcverhagen commented 8 years ago

It is true that many of the subtypes of Region may not have any use for targets. Token might be a good example, and tokens pointing to tokens does not make a lot of sense. But imagine a tool that has tokens consisting of morphemes; in that case it makes perfect sense to use targets.

Mentions were not introduced to replace targets. You can see targets as a generic way of allowing annotations to refer to other annotations. But in many cases that way is codified in linguistic practice with another name (constituents for syntactic parses, daughters for constituents, mentions for the elements of coreference chains), and using these names is more informative. We will live with the targets property still being there when it is not needed. It is either that or adding targets to all annotation types that need them, and we like to introduce properties only once. With fancier inheritance we could shadow the targets property and make it go away, but I think that does not make things easier to understand.

Also, mentions are the elements of a coreference chain and one of those mentions could be a split antecedent. Therefore we made the mentions all markables and each markable can point to one or more annotations, be they tokens or constituents or whatnot.

In a sense, we allow regions to indirectly refer to a span or set of spans in the primary data, which makes them a little bit like relations.

What we have is a practical solution, with perhaps somewhat questionable semantics, but it allows for some flexibility and so far it was the simplest we could think of.
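As a rough Python sketch of how this indirection can be resolved: the records, ids, and the `resolve_spans` and `extent` helpers below are hypothetical simplifications, not the actual LIF schema or API. Reading a targets-based region as the min/max extent of its spans is only one possible interpretation, which the vocabulary does not define.

```python
TEXT = "John met Mary. They left."

# Hypothetical records: tokens carry offsets, markables carry targets,
# and the coreference annotation carries mentions.
ANNOTATIONS = [
    {"id": "tok0", "type": "Token", "start": 0, "end": 4},
    {"id": "tok1", "type": "Token", "start": 9, "end": 13},
    {"id": "tok2", "type": "Token", "start": 15, "end": 19},
    # Split antecedent: one markable pointing at two separate tokens.
    {"id": "m0", "type": "Markable", "targets": ["tok0", "tok1"]},
    {"id": "m1", "type": "Markable", "targets": ["tok2"]},
    {"id": "c0", "type": "Coreference", "mentions": ["m0", "m1"]},
]

by_id = {a["id"]: a for a in ANNOTATIONS}


def resolve_spans(ann_id, by_id):
    """Follow targets recursively down to (start, end) pairs in the text."""
    ann = by_id[ann_id]
    if "targets" in ann:
        spans = []
        for target in ann["targets"]:
            spans.extend(resolve_spans(target, by_id))
        return spans
    return [(ann["start"], ann["end"])]


def extent(spans):
    """One possible reading: the smallest single span covering all spans."""
    return (min(s for s, _ in spans), max(e for _, e in spans))


resolved = {m: resolve_spans(m, by_id) for m in by_id["c0"]["mentions"]}
print(resolved)
print(extent(resolved["m0"]))
```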

marcverhagen commented 8 years ago

@reckart By the way, in a recent meeting we decided there should be a document that presents our motivations for designing the vocabulary as we did and that we will link to this document from the vocab pages.

ksuderman commented 7 years ago

Given this is two years old I am going to close this issue.

Some closing notes:

  1. @targets and @start/@end are meant to be mutually exclusive, but we do not have a way to formally express this and will have to do it in the prose.
    • It is expected that almost all regions will use the @start and @end features
  2. The Markable type was introduced so coreference annotations could refer to text that had not been annotated in any other way
  3. Since Coreference is not a sub-type of Region it does not have a @targets attribute and that is why it uses @mentions.
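Since the mutual exclusivity in note 1 can only be stated in prose, a consumer could still check it at runtime. The following is a minimal Python sketch under assumptions: the flat record layout and the `anchoring_problems` helper are hypothetical and not part of any LAPPS tooling, and the check for an incomplete start/end pair is an extra assumption beyond what the notes state.

```python
def anchoring_problems(ann):
    """Return a list of problems with how a region is anchored."""
    has_start = "start" in ann
    has_end = "end" in ann
    has_targets = bool(ann.get("targets"))
    problems = []
    # Note 1: targets and start/end are meant to be mutually exclusive.
    if has_targets and (has_start or has_end):
        problems.append("targets and start/end are both present")
    # Extra assumption: offsets only make sense as a complete pair.
    if has_start != has_end:
        problems.append("start/end pair is incomplete")
    return problems


ok_offsets = {"id": "t0", "type": "Token", "start": 0, "end": 4}
ok_targets = {"id": "m0", "type": "Markable", "targets": ["t0"]}
mixed = {"id": "x0", "type": "NamedEntity", "start": 0, "end": 4, "targets": ["t0"]}

for ann in (ok_offsets, ok_targets, mixed):
    print(ann["id"], anchoring_problems(ann))
```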