lapps / vocabulary-pages

DSL files and templates used to generate the LAPPS WS-EV pages.
Apache License 2.0

Discuss whether to have PosTag as an annotation #12

Closed marcverhagen closed 8 years ago

reckart commented 8 years ago

We have POS as an annotation in DKPro Core. That allows us to also have coarse-grained POS tags like N, V, etc. as types.

I'm curious - what are your reasons for having PosTag as an annotation?

nancyide commented 8 years ago

We discussed this (I even talked to you about it in Darmstadt, if you recall) because it is obviously somewhat arbitrary, and if you look across UIMA type systems there is no commonly agreed-upon way to do it. We decided to leave it as a property on the token because otherwise, for consistency, we would have to make lemma and possibly other things separate annotations as well, which would start to get messy.

I do not understand why having POS on the token disallows having coarse-grained tags. In our scheme, you would simply reference the posTagSet, whatever it is, in the metadata. That could be anything.
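To make the property-on-token scheme concrete, here is a minimal sketch in Python. The dictionary layout, the key names (`contains`, `Token#pos`, `features`) and the tag set label `penn` are assumptions made for illustration, not a statement of the actual LAPPS format; only the ideas of a `pos` property on the token and a `posTagSet` entry in the metadata come from the comment above.

```python
# Illustrative only: a view with POS stored as a property on each token.
# All key names and the "penn" label are assumptions for this sketch.
view = {
    "metadata": {
        "contains": {
            # The tag set is declared once, in the metadata.
            "Token#pos": {"posTagSet": "penn"},
        }
    },
    "annotations": [
        {"@type": "Token", "start": 0, "end": 6, "features": {"pos": "NNP"}},
        {"@type": "Token", "start": 7, "end": 12, "features": {"pos": "VBZ"}},
    ],
}


def pos_tags(view):
    """Return (tag set, [(start, end, tag), ...]) for the tokens in a view."""
    tag_set = view["metadata"]["contains"]["Token#pos"]["posTagSet"]
    tags = [
        (a["start"], a["end"], a["features"]["pos"])
        for a in view["annotations"]
        if a["@type"] == "Token"
    ]
    return tag_set, tags
```

Under these assumptions, switching to a coarse-grained tag set would only change the `posTagSet` value and the tag strings; the token structure stays the same.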


reckart commented 8 years ago

I'm not really arguing for either way of doing it. DKPro Core has annotations for POS, and there are good and not-so-good sides to it. You chose not to have them as annotations so far but appear to be considering changing that. I'm just curious about your motivations.

Regarding coarse-grained versus fine-grained tags: having POS on the token doesn't disallow coarse-grained tags, but obviously you have to choose which one you want to use. In DKPro Core, we actually have both at the same time.

We have a coarse-grained tag set of fixed categories closely resembling the Universal POS tags. We have actual UIMA types for these coarse-grained categories (consider the following to be an inheritance hierarchy):

Annotation
  POS
    N
    V
    ADJ
    ...

The POS type has a "posValue" feature which holds the original tag produced by a tagger. So if a tagger produces "NNP", we write that there. Then we look up NNP in a mapping file that gives us the coarse-grained category, e.g. "N", and the UIMA type for that category (e.g. "de.tudarmstadt....dkpro.pos.N").

The annotation that is actually created will then be this:

N {
  begin: 1
  end: 2
  posValue: "NNP"
}

So we have the coarse-grained category as the annotation type name ("N") and "NNP" as the original tag.
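The lookup described here can be sketched in Python as follows. This is not DKPro Core code: the class names mirror the hierarchy above, but the `MAPPING` table, the `create_pos` helper and the fallback to the generic `POS` type for unmapped tags are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class POS:
    """Base type; pos_value keeps the original tag produced by the tagger."""
    begin: int
    end: int
    pos_value: str


# Coarse-grained categories as subtypes of POS.
class N(POS):
    pass


class V(POS):
    pass


class ADJ(POS):
    pass


# Hypothetical excerpt of a mapping from fine-grained tags to coarse types.
MAPPING = {"NNP": N, "NN": N, "VBZ": V, "JJ": ADJ}


def create_pos(begin, end, tag):
    """Create an annotation whose type is the coarse category of `tag`.

    Unmapped tags fall back to the generic POS type (an assumption here).
    """
    cls = MAPPING.get(tag, POS)
    return cls(begin, end, tag)


ann = create_pos(1, 2, "NNP")
```

Here `ann` is an instance of `N` whose `pos_value` is still `"NNP"`, so both the coarse-grained and the fine-grained information are available on the same annotation.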

A token would then look like this:

Token {
  begin: 1
  end: 2
  pos: (reference to N annotation below)
}

N {
  begin: 1
  end: 2
  posValue: "NNP"
}

Additional POS tags for a token could be created at the same offsets as the token (e.g. for an ensemble-based approach to tagging), but the Token can only ever point to one of them (the canonical one).
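A rough Python sketch of this arrangement, again not DKPro Core code: the `Token` class, the `pos_at` helper and the sample tags are hypothetical, and the choice of which annotation counts as canonical is arbitrary here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class POS:
    begin: int
    end: int
    pos_value: str


class N(POS):
    pass


class V(POS):
    pass


@dataclass
class Token:
    begin: int
    end: int
    pos: Optional[POS] = None  # at most one (canonical) POS annotation


def pos_at(annotations, token):
    """All POS annotations that share the token's offsets."""
    return [
        a for a in annotations
        if isinstance(a, POS) and (a.begin, a.end) == (token.begin, token.end)
    ]


# Two taggers disagree about the token at offsets 1-2.
index = [N(1, 2, "NNP"), V(1, 2, "VB"), N(3, 4, "NN")]
token = Token(1, 2)
token.pos = index[0]  # the token references only the canonical one
```

The second candidate is still reachable through an offset query such as `pos_at(index, token)`, but not through the token itself.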

Again, I'm not advertising any approach - I'm just stating how it is done in DKPro Core. And I have to admit, I've already been thinking about alternative approaches and their consequences.

In general, I find it quite useful to be able to store coarse-grained and fine-grained tags at the same time, e.g. because many CoNLL formats do that. One reason I cannot do a full round-trip when reading and writing CoNLL files is that I cannot properly preserve both the coarse-grained and the fine-grained tags in the DKPro Core type system.
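As an illustration of how such a round-trip loss could arise, the sketch below reads one CoNLL-X line, which carries a coarse tag (CPOSTAG) and a fine tag (POSTAG) in separate columns. It then assumes that only the fine tag is kept and the coarse tag is re-derived from a mapping on output. That mechanism, the `COARSE_OF` table and the sample line are assumptions; the comment above does not spell out the exact cause.

```python
# CoNLL-X columns: ID FORM LEMMA CPOSTAG POSTAG FEATS HEAD DEPREL PHEAD PDEPREL

# Hypothetical mapping from fine-grained tags to coarse categories.
COARSE_OF = {"NNP": "N", "NN": "N", "VBZ": "V"}


def read_line(line):
    """Extract the form and both POS columns from one CoNLL-X token line."""
    cols = line.split("\t")
    return {"form": cols[1], "cpostag": cols[3], "postag": cols[4]}


def rederived_coarse(row):
    """Coarse tag a writer would emit if only the fine tag had been stored."""
    return COARSE_OF.get(row["postag"], "_")


line = "1\tLondon\tlondon\tNOUN\tNNP\t_\t2\tnsubj\t_\t_"
row = read_line(line)
```

In this example the file's coarse tag is `NOUN`, while the re-derived one is `N`, so writing the data back out would not reproduce the input.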

reckart commented 8 years ago

Anyway, I commented on the issue because it was still open, so I wondered whether you are still contemplating changing it from a property to an annotation (and why). If this issue was just left open by accident and you are happy with your choice, great :) Then let's close the issue.

nancyide commented 8 years ago

We are not committed to anything at this point. I think we could accommodate the situation you describe without making POS a separate annotation, but I have no time at the moment to come up with something. I'll try later!



ksuderman commented 8 years ago

We have decided to leave POS tags as properties on Tokens for now. We can revisit this if compelling use cases are encountered.