Preserve information about tagset

GoogleCodeExporter commented 9 years ago

There is no way to store in the DKPro type system which tagset is used in a 
particular layer, e.g. that the part-of-speech layer uses the "STTS" tagset. 

If we have a writer for a data format which can make use of this information, 
we currently have to introduce special writer parameters or find some other way 
to pass this information to the writer. It would be better if the writer 
could just read this information from the CAS.

An option would be to introduce a new type "TagSet" which has a "name" (e.g. 
STTS) and a "layer" (e.g. POS). We could also consider introducing a type 
"Tag" which represents the individual tags within the tagset.

Any component which creates annotations with tag values should then create 
such an annotation in the CAS and populate it.

Original issue reported on code.google.com by richard.eckart on 25 Jun 2013 at 10:45

GoogleCodeExporter commented 9 years ago

Original comment by richard.eckart on 25 Jun 2013 at 10:46

Changed state: New

GoogleCodeExporter commented 9 years ago

Original comment by richard.eckart on 25 Jun 2013 at 10:56

GoogleCodeExporter commented 9 years ago

Hi,
this sounds very useful and important. Could such a type be used for tagging 
text with UBY-"tags"?
E.g., with the "TagSet" type version, that would be something like "name" = 
ubySemanticTag and "layer" = semantics.

Best
Judith

Original comment by eckle.kohler on 26 Jun 2013 at 10:20

GoogleCodeExporter commented 9 years ago

What I suggest is to preserve information about the tagset, its name, layer, 
and its tags. This is meta information which doesn't actually tag anything. 

For example, consider running the OpenNLP tagger with a model for German. It 
creates annotations of the type POS (or subtypes) which carry a feature 
"posValue". What actual values can "posValue" assume and from which inventory 
do they come? To record this information, one single "TagSet" feature structure 
(not annotation) could be added to the CAS such as:

TagSet {
  name: "STTS"
  layer: "de.tudarmstadt.ukp.dkpro.core.api.lexmorph.POS"
  tags: { "NNP", "NN", "ADJ", ... }
}

None of this information refers directly to the text. The text, however, is 
annotated with the POS annotations, e.g.

de.tudarmstadt.ukp.dkpro.core.api.lexmorph.POS {
  begin: 10
  end: 15
  posValue: "NNP"
}

What does "NNP" mean? I could look up the TagSet feature structures, search for 
the one that applies to the POS layer 
(de.tudarmstadt.ukp.dkpro.core.api.lexmorph.POS or any subtypes) and see that 
"NNP" belongs to the "STTS" tagset. One could imagine adding a link to some 
external normative resource, e.g. ISOCat. For the moment, that's beyond my 
use case, though.
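
The lookup described above could be sketched roughly as follows. This is plain Java, not the DKPro Core or UIMA API: the class names TagSetInfo and TagSetLookupSketch and the method findTagSet are made up for illustration, the tag values are copied from the example above, and subtypes of the layer are deliberately not handled.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical stand-in for the proposed "TagSet" feature structure.
// NOT the DKPro Core / UIMA API; all names are illustrative assumptions.
final class TagSetInfo {
    final String name;      // e.g. "STTS"
    final String layer;     // fully qualified type name of the annotated layer
    final Set<String> tags; // closed inventory of tag values

    TagSetInfo(String name, String layer, String... tags) {
        this.name = name;
        this.layer = layer;
        this.tags = new LinkedHashSet<>(Arrays.asList(tags));
    }
}

final class TagSetLookupSketch {
    // Return the first recorded tagset that is declared for exactly this layer
    // and contains the tag. Subtype matching is left out of this sketch.
    static Optional<TagSetInfo> findTagSet(List<TagSetInfo> recorded, String layer, String tag) {
        return recorded.stream()
                .filter(ts -> ts.layer.equals(layer))
                .filter(ts -> ts.tags.contains(tag))
                .findFirst();
    }

    public static void main(String[] args) {
        String posLayer = "de.tudarmstadt.ukp.dkpro.core.api.lexmorph.POS";
        List<TagSetInfo> recorded = Arrays.asList(
                new TagSetInfo("STTS", posLayer, "NNP", "NN", "ADJ"));

        String name = findTagSet(recorded, posLayer, "NNP")
                .map(ts -> ts.name)
                .orElse("(no tagset recorded)");
        System.out.println("NNP belongs to: " + name);
    }
}
```

In a real CAS the list of recorded tagsets would come from iterating over the feature structures of the new type; how that type ends up being defined is not settled here.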

I imagine this can be used for components such as POS taggers or parsers, but 
e.g. not for lemmatizers or stemmers, because these do not have the notion of a 
closed controlled vocabulary. Or if they have, the vocabulary may be very large 
and it would be inconvenient to fully record it in the CAS.

If there were an annotator which used Uby to create annotations, I could 
imagine that this annotator could also add TagSet information to the CAS, 
informing the user/downstream components which controlled vocabulary the "tags" 
come from. I fear, though, that a user of Uby may be looking for either 
something way more sophisticated than what I suggest here, e.g. recording full 
lexical entries in the CAS, or something simpler, e.g. recording a link to a Uby 
lexical entry in the CAS.

Original comment by richard.eckart on 26 Jun 2013 at 10:37

GoogleCodeExporter commented 9 years ago

>> I fear, though, that a user of Uby may be looking for either something way 
more sophisticated than what I suggest here, e.g. recording full lexical entries 
in the CAS, or something simpler, e.g. recording a link to a Uby lexical entry 
in the CAS. 

In many applications, a user might not be interested in such complex 
information from Uby. So your new type might be actually useful for semantic 
tagging with Uby.
We should discuss it F2F, because I aggregated some more ideas on that.

Original comment by eckle.kohler on 26 Jun 2013 at 10:44

GoogleCodeExporter commented 9 years ago

My primary use case right now would be to write this tagset information in a 
writer. 

In the TcfWriter from WebAnno, the tagset names are currently hard-coded, which 
is bad. I would like to avoid having to add parameters for the tagset names and 
instead read them from the CAS.
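
As a rough illustration of what "read them from the CAS" could mean for a writer, here is a minimal plain-Java sketch. It does not use the TcfWriter or UIMA API: the map of layer to tagset name stands in for whatever would be recorded in the CAS, and the class name, method name, layer key and fallback value are all assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, NOT the TcfWriter or UIMA API.
final class WriterTagSetSketch {
    // Prefer the tagset name recorded for the layer; use the fallback only
    // when nothing was recorded (e.g. an upstream component did not set it).
    static String resolveTagSetName(Map<String, String> recordedByLayer, String layer, String fallback) {
        return recordedByLayer.getOrDefault(layer, fallback);
    }

    public static void main(String[] args) {
        // Stand-in for tagset information found in the CAS (assumed layout).
        Map<String, String> recordedByLayer = new HashMap<>();
        recordedByLayer.put("POS", "STTS");

        System.out.println(resolveTagSetName(recordedByLayer, "POS", "unknown"));
        System.out.println(resolveTagSetName(recordedByLayer, "Dependency", "unknown"));
    }
}
```

With something along these lines, a writer parameter would only be needed as a fallback or override, not as the primary source of the tagset name.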

The tagset information could also be used by other writers. E.g. the Negra 
export format supports tagset definitions. We do not have a NegraExportWriter 
yet. We have a NegraExportReader, though, which could actually read tagset 
information from Negra files and record it in the CAS.

Another conceivable use case would be to have components validate their 
compatibility at runtime. We noted that the Penn Tagset used by the TreeTagger 
model for English is not the same as the one expected by the Stanford parser. 
This caused problems when we used the TreeTagger to create POS tags and then 
used the StanfordParser only to create the constituency structure, based on 
the TreeTagger POS tags. If the TreeTagger component recorded the tagset in the 
CAS, the StanfordParser could look at this information and issue a warning or 
error if the tagset does not correspond to the ones expected by the parser.
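
Such a runtime check could look roughly like the sketch below. It is plain Java and not part of the TreeTagger or StanfordParser components: the class and method names are invented, and the tagset identifiers "ptb" and "ptb-tt" are placeholders for whatever names would actually be recorded.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch of a tagset compatibility check; not an existing API.
final class TagSetCompatibilitySketch {
    // Returns a warning message if the recorded tagset is missing or not among
    // the expected ones, and an empty Optional if everything matches.
    static Optional<String> checkTagSet(String recordedName, Set<String> expectedNames) {
        if (recordedName == null) {
            return Optional.of("No tagset recorded; cannot verify compatibility.");
        }
        if (!expectedNames.contains(recordedName)) {
            return Optional.of("Tagset '" + recordedName
                    + "' is not among the expected tagsets " + expectedNames + ".");
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Placeholder identifiers: "ptb" for what the parser expects,
        // "ptb-tt" for the variant produced by the tagger model.
        Set<String> expectedByParser = new HashSet<>(Arrays.asList("ptb"));

        checkTagSet("ptb-tt", expectedByParser)
                .ifPresent(msg -> System.err.println("WARNING: " + msg));
    }
}
```

Whether a mismatch should only be logged or should abort processing would be a choice for each component, e.g. via a parameter.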

Original comment by richard.eckart on 26 Jun 2013 at 10:53

GoogleCodeExporter commented 9 years ago

Original comment by richard.eckart on 27 Jun 2013 at 4:31

Changed state: Started

GoogleCodeExporter commented 9 years ago

Original comment by richard.eckart on 4 Aug 2013 at 9:14

Changed state: Fixed
Added labels: Milestone-1.5.0, Module-api.resources

google-code-export / dkpro-core-asl

Preserve information about tagset #168