lapps / vocabulary-pages

DSL files and templates used to generate the LAPPS WS-EV pages.
Apache License 2.0

Tagset for GOST tagger #89

Open marcverhagen opened 5 years ago

marcverhagen commented 5 years ago

I am adding CLAWS tag sets to the vocabulary (so far just CLAWS5 and CLAWS7), but it is not clear which one the GOST tagger is using. It is clearly not CLAWS5, as shown in the table below for the tags that appear in the GOST output for MASC3-0203 (which, by the way, only gives 23 tokens), but CLAWS7 isn't it either.

| tag | CLAWS5 | CLAWS7 |
|---|---|---|
| MCMC | - | + |
| NN1 | + | + |
| NNU | - | + |
| NP1 | - | + |
| VV0 | - | + |
| YCOL | - | - |
| YDSH | - | - |

Must look into the other CLAWS tag sets.
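
The comparison in the table can be sketched as a small coverage check. This is only a sketch: the membership sets below are not the full CLAWS5 or CLAWS7 tag sets, just the "+" cells of the table above, and the function name `uncovered` is made up for illustration.

```python
# Tags observed in the GOST output for MASC3-0203 (taken from the table above).
observed = ["MCMC", "NN1", "NNU", "NP1", "VV0", "YCOL", "YDSH"]

# Partial membership only: the "+" cells of the table, NOT the full tag sets.
known_members = {
    "CLAWS5": {"NN1"},
    "CLAWS7": {"MCMC", "NN1", "NNU", "NP1", "VV0"},
}


def uncovered(tags, members):
    """Return the observed tags that are missing from a tag set, in input order."""
    return [tag for tag in tags if tag not in members]


report = {name: uncovered(observed, members) for name, members in known_members.items()}
for name, missing in report.items():
    status = "candidate" if not missing else "ruled out by " + ", ".join(missing)
    print(f"{name}: {status}")
```

A tag set remains a candidate only if every observed tag is in it, so adding more observed tags or more tag sets to the two dictionaries narrows the search.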

marcverhagen commented 5 years ago

Probably CLAWS6 or CLAWS8, but need to look at more tags. Note that YCOL and YDSH are punctuation tags and that CLAWS7 is CLAWS6 minus punctuation tags.

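| tag | CLAWS5 | CLAWS6 | CLAWS7 | CLAWS8 |
|---|---|---|---|---|
| MCMC | - | + | + | + |
| NN1 | + | + | + | + |
| NNU | - | + | + | + |
| NP1 | - | + | + | + |
| VV0 | - | + | + | + |
| YCOL | - | + | - | + |
| YDSH | - | + | - | + |

marcverhagen commented 5 years ago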

In addition to the POS tags, GOST also produces semantic tags drawn from the 200+ basic semantic tags of the UCREL Semantic Analysis System (USAS, http://ucrel.lancs.ac.uk/usas/), as well as identifiers from the GO ontology. The GOST service stores these in a list-valued semtags attribute on the Token, where the list holds either one USAS semantic tag or one or more GO categories.

Because we have two tag sets for the same property, we need to define this in the metadata a bit differently from existing tag set definitions, where we just give a single URI. For example, for the value of posTagSet on Token we can use a URI pointing to a tag set discriminator in the vocabulary. Now that we have both USAS types and GO categories in the semtags property, we need to be able to say that in the metadata.

| Properties | Types | Description |
|---|---|---|
| semanticTags | List of String or URI | Semantic types that can be used in the semtags property |

So in the metadata we can say:

{ "contains": {
   "http://vocab.lappsgrid.org/Token": {
      "semanticTags": [ "tags-sem-bio-go", "tags-sem-basic-asus" ] }}}
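
A consumer of this metadata could read the declared tag sets roughly as follows. This is only a sketch: the helper `semantic_tag_sets` is hypothetical and not part of any LAPPS API, and it assumes the metadata has exactly the shape shown above.

```python
import json

TOKEN = "http://vocab.lappsgrid.org/Token"

# The metadata fragment from this comment, unchanged.
metadata_json = """
{ "contains": {
   "http://vocab.lappsgrid.org/Token": {
      "semanticTags": [ "tags-sem-bio-go", "tags-sem-basic-asus" ] }}}
"""


def semantic_tag_sets(metadata, annotation_type=TOKEN):
    """Return the tag set names declared for semtags on an annotation type.

    Hypothetical helper; returns an empty list when nothing is declared.
    """
    entry = metadata.get("contains", {}).get(annotation_type, {})
    return list(entry.get("semanticTags", []))


metadata = json.loads(metadata_json)
declared = semantic_tag_sets(metadata)
print(declared)
```

Because semanticTags is a list, a service can declare one, both, or neither of the tag sets without changing the shape of the metadata.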

For the full names I am proposing one of the following:

  1. ns/tagset/sem#basic-asus and ns/tagset/sem#bio-go
  2. ns/tagset/sem-basic#asus and ns/tagset/sem-bio#go
  3. ns/tagset/sem/basic#asus and ns/tagset/sem/bio#go

I think I prefer the last one because the number of different sets of semantic tags may become large.

marcverhagen commented 5 years ago

For the full names we are now leaning towards not creating subdirectories for http://vocab.lappsgrid.org/ns/tagset/sem, so we would get something like

| name | url |
|---|---|
| tags-sem-asus | http://vocab.lappsgrid.org/ns/tagset/sem#asus |
| tags-sem-bio-go | http://vocab.lappsgrid.org/ns/tagset/sem#bio-go |
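
With this naming, resolving a short name to its URL is a plain lookup. A minimal sketch, assuming only the two names in the table above; the `resolve` function is illustrative and not an existing LAPPS function.

```python
# Short tag set names and their URLs, copied from the table above.
TAGSET_URLS = {
    "tags-sem-asus": "http://vocab.lappsgrid.org/ns/tagset/sem#asus",
    "tags-sem-bio-go": "http://vocab.lappsgrid.org/ns/tagset/sem#bio-go",
}


def resolve(names):
    """Map short tag set names to URLs; raise ValueError on unknown names."""
    unknown = [name for name in names if name not in TAGSET_URLS]
    if unknown:
        raise ValueError("unknown tag set name(s): " + ", ".join(unknown))
    return [TAGSET_URLS[name] for name in names]


urls = resolve(["tags-sem-bio-go", "tags-sem-asus"])
print(urls)
```

Failing loudly on an unknown name is a design choice for the sketch: it surfaces metadata that still uses a name from an earlier proposal.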
nancyide commented 5 years ago

Keith and I discussed what to do about semantic tags (not entirely related to the below, I think)—we decided on a new view (layer) called SemanticTag, which could also be used for sense tags etc.

What is asus?

On May 2, 2019, at 11:50 AM, marcverhagen wrote:

For the full names we are now leaning towards not creating subdirectories for http://vocab.lappsgrid.org/ns/tagset/sem, so we would get something like

| name | url |
|---|---|
| tags-sem-asus | http://vocab.lappsgrid.org/ns/tagset/sem#asus |
| tags-sem-bio-go | http://vocab.lappsgrid.org/ns/tagset/sem#bio-go |




marcverhagen commented 5 years ago

The asus discriminator refers to the 200+ semantic tags used by the UCREL Semantic Analysis System (USAS), which appear in the GOST output.