Open marcverhagen opened 5 years ago
Probably CLAWS6 or CLAWS8, but need to look at more tags. Note that YCOL and YDSH are punctuation tags and that CLAWS7 is CLAWS6 minus punctuation tags.
tag | CLAWS5 | CLAWS6 | CLAWS7 | CLAWS8 |
---|---|---|---|---|
MCMC | - | + | + | + |
NN1 | + | + | + | + |
NNU | - | + | + | + |
NP1 | - | + | + | + |
VV0 | - | + | + | + |
YCOL | - | + | - | + |
YDSH | - | + | - | + |
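The elimination behind "probably CLAWS6 or CLAWS8" can be sketched in a few lines. This is only an illustration over the rows of the table above; the names `TAGSETS`, `MEMBERSHIP` and `candidate_tagsets` are made up for the sketch and are not part of any LAPPS or CLAWS tooling.

```python
# Membership copied from the table above: True means the tag set defines the tag.
# Column order follows the table header.
TAGSETS = ("CLAWS5", "CLAWS6", "CLAWS7", "CLAWS8")

MEMBERSHIP = {
    "MCMC": (False, True, True, True),
    "NN1": (True, True, True, True),
    "NNU": (False, True, True, True),
    "NP1": (False, True, True, True),
    "VV0": (False, True, True, True),
    "YCOL": (False, True, False, True),
    "YDSH": (False, True, False, True),
}


def candidate_tagsets(observed_tags):
    """Return the tag sets that define every tag in observed_tags."""
    candidates = []
    for column, name in enumerate(TAGSETS):
        if all(MEMBERSHIP[tag][column] for tag in observed_tags):
            candidates.append(name)
    return candidates


# All seven tags seen so far in the GOST output.
print(candidate_tagsets(MEMBERSHIP))  # prints ['CLAWS6', 'CLAWS8']
```

Leaving out the two punctuation tags (YCOL, YDSH) would bring CLAWS7 back into the candidate list, which is why more tags need to be checked before settling on one tag set.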
In addition to the POS tags, GOST also produces semantic tags drawn from the 200+ basic semantic tags of the UCREL Semantic Analysis System (USAS, http://ucrel.lancs.ac.uk/usas/), as well as identifiers from the GO ontology. The GOST service puts these in a list-valued semtags attribute on the Token, where the list holds either one USAS semantic tag or one or more GO categories.
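A rough sketch of what "either one USAS tag or one or more GO categories" means for a consumer of semtags. It assumes that GO categories show up as standard GO accessions (`GO:` followed by seven digits); I have not verified that GOST emits them in that form, and the function name `semtags_kind` and the sample values are purely illustrative.

```python
import re

# Assumption: GO categories appear as standard GO accessions such as "GO:0006915".
GO_ID = re.compile(r"GO:\d{7}")


def semtags_kind(semtags):
    """Return 'go' or 'usas' for a well-formed semtags list, else raise ValueError."""
    if not semtags:
        raise ValueError("semtags must not be empty")
    is_go = [GO_ID.fullmatch(tag) is not None for tag in semtags]
    if all(is_go):
        return "go"  # one or more GO categories
    if len(semtags) == 1:
        return "usas"  # exactly one non-GO tag, taken to be a USAS tag
    raise ValueError("expected one USAS tag or one or more GO categories")


print(semtags_kind(["Z5"]))
print(semtags_kind(["GO:0006915", "GO:0005515"]))
```

Anything that is not a GO accession is treated as a USAS tag here, which is a simplification; a real check would look the tag up in the USAS tag list.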
Because we have two tag sets for the same property, we need to define this in the metadata a bit differently from existing tag set definitions, where we just give a URI. For example, for the value of posTagSet on Token we can use a URI pointing to a tag set discriminator in the vocabulary. Now that we have both USAS types and GO categories in the semtags property, we need to be able to say that in the metadata:
Properties | Types | Description |
---|---|---|
semanticTags | List of String or URI | Semantic types that can be used in the semtags property |
So in the metadata we can say:
{
  "contains": {
    "http://vocab.lappsgrid.org/Token": {
      "semanticTags": [ "tags-sem-bio-go", "tags-sem-basic-asus" ]
    }
  }
}
For the full names I am proposing one of the following:
- ns/tagset/sem#basic-asus and ns/tagset/sem#bio-go
- ns/tagset/sem-basic#asus and ns/tagset/sem-bio#go
- ns/tagset/sem/basic#asus and ns/tagset/sem/bio#go
I think I prefer the last one because the number of different sets of semantic tags may become large.
For the full names we are now leaning towards not creating subdirectories under http://vocab.lappsgrid.org/ns/tagset/sem, so we would get something like
name | url |
---|---|
tags-sem-asus | http://vocab.lappsgrid.org/ns/tagset/sem#asus |
tags-sem-bio-go | http://vocab.lappsgrid.org/ns/tagset/sem#bio-go |
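To make the naming scheme concrete, here is a minimal sketch that turns the short names listed under semanticTags into the full URLs from the table above. The rule "strip `tags-sem-` and append the rest as a fragment" is my generalization from the two rows, and `tagset_url` and `semantic_tagset_urls` are hypothetical helpers, not existing LAPPS functions.

```python
SEM_NAMESPACE = "http://vocab.lappsgrid.org/ns/tagset/sem"
NAME_PREFIX = "tags-sem-"
TOKEN = "http://vocab.lappsgrid.org/Token"


def tagset_url(name):
    """Map a short name like 'tags-sem-bio-go' to its full URL (assumed rule)."""
    discriminator = name[len(NAME_PREFIX):]
    if not name.startswith(NAME_PREFIX) or not discriminator:
        raise ValueError(f"not a semantic tag set name: {name!r}")
    return f"{SEM_NAMESPACE}#{discriminator}"


def semantic_tagset_urls(metadata, annotation_type=TOKEN):
    """Resolve every name in the semanticTags list of one annotation type."""
    entry = metadata.get("contains", {}).get(annotation_type, {})
    return [tagset_url(name) for name in entry.get("semanticTags", [])]


# Metadata shaped like the proposal in this thread, using the names from the table.
metadata = {"contains": {TOKEN: {"semanticTags": ["tags-sem-bio-go", "tags-sem-asus"]}}}
for url in semantic_tagset_urls(metadata):
    print(url)
```

With a flat namespace like this, adding another semantic tag set only means adding a new fragment, not a new subdirectory.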
Keith and I discussed what to do about semantic tags (not entirely related to the below, I think). We decided on a new view (layer) called SemanticTag, which could also be used for sense tags etc.
What is asus?
On May 2, 2019, at 11:50 AM, marcverhagen wrote:

> For the full names we are now leaning towards not creating subdirectories for http://vocab.lappsgrid.org/ns/tagset/sem, so we would get something like
>
> name | url |
> ---|---|
> tags-sem-asus | http://vocab.lappsgrid.org/ns/tagset/sem#asus |
> tags-sem-bio-go | http://vocab.lappsgrid.org/ns/tagset/sem#bio-go |
Nancy Ide Professor of Computer Science
The asus discriminator refers to the 200+ semantic tags used by the UCREL Semantic Analysis System (USAS), and they are in the GOST output.
I am adding CLAWS tag sets to the vocabulary (so far just CLAWS5 and CLAWS7), but it is not clear what the GOST tagger is using. It is clearly not CLAWS5, as shown in the table below for the tags that appear in the GOST output for MASC3-0203 (which, by the way, only gives 23 tokens), but CLAWS7 isn't it either.
Must look into the other CLAWS tag sets.