lapps / vocabulary-pages

DSL files and templates used to generate the LAPPS WS-EV pages.
Apache License 2.0

Semantic tags #92

Open marcverhagen opened 5 years ago

marcverhagen commented 5 years ago

We are about to add some discriminators for semantic category sets that cannot be interpreted as named entity category sets (GO categories for example, and WordNet synsets would fall in this category as well). And we are about to add semtags as a property on Token since we use that for the GOST tagger, and that new tag would be similar to pos except that we allow a list.

With semantic tags, however, we need to deal with the situation where the tag is not on a token but on a phrase. Do we add Chunk to the vocabulary with some generic properties including semtags (also see issue #90)? One thing not to like about that is that semtags would then be defined in two places. Do we want to consider a Semantics type alongside a Morphology type?

ksuderman commented 5 years ago

I added the semtag feature on tokens to get something online quickly so @nancyide could test the GOST service for a paper she is working on. I am not convinced Token is a good place for them, but that is what GOST does so that is what I went with for the demo.

Nancy has since asked that I annotate items tagged with values from the Gene Ontology as "NamedEntities" with @category=Gene, however that is far from ideal as the GO contains labels for many things other than gene names. For example, "learning" (a biological process) gets labelled as "gene" which is obviously not correct.

Perhaps for things like GO tags or WordNet synsets we could repurpose the unused Lookup annotation type with something like:

{
  "@context" : "http://vocab.lappsgrid.org/context-1.0.0.jsonld",
  "metadata" : { },
  "text" : { },
  "views" : [ {
    "id" : "v1",
    "metadata" : {
      "contains" : {
        "http://vocab.lappsgrid.org/Lookup" : {
          "tagSet" : "http://geneontology.org",
          "producer" : "Example",
          "type" : "GO"
        }
      }
    },
    "annotations" : [ {
      "id" : "go-1",
      "@type" : "http://vocab.lappsgrid.org/Lookup",
      "features" : {
        "targets" : [ "some_region_id" ],
        "type" : "GO",
        "ids" : [ "0050877.3.N", "0032501.3.N", "0007612.0.N" ]
      }
    } ]
  }, {
    "id" : "v2",
    "metadata" : {
      "contains" : {
        "http://vocab.lappsgrid.org/Lookup" : {
          "tagSet" : "http://wordnet.princeton.edu",
          "producer" : "Example",
          "type" : "WN"
        }
      }
    },
    "annotations" : [ {
      "id" : "wn-1",
      "@type" : "http://vocab.lappsgrid.org/Lookup",
      "features" : {
        "targets" : [ "some_region_id" ],
        "type" : "WN",
        "ids" : [ "fly%2:38:01::" ]
      }
    } ]
  } ]
}
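
A consumer of a container like this could collect the identifiers per tag set roughly as follows. This is a minimal Python sketch, assuming the container has been parsed into a dict shaped like the example above; `lookup_ids` and the trimmed-down `example` are hypothetical and not part of any LAPPS API.

```python
LOOKUP = "http://vocab.lappsgrid.org/Lookup"

def lookup_ids(container):
    """Map each tagSet URI to the ids found in the Lookup annotations of its view."""
    result = {}
    for view in container.get("views", []):
        contains = view.get("metadata", {}).get("contains", {})
        info = contains.get(LOOKUP)
        if info is None:
            continue  # this view does not declare Lookup annotations
        ids = result.setdefault(info["tagSet"], [])
        for annotation in view.get("annotations", []):
            if annotation.get("@type") == LOOKUP:
                ids.extend(annotation.get("features", {}).get("ids", []))
    return result

# Trimmed version of the two views above (illustrative data only).
example = {
    "views": [
        {
            "id": "v1",
            "metadata": {"contains": {LOOKUP: {"tagSet": "http://geneontology.org", "type": "GO"}}},
            "annotations": [
                {"id": "go-1", "@type": LOOKUP,
                 "features": {"targets": ["some_region_id"], "type": "GO",
                              "ids": ["0050877.3.N", "0032501.3.N", "0007612.0.N"]}}
            ],
        },
        {
            "id": "v2",
            "metadata": {"contains": {LOOKUP: {"tagSet": "http://wordnet.princeton.edu", "type": "WN"}}},
            "annotations": [
                {"id": "wn-1", "@type": LOOKUP,
                 "features": {"targets": ["some_region_id"], "type": "WN",
                              "ids": ["fly%2:38:01::"]}}
            ],
        },
    ]
}

ids_by_tag_set = lookup_ids(example)
```

Keying on the view-level tagSet rather than the per-annotation type feature is a choice made for this sketch; either would work for the example.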

marcverhagen commented 5 years ago

Hmmmm, I like that reuse of something we do not use. The name might not be so great though, because the result may come from some mechanism that is not a lookup. And we would basically use it as a Tag type without calling it Tag.

marcverhagen commented 5 years ago

Is a SemanticTag just one tag or does it contain a list?

This is one of the issues from https://github.com/lapps/vocabulary-pages/issues/95. Assume an Annotation ("endoplasmic reticulum") that has two semantic tags, both GO tags (GO:0044464.3.N and GO:0043229.2.N).

Here is an example with one tag per SemanticTag. First the metadata (recall that we decided against the likes of http://vocab.lappsgrid.org/SemanticTag#GO):

{
    "contains" : {
        "http://vocab.lappsgrid.org/SemanticTag" : {
            "tagSet" : "http://vocab.lappsgrid.org/ns/tagset/sem#go" }}
}

Now the tags as one tag per SemanticTag:

{
    "id" : "st1",
    "@type" : "http://vocab.lappsgrid.org/SemanticTag",
    "features" : {
        "targets" : [ "annotation87" ],
        "label": "GO:0044464.3.N" }
}

{
    "id" : "st2",
    "@type" : "http://vocab.lappsgrid.org/SemanticTag",
    "features" : {
        "targets" : [ "annotation87" ],
        "label": "GO:0043229.2.N" }
}

Now the tags as multiple tags per SemanticTag, borrowing from Keith's comment above:

{
    "id" : "st1",
    "@type" : "http://vocab.lappsgrid.org/SemanticTag",
    "features" : {
        "targets" : [ "annotation87" ],
        "ids": ["GO:0044464.3.N", "GO:0043229.2.N"] }
}

We can quarrel about what the best feature name is (label, labels, ids, categories, ....).

The second is more compact. The advantage of the first is the flexibility it affords if the tag is not just a single label but has all kinds of other features (confidence, sub-label). It would also make it easier to deal with the issue below.

UPDATE. There is an added benefit to multiple tags per SemanticTag: the tags are grouped automatically. For example, GOST assigns a bunch of GO categories to a region, and if you use multiple SemanticTag instances for that list you need a mechanism to show the grouping. Keith prefers multiple tags per SemanticTag since it is closer to what GOST does, and GOST so far is the only semantic tagger we have wrapped. There is the issue of what happens when tags are not just a neat label but come with other information (confidence score, subtypes, et cetera). We would then need to maintain some lists or maps:

{
    "id" : "st1",
    "@type" : "http://vocab.lappsgrid.org/SemanticTag",
    "features" : {
        "targets" : [ "annotation87" ],
        "ids": ["GO:0044464.3.N", "GO:0043229.2.N"],
        "confidence": {
            "GO:0044464.3.N": 0.786,
            "GO:0043229.2.N": 0.452 }
    }
}
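
To make the trade-off concrete, here is a rough Python sketch that expands a multi-tag SemanticTag with a confidence map into the one-tag-per-SemanticTag form. The function name `expand` and the derived ids (`st1-1`, `st1-2`) are assumptions made for illustration; nothing here is settled vocabulary.

```python
SEMANTIC_TAG = "http://vocab.lappsgrid.org/SemanticTag"

def expand(grouped):
    """Split one multi-tag SemanticTag into one annotation per tag."""
    features = grouped["features"]
    confidence = features.get("confidence", {})
    singles = []
    for n, tag in enumerate(features["ids"], start=1):
        single = {
            "id": f"{grouped['id']}-{n}",  # derived id, an assumption of this sketch
            "@type": grouped["@type"],
            "features": {"targets": list(features["targets"]), "label": tag},
        }
        if tag in confidence:
            # Per-tag information moves from the parallel map onto the tag itself.
            single["features"]["confidence"] = confidence[tag]
        singles.append(single)
    return singles

grouped = {
    "id": "st1",
    "@type": SEMANTIC_TAG,
    "features": {
        "targets": ["annotation87"],
        "ids": ["GO:0044464.3.N", "GO:0043229.2.N"],
        "confidence": {"GO:0044464.3.N": 0.786, "GO:0043229.2.N": 0.452},
    },
}

singles = expand(grouped)
```

The expanded form carries the confidence next to its label, but the fact that the two tags came from the same list is now only recoverable from the derived ids.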

Also see the comment "Do we need a Token#semtags property?" below for some related discussion, including on containers.

marcverhagen commented 5 years ago

How to deal with multiple tag sets

Another issue from https://github.com/lapps/vocabulary-pages/issues/95. We thought there were two options.

Option 1. One view, using a local tagSet property. First the metadata. I see two alternatives: you either give a list of tag sets or a default tag set:

Using a list:

{
    "contains" : {
        "http://vocab.lappsgrid.org/SemanticTag" : {
            "tagSet" : [
                "http://vocab.lappsgrid.org/ns/tagset/sem#go",
                "http://vocab.lappsgrid.org/ns/tagset/sem#usas"] }}
}

Using a default:

{
    "contains" : {
        "http://vocab.lappsgrid.org/SemanticTag" : {
            "tagSet" : "http://vocab.lappsgrid.org/ns/tagset/sem#go" }}
}

For the first we have the advantage that the tag sets are in the metadata, but the disadvantage that we make the value a list, which is different from other tag sets around the vocabulary. With using the default we do not have to make the value a list and we could omit the local tagSet feature for the default set. I find the first conceptually a bit cleaner.

Note that we may want to add a dependOn feature as well, which is not actually in the vocabulary yet.

Now the annotations. Assume an Annotation ("endoplasmic reticulum") that has two semantic tags, one GO tag (GO:0044464.3.N) and one USAS tag category (C1, "substances and materials generally").

{
      "id" : "st1",
      "@type" : "http://vocab.lappsgrid.org/SemanticTag",
      "features" : {
        "targets" : [ "annotation87" ],
        "tagSet" : "http://vocab.lappsgrid.org/ns/tagset/sem#go",
        "label": "GO:0044464.3.N" }
}

{
      "id" : "st2",
      "@type" : "http://vocab.lappsgrid.org/SemanticTag",
      "features" : {
        "targets" : [ "annotation87" ],
        "tagSet" : "http://vocab.lappsgrid.org/ns/tagset/sem#usas",
        "label": "C1" }
}

With the GO set as the default, the first semantic tag could leave out the tagSet property. Note that in either case (list of sets or single set), the local tagSet attribute is only needed if we have more than one tag set, which is actually explicit in the list case.
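
A reader of such a view would resolve the tag set of each instance along these lines. This is a minimal Python sketch under the assumptions of Option 1; `effective_tag_set` is a hypothetical helper, and treating a multi-set list without a local tagSet as an error is a choice made here, not something we decided.

```python
SEMANTIC_TAG = "http://vocab.lappsgrid.org/SemanticTag"
GO = "http://vocab.lappsgrid.org/ns/tagset/sem#go"
USAS = "http://vocab.lappsgrid.org/ns/tagset/sem#usas"

def effective_tag_set(annotation, contains):
    """Return the tag set URI that an annotation's label belongs to."""
    local = annotation.get("features", {}).get("tagSet")
    if local is not None:
        return local  # the local property always wins
    declared = contains[annotation["@type"]]["tagSet"]
    if isinstance(declared, str):
        return declared  # "default" variant of the metadata
    if len(declared) == 1:
        return declared[0]  # "list" variant with a single tag set
    raise ValueError("several tag sets declared but no local tagSet on " + annotation["id"])

default_contains = {SEMANTIC_TAG: {"tagSet": GO}}
list_contains = {SEMANTIC_TAG: {"tagSet": [GO, USAS]}}

st1 = {"id": "st1", "@type": SEMANTIC_TAG,
       "features": {"targets": ["annotation87"], "label": "GO:0044464.3.N"}}
st2 = {"id": "st2", "@type": SEMANTIC_TAG,
       "features": {"targets": ["annotation87"], "tagSet": USAS, "label": "C1"}}
```

With the default variant st1 resolves to the GO set without a local property; with the two-element list the same annotation is ambiguous, which is the sense in which the list makes the need for a local tagSet explicit.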

Option 2. Two views. Well, no examples here; this should be obvious. The advantage is simplicity of views, the disadvantage is extra views for these cases.

UPDATE. After discussion with Keith, we came up with something we like better for option 1. We like the list value for the tagSet metadata property. We can get rid of the tagSet property on the instances by using a prefix for the label:

{
      "id" : "st1",
      "@type" : "http://vocab.lappsgrid.org/SemanticTag",
      "features" : {
        "targets" : [ "annotation87" ],
        "label": "go:GO:0044464.3.N" }
}

{
      "id" : "st2",
      "@type" : "http://vocab.lappsgrid.org/SemanticTag",
      "features" : {
        "targets" : [ "annotation87" ],
        "label": "usas:C1" }
}

The prefix is only used when there are multiple tag sets. It would be a suffix of the tag set discriminator, picking a suffix that discriminates between them, in this case just go and usas. If we have tag set discriminators http://vocab.lappsgrid.org/ns/tagset/sem#go and http://vocab.lappsgrid.org/ns/tagset/basic#go we would use sem#go and basic#go.
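
The prefix scheme could be implemented roughly as below. This is a Python sketch only: `suffixes`, `assign_prefixes` and `split_label` are hypothetical names, and always picking the shortest suffix level at which all tag sets differ is an assumption of the sketch (the discussion only says "a suffix that discriminates").

```python
def suffixes(uri):
    """List suffixes of a tag set URI from shortest to longest."""
    out = []
    _, sep, fragment = uri.rpartition("#")
    if sep:
        out.append(fragment)  # e.g. "go"
    parts = uri.split("/")
    for i in range(len(parts) - 1, -1, -1):
        out.append("/".join(parts[i:]))  # e.g. "sem#go", "tagset/sem#go", ...
    return out

def assign_prefixes(tag_sets):
    """Map a discriminating prefix to each tag set URI; no prefixes for a single set."""
    if len(tag_sets) < 2:
        return {}
    candidates = [suffixes(uri) for uri in tag_sets]
    for level in range(min(len(c) for c in candidates)):
        chosen = [c[level] for c in candidates]
        if len(set(chosen)) == len(chosen):
            return dict(zip(chosen, tag_sets))
    raise ValueError("tag set URIs cannot be told apart")

def split_label(label, prefixes):
    """Return (tag set URI or None, bare label) for a possibly prefixed label."""
    for prefix, uri in prefixes.items():
        if label.startswith(prefix + ":"):
            return uri, label[len(prefix) + 1:]
    return None, label

GO = "http://vocab.lappsgrid.org/ns/tagset/sem#go"
USAS = "http://vocab.lappsgrid.org/ns/tagset/sem#usas"
BASIC_GO = "http://vocab.lappsgrid.org/ns/tagset/basic#go"

prefixes = assign_prefixes([GO, USAS])
```

Matching against the known prefixes, rather than splitting on the first colon, matters because GO labels such as GO:0044464.3.N contain a colon themselves.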

marcverhagen commented 5 years ago

Do we need a Token#semtags property?

(Another issue from https://github.com/lapps/vocabulary-pages/issues/95).

We don't, since you can easily trace the tag to the token, but do we want it? The question is whether the token should be directly aware of whether it has semantic tags. There is precedent here: a PhraseStructure annotation knows what constituents it has. It is probably nice to have the tags available in a semtags list. The problem is that we would then probably also need to put that feature on other categories, and we should then really have it on the Annotation type.
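
For reference, tracing tags back to their targets without a semtags property amounts to building an index over the targets feature. A minimal Python sketch, assuming one label per SemanticTag as in the earlier examples; `semtags_by_target` is a hypothetical helper:

```python
from collections import defaultdict

SEMANTIC_TAG = "http://vocab.lappsgrid.org/SemanticTag"

def semtags_by_target(annotations):
    """Index SemanticTag labels by the id of the annotation they target."""
    index = defaultdict(list)
    for annotation in annotations:
        if annotation.get("@type") != SEMANTIC_TAG:
            continue
        features = annotation.get("features", {})
        for target in features.get("targets", []):
            index[target].append(features["label"])
    return dict(index)

annotations = [
    {"id": "annotation87", "@type": "http://vocab.lappsgrid.org/Token",
     "features": {"word": "reticulum"}},  # illustrative token, not from the thread
    {"id": "st1", "@type": SEMANTIC_TAG,
     "features": {"targets": ["annotation87"], "label": "GO:0044464.3.N"}},
    {"id": "st2", "@type": SEMANTIC_TAG,
     "features": {"targets": ["annotation87"], "label": "GO:0043229.2.N"}},
]

index = semtags_by_target(annotations)
```

A semtags property would store the result of this lookup on the annotation itself, so the cost is one pass over the view rather than anything structural.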

UPDATE. We realized this property might be a nice place to group tags. If we use multiple instances of SemanticTag to encode a list of tags as given by a tool like GOST, we lose the connection between those tags. With semtags we have that grouping. An alternative is to give SemanticTag a group attribute. Yet another thing we discussed was the potential use for a generic List annotation type. We have lists of course as values of many properties. This would be an actual annotation type in the vocabulary that allows you to associate a list of annotations with some region (it would be a direct subtype of Region). You could restrict all elements of the list to be of the same type if you want.

Example of a list of annotations:

{
    "id": "list1",
    "@type": "http://vocab.lappsgrid.org/List",
    "features": {
        "type": "http://vocab.lappsgrid.org/SemanticTag",
        "elements": [ "st1", "st2" ] }
}

{
      "id" : "st1",
      "@type" : "http://vocab.lappsgrid.org/SemanticTag",
      "features" : {
        "targets" : [ "list1" ],
        "label": "go:GO:0044464.3.N" }
}

{
      "id" : "st2",
      "@type" : "http://vocab.lappsgrid.org/SemanticTag",
      "features" : {
        "targets" : [ "list1" ],
        "label": "usas:C1" }
}
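
Reading such a list back could look like the following Python sketch. It assumes the optional type feature is meant as a constraint on all elements, which is only one reading of the proposal; `list_labels` is a hypothetical helper.

```python
SEMANTIC_TAG = "http://vocab.lappsgrid.org/SemanticTag"
LIST = "http://vocab.lappsgrid.org/List"  # proposed type, not in the vocabulary yet

def list_labels(list_id, annotations):
    """Return the labels of the annotations grouped by a List annotation."""
    by_id = {a["id"]: a for a in annotations}
    group = by_id[list_id]
    wanted = group["features"].get("type")
    labels = []
    for element_id in group["features"]["elements"]:
        element = by_id[element_id]
        if wanted is not None and element["@type"] != wanted:
            raise ValueError(f"{element_id} is not of type {wanted}")
        labels.append(element["features"]["label"])
    return labels

annotations = [
    {"id": "list1", "@type": LIST,
     "features": {"type": SEMANTIC_TAG, "elements": ["st1", "st2"]}},
    {"id": "st1", "@type": SEMANTIC_TAG,
     "features": {"targets": ["list1"], "label": "go:GO:0044464.3.N"}},
    {"id": "st2", "@type": SEMANTIC_TAG,
     "features": {"targets": ["list1"], "label": "usas:C1"}},
]

labels = list_labels("list1", annotations)
```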

We could have some container types like this (List, Set, Map). We may or may not want them to be listed in the vocabulary explicitly.