adding discriminators for different tagsets for POS tagging

marcverhagen commented 7 years ago

The core here is that some pos tag sets were added to the meta data, the rest is bookkeeping.

marcverhagen commented 7 years ago

Hmmm, I am now second guessing the naming of the discriminators. So I used brown and penn-tb and they would point to http://vocab.lappsgrid.org/ns/tagset/pos#brown and http://vocab.lappsgrid.org/ns/tagset/pos#penntb (don't bother clicking the links, the pages do not exist yet).

But I think that when it comes to adding all these tagsets for different kinds of applications, they should really be named pos-brown and pos-penntb. Later we may need to add discriminators for tools that have their own tag sets for each application. Let's say, for the sake of argument, that Stanford is one of them: we could have pos-stanford, ner-stanford, etcetera.

reckart commented 7 years ago

... and even different versions of a tool / model can come with different tagsets ;)

marcverhagen commented 7 years ago

Yes, so we need to be prepared to have pos-penntb-tt-v1 and pos-penntb-tt-v2 for two different treebank-inspired tagsets for the TreeTagger (I am using this example because I vaguely remember that at some point about 5-10 years ago the tags produced by the TreeTagger changed).

ksuderman commented 7 years ago

Why did you delete the download links to the binaries for the various tools from the README? Users shouldn't be required to build three different projects from source simply to generate the vocabulary pages.

To address the other points raised:

We already have pos in the URL path, so I don't think it needs to be repeated in the tag set name.
We should agree on a common naming convention so we can safely generate set names without much fear of name collisions.

Something like <vendor>-<family>-<version>

I think having generic names for the most common tag sets is also a good idea as some tools are not that specific about what version of a tag set they are using, and some tag sets are not properly versioned, e.g. I think the Stanford tools make their own proprietary changes to the Penn TB tag set.
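A minimal sketch of how such a convention could be applied, with a generic name falling out when no vendor or version is given. The function name `tagset_name` and the lower-casing rule are assumptions made for illustration; nothing like this exists in the repository.

```python
def tagset_name(family, vendor=None, version=None):
    """Join the parts that are present as <vendor>-<family>-<version>.

    Hypothetical helper: omitting vendor and version yields the
    generic name for the tag set family.
    """
    parts = [vendor, family, version]
    return "-".join(p.strip().lower() for p in parts if p)


print(tagset_name("penntb"))                                   # penntb
print(tagset_name("penntb", vendor="stanford"))                # stanford-penntb
print(tagset_name("penntb", vendor="stanford", version="v2"))  # stanford-penntb-v2
```

Generating names from their parts in one place is one way to keep collisions unlikely, since two tag sets only clash if vendor, family and version all match.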

Examples

marcverhagen commented 7 years ago

Why did I delete those links? Good question. This was either a side effect of many cuts and pastes or, more likely, because those links were a bit redundant, since the repository itself includes the jars that are downloaded. If we have both the links and the jars in the repository, the jars can get out of sync: each time a jar is updated at the download site it needs to be updated in the repository, and vice versa.

marcverhagen commented 7 years ago

> To address the other points raised:
>
> We already have pos in the URL path, I don't think it needs to be repeated in the tag set name.
>
> We should agree on a common naming convention so we can safely generate set names without much fear of name collisions.
>
> Something like vendor-family-version

Yes, pos is in the path, and yes, we need a convention. I still think it is attractive to use pos-penntb instead of penntb, since we may also need to add cat-penntb for the categories, and I prefer having the pair pos-penntb / cat-penntb over the pair penntb / cat-penntb.

keighrim commented 7 years ago

> We already have pos in the URL path, I don't think it needs to be repeated in the tag set name.

I'm with @ksuderman on this.

> we may also need to add cat-penntb for the categories, and I prefer having the pair pos-penntb / cat-penntb over the pair penntb / cat-penntb.

I believe we'll have a pair of (http://../pos#penntb - http://.../cat#penntb), at least in machine generated LIFs.

By the way, the cat is for labels of non-terminal constituents in ps-parsing, right?

marcverhagen commented 7 years ago

Hmmm, I appear to be outvoted so how I had it originally may be fine. We should still talk about the naming convention when we get a chance.

Yes, cat is for non-terminals.

ksuderman commented 7 years ago

AFAIK the repositories do not contain the downloadable jars; the releases page contains zip files and tarballs for the entire repo, not the jar files. The links I added are to the package-latest.tgz files on downloads.lappsgrid.org, so they should always retrieve the latest executable, as those files get deployed as part of the release process.

keighrim commented 7 years ago

I think @marcverhagen was referring to the binaries in https://github.com/lapps/vocabulary-pages/tree/master/bin . I agree with @marcverhagen that, the way things are now, they could very easily get out of sync. Unless we update this repository every time ghc, vocab-dsl, or discriminator-dsl gets an update, I think we should remove the bin directory and stay with the links to downloads.lappsgrid.org.

ksuderman commented 7 years ago

Right, I forgot about the bin directory. However, I don't think it is a bad idea to have both and there shouldn't be too much danger of things getting out of sync. I just checked the three projects and each of them will upload the latest tarballs to http://www.anc.org/downloads and then generate a pull request here with the latest .jar file.

ksuderman commented 7 years ago

On reflection, I think I'll revert my commit and merge this PR as it was after Marc's last commit.

marcverhagen commented 6 years ago

Trying to make sense of tag sets and reviving the pos-penntb discussion a bit here...

We may have been talking about two different things. There is the discriminator name and there is the full URI it refers to. For the Penn TB tagset we could have penntb as the short discriminator name and http://vocab.lappsgrid.org/ns/tagset/pos#penntb as the URI. Clearly the last little bit of the URI does not need to repeat the fact that we have a pos, but for the discriminator name I would not want to use penntb, but rather something that allows for different PennTB tagsets (pos, categories...).

The following is what's now in the develop branch.

| discriminator | URI |
| --- | --- |
| tags-pos | http://vocab.lappsgrid.org/ns/tagset/pos |
| tags-pos-brown | http://vocab.lappsgrid.org/ns/tagset/pos#brown |
| tags-pos-penntb | http://vocab.lappsgrid.org/ns/tagset/pos#penntb |
| tags-pos-penntb-tt | http://vocab.lappsgrid.org/ns/tagset/pos#penntb-tt |
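The four rows above appear to follow a regular pattern: `tags-<layer>` maps to the layer URI, and anything after the layer becomes the URI fragment. A minimal sketch of that pattern follows. The function `discriminator_to_uri` is hypothetical, and the real mapping is presumably declared by hand in the discriminator definitions rather than computed like this.

```python
BASE = "http://vocab.lappsgrid.org/ns/tagset/"


def discriminator_to_uri(name):
    """Derive the URI for a name of the form tags-<layer>[-<tagset>].

    Hypothetical helper that only mirrors the pattern in the table.
    """
    # Split at most twice so that a tagset such as "penntb-tt" stays whole.
    prefix, layer, *rest = name.split("-", 2)
    if prefix != "tags":
        raise ValueError(f"not a tag set discriminator: {name}")
    uri = BASE + layer
    if rest:
        uri += "#" + rest[0]
    return uri


for name in ("tags-pos", "tags-pos-brown", "tags-pos-penntb", "tags-pos-penntb-tt"):
    print(name, discriminator_to_uri(name))
```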

This does not deal yet with Stanford having its own version(s) of PennTB tags. There was a proposal to use penntb (ptb actually, somewhat shorter) for the generic PennTB tags versus stanford-penntb. I would like to make a pitch to swap the latter and use penntb-stanford.

reckart commented 6 years ago

We have a list of tagsets with identifiers assigned opportunistically in DKPro Core. Those could give you an overview of the variations that we have encountered so far. It would also be nice if we could align our respective IDs in some way.

The IDs are actually tuples [layer, tagset], e.g. [pos, alpino-ixa] for the variant of the Alpino tagset used by the IXA tools for POS annotation.

marcverhagen commented 6 years ago

Yes, I have gone through that list and actually referred to some URLs in there. We have the layer upfront, then something that looks like your tagset, so we could easily have something like: pos-alpino-ixa. We haven't thought a lot about language though, which appears to be the third element of your tuple. My first hunch would be to put that in the description and not have it as part of the ID.
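As a rough sketch of that alignment, assuming the language goes into the description rather than the ID: the names `TagsetEntry` and `from_dkpro` are made up for illustration, and the language code `nl` for Alpino is only an example value.

```python
from typing import NamedTuple, Optional


class TagsetEntry(NamedTuple):
    """Hypothetical record for one tag set: an ID plus a free-text description."""
    id: str
    description: str


def from_dkpro(layer: str, tagset: str, language: Optional[str] = None) -> TagsetEntry:
    """Turn a DKPro-style [layer, tagset] tuple into a layer-first ID.

    The language, if given, is recorded in the description only.
    """
    description = f"{tagset} tag set for the {layer} layer"
    if language:
        description += f" (language: {language})"
    return TagsetEntry(f"{layer}-{tagset}", description)


entry = from_dkpro("pos", "alpino-ixa", language="nl")
print(entry.id)           # pos-alpino-ixa
print(entry.description)  # alpino-ixa tag set for the pos layer (language: nl)
```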

reckart commented 6 years ago

Nice :) and right wrt the language :)

I don't think we use the language right now for anything other than displaying it in the documentation, and I also believe that so far we did not encounter a tagset which has language-specific variations (but I would not be surprised if these exist). The models in DKPro Core use [tool, language, variant] as the identifier tuple, so I guess it simply seemed reasonable at the time to include the language in the tagset IDs as well.

reckart commented 6 years ago

> we did not encounter a tagset which has language-specific variations

We do currently consider the UD POS tagsets for each language to be separate even though they are actually the same. But I do believe that UD actually defines language-specific variations of some tags.

reckart commented 6 years ago

http://universaldependencies.org/ext-feat-index.html

marcverhagen commented 6 years ago

> http://universaldependencies.org/ext-feat-index.html

Yeah, not sure what, if anything, we want to do with a list like that. The features are the same, but many values occur in some languages only.

> We do currently consider the UD POS tagsets for each language to be separate even though they are actually the same

A better example for multiple language-specific tag sets with the same tagset identifier is Stein, where the French and Italian versions are different.

So we have a question to answer: do we want to use en as part of the name for all our currently proposed identifiers (for example tags-pos-en-penntb)? And if so, in what position?

I think not, actually. Those cases will likely not occur very often, and I would rather just add a language marker at the end if needed, roughly where we would put our version: tags-pos-penntb for the simple case, and tags-pos-stein-it and tags-pos-stein-fr for when we need the distinction.
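A small sketch of that rule, with the language marker appended only when it is needed to tell two variants apart. The function name `tagset_discriminator` is an assumption for illustration, not something defined in the vocabulary.

```python
def tagset_discriminator(layer, tagset, language=None):
    """Build tags-<layer>-<tagset>, adding -<language> only when one is given.

    Hypothetical helper illustrating the proposed naming rule.
    """
    parts = ["tags", layer, tagset]
    if language:
        parts.append(language)
    return "-".join(parts)


print(tagset_discriminator("pos", "penntb"))       # tags-pos-penntb
print(tagset_discriminator("pos", "stein", "it"))  # tags-pos-stein-it
print(tagset_discriminator("pos", "stein", "fr"))  # tags-pos-stein-fr
```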

lapps / vocabulary-pages

adding discriminators for different tagsets for POS tagging #60