Closed marcverhagen closed 6 years ago
Hmmm, I am now second-guessing the naming of the discriminators. I used `brown` and `penn-tb`, and they would point to http://vocab.lappsgrid.org/ns/tagset/pos#brown and http://vocab.lappsgrid.org/ns/tagset/pos#penntb (don't bother clicking the links, the pages do not exist yet).
But I think that when it comes to adding all these tagsets for different kinds of applications, they should really be named `pos-brown` and `pos-penntb`. Later we may need to add discriminators for tools that have their own tag sets for each application. Let's say for the sake of argument that Stanford is one of them; we could have `pos-stanford`, `ner-stanford`, etcetera.
... and even different versions of a tool / model can come with different tagsets ;)
Yes, so we need to be prepared to have `pos-penntb-tt-v1` and `pos-penntb-tt-v2` for two different treebank-inspired tagsets for the TreeTagger (I am using this example because I vaguely remember that at some point about 5-10 years ago the tags produced by the TreeTagger changed).
Why did you delete the download links to the binaries for the various tools from the README? Users shouldn't be required to build three different projects from source simply to generate the vocabulary pages.
To address the other points raised:
We should agree on a common naming convention so we can safely generate set names without much fear of name collisions.
Something like `<vendor>-<family>-<version>`
I think having generic names for the most common tag sets is also a good idea as some tools are not that specific about what version of a tag set they are using, and some tag sets are not properly versioned, e.g. I think the Stanford tools make their own proprietary changes to the Penn TB tag set.
Examples
Why did I delete those links? Good question. It was either a side effect of many cuts and pastes or, more likely, because the links were redundant: the repository itself includes the jars that are downloaded. If we have both the links and the jars in the repository, the jars can get out of sync; each time a jar is updated on the download side it needs to be updated in the repository, and vice versa.
To address the other points raised:
- We already have pos in the URL path, I don't think it needs to be repeated in the tag set name.
- We should agree on a common naming convention so we can safely generate set names without much fear of name collisions.
Something like `vendor-family-version`
Yes, `pos` is in the path, and yes, we need a convention. I still think it is attractive to use `pos-penntb` instead of `penntb`, since we may also need to add `cat-penntb` for the categories, and I prefer having the pair `pos-penntb` / `cat-penntb` over the pair `penntb` / `cat-penntb`.
We already have pos in the URL path, I don't think it needs to be repeated in the tag set name.
I'm with @ksuderman on this.
we may also need to add `cat-penntb` for the categories and I prefer having the pair `pos-penntb` / `cat-penntb` over the pair `penntb` / `cat-penntb`.
I believe we'll have a pair of (`http://../pos#penntb` - `http://.../cat#penntb`), at least in machine-generated `LIF`s.
By the way, `cat` is for labels of non-terminal constituents in ps-parsing, right?
Hmmm, I appear to be outvoted so how I had it originally may be fine. We should still talk about the naming convention when we get a chance.
Yes, `cat` is for non-terminals.
AFAIK the repositories do not contain the downloadable jars; the releases page contains zip files and tarballs for the entire repo, not the jar files. The links I added point to the package-latest.tgz files on downloads.lappsgrid.org, so they should always retrieve the latest executable, as those files get deployed as part of the release process.
I think @marcverhagen was referring to the binaries in https://github.com/lapps/vocabulary-pages/tree/master/bin . I agree with @marcverhagen that, the way things are now, they could very easily get out of sync. Unless we update this repository every time `ghc`, `vocab-dsl`, or `discriminator-dsl` gets an update, I think we should remove the `bin` directory and stay with the links to `downloads.lappsgrid.org`.
Right, I forgot about the `bin` directory. However, I don't think it is a bad idea to have both, and there shouldn't be too much danger of things getting out of sync. I just checked the three projects, and each of them will upload the latest tarballs to http://www.anc.org/downloads and then generate a pull request here with the latest .jar file.
On reflection I think I'll revert my commit and merge this PR as it was after Marc's last commit.
Trying to make sense of tag sets and reviving the `pos-penntb` discussion a bit here...
We may have been talking about two different things. There is the discriminator name and the full URI it refers to. For the Penn TB tagset we could have `penntb` as the short discriminator name and `http://vocab.lappsgrid.org/ns/tagset/pos#penntb` as the URI. Clearly the last bit of the URI does not need to repeat the fact that we have a `pos`, but I would not want to use `penntb` for the discriminator name; I would rather use something that allows for different PennTB tagsets (pos, categories...).
The following is what's now in the develop branch.
discriminator | URI |
---|---|
tags-pos | http://vocab.lappsgrid.org/ns/tagset/pos |
tags-pos-brown | http://vocab.lappsgrid.org/ns/tagset/pos#brown |
tags-pos-penntb | http://vocab.lappsgrid.org/ns/tagset/pos#penntb |
tags-pos-penntb-tt | http://vocab.lappsgrid.org/ns/tagset/pos#penntb-tt |
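The four rows above appear to follow one rule: drop the `tags` prefix, use the next segment as the layer in the URL path, and put whatever remains after the `#`. A minimal Python sketch of that rule, assuming it holds for future entries as well; the function name `discriminator_to_uri` and the constant `BASE` are hypothetical and not part of any LAPPS tool:

```python
# Base of the tagset URIs, taken from the table above.
BASE = "http://vocab.lappsgrid.org/ns/tagset"


def discriminator_to_uri(name: str) -> str:
    """Derive the URI for a discriminator such as 'tags-pos-penntb-tt'.

    Assumed rule (inferred from the table, not an official spec):
    'tags-<layer>[-<tagset>]' maps to '<BASE>/<layer>[#<tagset>]'.
    """
    parts = name.split("-")
    if len(parts) < 2 or parts[0] != "tags":
        raise ValueError(f"not a tagset discriminator: {name!r}")
    layer = parts[1]
    # Everything after the layer is the tagset name; it may itself contain hyphens.
    tagset = "-".join(parts[2:])
    uri = f"{BASE}/{layer}"
    return f"{uri}#{tagset}" if tagset else uri


if __name__ == "__main__":
    for d in ["tags-pos", "tags-pos-brown", "tags-pos-penntb", "tags-pos-penntb-tt"]:
        print(d, "|", discriminator_to_uri(d))
```

Because the layer is always the second segment, the same rule would also yield a `cat#penntb` URI for a hypothetical `tags-cat-penntb` discriminator.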
This does not yet deal with Stanford having its own version(s) of PennTB tags. There was a proposal to use `penntb` (`ptb` actually, somewhat shorter) for the generic PennTB tags versus `stanford-penntb`. I would like to make a pitch to swap the latter and use `penntb-stanford`.
We have a list of tagsets with identifiers assigned opportunistically in DKPro Core. Those could give you an overview of the variations that we have encountered so far. It would also be nice if we could align our respective IDs in some way.
The IDs are actually tuples `[layer, tagset]`, e.g. `pos, alpino-ixa` for the variant of the Alpino tagset used by the IXA tools for POS annotation.
Yes, I have gone through that list and actually referred to some URLs in there. We have the layer up front, then something that looks like your tagset, so we could easily have something like `pos-alpino-ixa`. We haven't thought a lot about language though, which appears to be the third element of your tuple. My first hunch would be to put that in the description and not have it as part of the ID.
Nice :) and right wrt the language :)
I don't think we use the language right now for anything other than displaying it in the documentation, and I also believe that so far we did not encounter a tagset which has language-specific variations (but I would not be surprised if these exist). The models in DKPro Core use `[tool, language, variant]` as the identifier tuple, so I guess it simply seemed reasonable at the time to include the language in the tagset IDs as well.
we did not encounter a tagset which has language-specific variations
We do currently consider the UD POS tagsets for each language to be separate even though they are actually the same. But I do believe that UD actually defines language-specific variations of some tags.
Yeah, not sure what, if anything, we want to do with a list like that. The features are the same, but many values occur in some languages only.
We do currently consider the UD POS tagsets for each language to be separate even though they are actually the same
A better example for multiple language-specific tag sets with the same tagset identifier is Stein, where the French and Italian versions are different.
So we have a question to answer: do we want to use `en` as part of the name for all our currently proposed identifiers (for example `tags-pos-en-penntb`)? And if so, in what position?
I think not, actually. Those cases will likely not occur very often, and I would rather just add a language marker at the end if needed, roughly where we would put our version: `tags-pos-penntb` for the simple case and `tags-pos-stein-it` and `tags-pos-stein-fr` for when we need the distinction.
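To make the proposal concrete, here is a rough Python sketch that builds identifiers from a DKPro-style `[layer, tagset]` pair plus an optional trailing language marker. The function name `tagset_discriminator` and its parameters are hypothetical, and nothing here has been agreed as the convention:

```python
from typing import Optional


def tagset_discriminator(layer: str, tagset: str, language: Optional[str] = None) -> str:
    """Build 'tags-<layer>-<tagset>' with an optional '-<language>' suffix.

    The language marker is appended only when a tagset really has
    language-specific variants (e.g. Stein for French and Italian).
    """
    parts = ["tags", layer, tagset]
    if language:
        parts.append(language)
    return "-".join(parts)


if __name__ == "__main__":
    print(tagset_discriminator("pos", "penntb"))       # simple case, no language
    print(tagset_discriminator("pos", "stein", "it"))  # Italian variant of Stein
    print(tagset_discriminator("pos", "stein", "fr"))  # French variant of Stein
```

One caveat with this scheme: since tagset names can contain hyphens themselves (`penntb-tt`, `alpino-ixa`), a trailing language code cannot be recovered from the name alone, so the language would still need to be recorded in the description or metadata.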
The core here is that some POS tag sets were added to the metadata; the rest is bookkeeping.