own-pt / glosstag

Semantically Tagged PWN glosses

# -*- mode:org -*-

#+TITLE: WordNet Gloss Corpus

This is a continuation of the project originally developed at Princeton, whose original files can be found at [[http://wordnetcode.princeton.edu/glosstag.shtml][this link]] or in the [[https://github.com/own-pt/glosstag/tree/princeton][princeton branch]] of this repository.

See the [[https://github.com/own-pt/glosstag/releases][releases page]] for the official releases.

This repository hosts a [[https://dl.acm.org/citation.cfm?id=1075742][semantic concordance]] -- a textual corpus (the WordNet glosses) and a lexicon (WordNet) where every content word in the corpus is linked to its sense(s) in the lexicon.

There are several corpora with WordNet sense annotations (SemCor, the Senseval datasets), but only by sense-tagging WordNet itself can we guarantee its definitional completeness -- the property that all of its definitions use only words that are themselves defined by WordNet.

We are annotating the corpus using [[http://github.com/own-pt/sensetion.el/][this tool]], which depends on the gloss corpus in the format available in this repository. Details about the format itself can be found in the [[https://github.com/own-pt/sensetion.el/][tool's repository]].

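The annotation file is split into chunks of 1000 lines each, and every chunk gets a =.jl= extension:

#+begin_src bash
cd data
# -l 1000: 1000 lines per chunk; -a 2: two-letter suffixes (annotation-aa, annotation-ab, ...)
split -l 1000 -a 2 annotation.json annotation-
for f in annotation-??; do mv "$f" "$f.jl"; done
#+end_src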
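
After splitting, it can be useful to confirm that the chunks still add up to the original file. The following is a minimal sketch, not part of the repository: =check_split= is a hypothetical helper name, and it assumes the chunk names produced by the commands above (=annotation-aa.jl=, =annotation-ab.jl=, ...).

#+begin_src bash
# Hypothetical helper: succeed only if the chunks, concatenated in
# name order, are byte-for-byte identical to the original file.
# Usage: check_split annotation.json annotation-
check_split() {
  local original=$1 prefix=$2
  # The glob expands in sorted order, which matches the order in
  # which split names its chunks (aa, ab, ac, ...).
  cat "$prefix"??.jl | cmp -s - "$original"
}
#+end_src

Run from the =data= directory, =check_split annotation.json annotation- && echo ok= prints =ok= only when nothing was lost or reordered by the split.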

Number of tokens per kind:

#+begin_src bash
for f in *.jl; do jq -r ".tokens | .[] | .kind | .[0]" "$f"; done | sort | uniq -c
#+end_src

Extracting word forms:

#+begin_src bash
for f in *.new; do jq -r ".tokens | .[] | .form" "$f"; done | sort | uniq -c | sort -nr >> ../words.txt
#+end_src

** globs

| sense tagged     | 53212 | 0.94 |
| not sense tagged |  3350 | 0.06 |
| total taggable   | 56562 | 1.00 |
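
Each proportion in the table above is a count divided by the total taggable count. As a rough sketch of that arithmetic (=proportions= is a hypothetical helper, not a script from this repository), the figures can be recomputed with awk:

#+begin_src bash
# Hypothetical helper: print each count with its share of the total
# (two decimals), followed by a final line for the total itself.
# Usage: proportions 53212 3350
proportions() {
  printf '%s\n' "$@" | awk '
    { count[NR] = $1; total += $1 }
    END {
      if (total == 0) exit          # nothing to divide by
      for (i = 1; i <= NR; i++)
        printf "%d %.2f\n", count[i], count[i] / total
      printf "%d %.2f\n", total, 1
    }'
}
#+end_src

Calling =proportions 53212 3350= with the sense tagged and not sense tagged counts gives back the 0.94 and 0.06 shares and the total of 56562.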

Among the sense-tagged ones (the last two lines are errors):

| 39864 | auto |
| 13348 | man  |

** word forms

| sense tagged     | 448091 | 0.56 |
| not sense tagged | 334068 | 0.43 |
| total taggable   | 792281 | 1.00 |

Among the sense-tagged tokens:

| 126940 | auto |
| 321151 | man  |