uhh-lt / triframes

Unsupervised Semantic Frame Induction using Triclustering
https://doi.org/10.1162/COLI_a_00354
MIT License
clustering natural-language-processing semantic-frames semantics word-embeddings

Triframes: Unsupervised Semantic Frame Induction using Triclustering

We use dependency triples automatically extracted from a Web-scale corpus to perform unsupervised semantic frame induction. We cast the frame induction problem as a triclustering problem, a generalization of clustering to triadic data. Our replicable benchmarks demonstrate that the proposed graph-based approach, Triframes, shows state-of-the-art results on this task on a FrameNet-derived dataset and performs on par with competitive methods on a verb class clustering task.

Prerequisites

On macOS, developer tools must be installed first using xcode-select --install.

Running Triframes

Triframes takes a set of dependency triples as input and outputs a set of triframes. The data is processed in two steps. First, a word embedding model is used to create a triple graph. Then, a fuzzy graph clustering algorithm, Watset, is used to extract triple communities representing triframes.
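The first step can be illustrated with a small Python sketch. It is not the code shipped in this repository: it assumes that each triple is embedded by concatenating the vectors of its verb, subject, and object, and that two triples are connected when the cosine similarity of their embeddings reaches a threshold. The toy vectors, the function names, and the min_similarity parameter are all hypothetical. The clustering step (Watset) is left out.

```python
import math
from itertools import combinations

# Toy 2-dimensional "embeddings" (made up for illustration);
# a real run would use a trained word embedding model instead.
TOY_VECTORS = {
    "eat": [1.0, 0.0], "devour": [0.9, 0.1], "drive": [0.0, 1.0],
    "cat": [0.8, 0.2], "dog": [0.7, 0.3], "man": [0.1, 0.9],
    "fish": [0.9, 0.0], "meat": [0.8, 0.1], "car": [0.0, 0.8],
}


def embed_triple(triple, vectors):
    """Concatenate the verb, subject, and object vectors (an assumption)."""
    return [x for word in triple for x in vectors[word]]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_triple_graph(triples, vectors, min_similarity=0.9):
    """Return weighted edges between triples with similar embeddings."""
    embedded = {t: embed_triple(t, vectors) for t in triples}
    edges = {}
    for a, b in combinations(triples, 2):
        similarity = cosine(embedded[a], embedded[b])
        if similarity >= min_similarity:
            edges[(a, b)] = similarity
    return edges


triples = [
    ("eat", "cat", "fish"),
    ("devour", "dog", "meat"),
    ("drive", "man", "car"),
]
graph = build_triple_graph(triples, TOY_VECTORS)
```

In this toy example, only the two eating-related triples end up connected. A graph clustering algorithm would then group such connected triples into a single community.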

The input data used in our experiments can be obtained using the make data command. Our default input file, vso-1.3m-pruned-strict.csv, has four fields: verb, subject, object, weight. Loading the whole file can take a lot of memory, so our scripts support specifying a threshold using the WEIGHT environment variable. In our experiments, it is set to zero.
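As a rough illustration of the weight threshold, the following Python sketch reads four-field rows and keeps those whose weight reaches the threshold. Several details are assumptions rather than facts about our scripts: the tab delimiter, the absence of a header row, the inclusive comparison, and the names read_triples and threshold_from_env.

```python
import csv
import io
import os


def threshold_from_env(environ=os.environ):
    """Read the WEIGHT variable; zero keeps every triple."""
    return float(environ.get("WEIGHT", "0"))


def read_triples(lines, min_weight=0.0, delimiter="\t"):
    """Yield (verb, subject, object, weight) rows with weight >= min_weight.

    The delimiter and the inclusive comparison are assumptions.
    """
    for row in csv.reader(lines, delimiter=delimiter):
        if len(row) != 4:
            continue  # skip malformed rows
        verb, subject, obj, weight = row
        weight = float(weight)
        if weight >= min_weight:
            yield verb, subject, obj, weight


# Two made-up rows standing in for the real input file.
sample = io.StringIO("eat\tcat\tfish\t12.5\ndrive\tman\tcar\t3.0\n")
kept = list(read_triples(sample, min_weight=5.0))
```

Because read_triples is a generator, rows below the threshold are dropped while streaming and never held in memory, which is the point of filtering by weight.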

Since Triframes uses word embeddings, you will need a pre-trained model. In our experiments, we used the standard Google News embeddings. If you do not have them, you can download them using make GoogleNews-vectors-negative300.bin. There are two ways to specify which word embeddings Triframes should use:

  1. Passing the W2V=/path/to/embeddings.bin environment variable to each make invocation.
  2. Serving the word vectors via Word2Vec-Pyro4. This requires passing the PYRO=PYRO:…@…:9090 environment variable to each make invocation. It is much faster than loading the Word2Vec data on every run.

If neither variable is set, Triframes falls back to PYRO=PYRO:w2v@localhost:9090.
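The selection logic described above can be sketched in Python as follows. This is an illustration only: the function name resolve_embeddings is hypothetical, and giving W2V precedence when both variables are set is an assumption that the text above does not confirm.

```python
import os

# Fallback URI stated in this README.
DEFAULT_PYRO_URI = "PYRO:w2v@localhost:9090"


def resolve_embeddings(environ=None):
    """Return ("file", path) or ("pyro", uri) for the embedding source.

    Assumption: W2V wins if both W2V and PYRO are set.
    """
    environ = os.environ if environ is None else environ
    if environ.get("W2V"):
        return ("file", environ["W2V"])
    return ("pyro", environ.get("PYRO") or DEFAULT_PYRO_URI)
```

For example, resolve_embeddings({}) returns ("pyro", "PYRO:w2v@localhost:9090"), while resolve_embeddings({"W2V": "/path/to/embeddings.bin"}) returns ("file", "/path/to/embeddings.bin").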

Triframes with Watset

Triframes Chinese Whispers

Running Baselines

Extraction of the evaluation dataset based on the sentences annotated with FrameNet roles

Assuming the commands are launched on the ltcpu3 server:

$ ./fi/eval/extract_xml_framenet_roles.py /home/panchenko/verbs/frames/framenet/fndata-1.7/lu_fulltext/ fi/eval/roles-xml2.csv
$ ./fi/eval/extract_conll_framenet_roles.py /home/panchenko/verbs/frames/parsed_framenet-1.0/collapsed_dependencies/lu_fulltext_merged/ fi/eval/output/ > fi/eval/output/paths.txt

Downloads

Our data are available to download on the Releases page.

Citation

@article{Ustalov:19:cl,
  author    = {Ustalov, Dmitry and Panchenko, Alexander and Biemann, Chris and Ponzetto, Simone Paolo},
  title     = {{Watset: Local-Global Graph Clustering with Applications in Sense and Frame Induction}},
  journal   = {Computational Linguistics},
  year      = {2019},
  volume    = {45},
  number    = {3},
  pages     = {423--479},
  doi       = {10.1162/COLI_a_00354},
  publisher = {MIT Press},
  issn      = {0891-2017},
  language  = {english},
}