We use dependency triples automatically extracted from a Web-scale corpus to perform unsupervised semantic frame induction. We cast frame induction as a triclustering problem, a generalization of clustering to triadic data. Our replicable benchmarks demonstrate that the proposed graph-based approach, Triframes, shows state-of-the-art results on a FrameNet-derived dataset and performs on par with competitive methods on a verb class clustering task.
- Faiss (in Anaconda: `conda install faiss-cpu -c pytorch`)
- gensim (in Anaconda: `conda install gensim`)
- Watset (`make watset.jar`)
- chinese-whispers (`pip install chinese-whispers`)

On macOS, developer tools must be installed first using `xcode-select --install`.
Triframes takes a set of dependency triples as input and outputs a set of triframes. The data is processed in two steps: first, a word embedding model is used to construct a triple graph; then, a fuzzy graph clustering algorithm, Watset, extracts triple communities that represent triframes.
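The two-step pipeline can be sketched on toy data as follows. This is an illustrative stand-in, not the project's code: the tiny word vectors are made up, and a simple Chinese-Whispers-style label propagation stands in for Watset's fuzzy clustering.

```python
# Toy sketch of the pipeline: embed each triple by concatenating the
# vectors of its verb, subject, and object; link similar triples in a
# graph; propagate labels to extract triple communities ("triframes").
import math
import random

# Hypothetical toy word vectors; real runs use Google News embeddings.
vectors = {
    'eat':   [1.0, 0.0], 'devour': [0.9, 0.1],
    'drive': [0.0, 1.0], 'steer':  [0.1, 0.9],
    'cat':   [1.0, 0.2], 'dog':    [0.9, 0.3],
    'man':   [0.2, 1.0], 'woman':  [0.3, 0.9],
    'fish':  [1.0, 0.1], 'meat':   [0.8, 0.2],
    'car':   [0.1, 1.0], 'truck':  [0.2, 0.8],
}

def embed(triple):
    """Concatenate the verb, subject, and object vectors."""
    verb, subj, obj = triple
    return vectors[verb] + vectors[subj] + vectors[obj]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

triples = [
    ('eat', 'cat', 'fish'), ('devour', 'dog', 'meat'),
    ('drive', 'man', 'car'), ('steer', 'woman', 'truck'),
]

# Step 1: link triples whose embeddings are sufficiently similar.
threshold = 0.95
graph = {t: set() for t in triples}
for i, s in enumerate(triples):
    for t in triples[i + 1:]:
        if cosine(embed(s), embed(t)) >= threshold:
            graph[s].add(t)
            graph[t].add(s)

# Step 2: hard label propagation (a crude stand-in for Watset).
random.seed(0)
labels = {t: i for i, t in enumerate(triples)}
for _ in range(10):
    order = list(triples)
    random.shuffle(order)
    for node in order:
        if graph[node]:
            counts = {}
            for neighbor in graph[node]:
                counts[labels[neighbor]] = counts.get(labels[neighbor], 0) + 1
            labels[node] = max(counts, key=counts.get)
```

After propagation, the two eating triples share one label and the two driving triples share another, i.e. two toy triframes emerge.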
The input data used in our experiments can be obtained using the `make data` command. Our default input file, `vso-1.3m-pruned-strict.csv`, has four fields: verb, subject, object, weight. Loading the whole file can take a lot of memory, so our scripts support specifying a weight threshold via the `WEIGHT` environment variable. In our experiments, it is set to zero.
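The thresholding can be sketched like this. It is an assumption about the scripts' behavior, not their exact code, and the two-row sample input is made up; it only mirrors the four-field format.

```python
# Sketch: keep only triples whose weight reaches the WEIGHT threshold,
# so low-weight rows are dropped while streaming instead of being held
# in memory. Comma-separated input is assumed here.
import csv
import io
import os

def filter_triples(lines, threshold=0.0):
    """Yield (verb, subject, object, weight) rows at or above the threshold."""
    for verb, subj, obj, weight in csv.reader(lines):
        weight = float(weight)
        if weight >= threshold:
            yield verb, subj, obj, weight

# Hypothetical toy input with the same four fields as the real file.
data = "eat,cat,fish,12000\ndrive,man,car,500\n"
kept = list(filter_triples(io.StringIO(data),
                           threshold=float(os.environ.get('WEIGHT', 0))))
```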
Since Triframes uses word embeddings, a pre-trained model needs to be downloaded. In our experiments, we used the standard Google News embeddings; if you do not have them, they can be downloaded using `make GoogleNews-vectors-negative300.bin`. There are two ways of specifying which word embeddings Triframes should use:

- pass the `W2V=/path/to/embeddings.bin` environment variable to each `make` invocation;
- pass the `PYRO=PYRO:…@…:9090` environment variable to each `make` invocation; this is much faster than loading the Word2Vec data on every run.

If neither is set, Triframes falls back to `PYRO=PYRO:w2v@localhost:9090`.
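The selection logic can be summarized in a few lines. This is an assumption about how the scripts resolve the two environment variables, not their exact code; `embedding_source` is a hypothetical helper name.

```python
# Sketch of embedding-source resolution: W2V points at a local binary
# model, PYRO at a Pyro URI, and PYRO:w2v@localhost:9090 is the
# documented fallback when neither variable is set.
def embedding_source(env):
    if 'W2V' in env:
        return ('w2v', env['W2V'])
    return ('pyro', env.get('PYRO', 'PYRO:w2v@localhost:9090'))
```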
```shell
WEIGHT=10000 VSO=vso-1.3m-pruned-strict.csv make triw2v-watset.txt
WEIGHT=10000 VSO=vso-1.3m-pruned-strict.csv make triw2v.txt
WEIGHT=10000 K=10 make trikmeans.txt
WEIGHT=10000 K=10 make trispectral.txt
```
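The role of the `K` parameter in the k-means and spectral baselines can be illustrated with a minimal k-means over toy triple vectors. This is an illustrative sketch, not the project's clustering code; the four two-dimensional points are made up and stand in for embedded verb-subject-object triples, with `K=2` instead of 10.

```python
# Minimal k-means: assign each point to its nearest center, then move
# each center to the mean of its members, and repeat.
import math
import random

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def mean(members):
    return [sum(xs) / len(members) for xs in zip(*members)]

def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    assignment = [0] * len(points)
    for _ in range(iters):
        assignment = [min(range(k), key=lambda c, p=p: dist(p, centers[c]))
                      for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the previous center if a cluster empties
                centers[c] = mean(members)
    return assignment

# Four toy "triple vectors" forming two obvious groups.
points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.1]]
assignment = kmeans(points, k=2)
```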
```shell
make depcc-common-triples-full-singleton.tsv
make depcc-common-triples-full-whole.tsv
```
Assuming the commands are launched on the ltcpu3 server:

```shell
$ ./fi/eval/extract_xml_framenet_roles.py /home/panchenko/verbs/frames/framenet/fndata-1.7/lu_fulltext/ fi/eval/roles-xml2.csv
$ ./fi/eval/extract_conll_framenet_roles.py /home/panchenko/verbs/frames/parsed_framenet-1.0/collapsed_dependencies/lu_fulltext_merged/ fi/eval/output/ > fi/eval/output/paths.txt
```
Our data are available for download on the Releases page.
```bibtex
@article{Ustalov:19:cl,
  author    = {Ustalov, Dmitry and Panchenko, Alexander and Biemann, Chris and Ponzetto, Simone Paolo},
  title     = {{Watset: Local-Global Graph Clustering with Applications in Sense and Frame Induction}},
  journal   = {Computational Linguistics},
  year      = {2019},
  volume    = {45},
  number    = {3},
  pages     = {423--479},
  doi       = {10.1162/COLI_a_00354},
  publisher = {MIT Press},
  issn      = {0891-2017},
  language  = {english},
}
```