vered1986 / HypeNET

Integrated path-based and distributional method for hypernymy detection

HypeNET: Integrated Path-based and Distributional Method for Hypernymy Detection

This is the code used in the paper:

"Improving Hypernymy Detection with an Integrated Path-based and Distributional Method"
Vered Shwartz, Yoav Goldberg and Ido Dagan. ACL 2016. link

It is used to classify hypernymy relations between term pairs, combining distributional information about each term with path-based information encoded using an LSTM.
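The integrated representation described above concatenates the distributional vectors of the two terms with an averaged encoding of the dependency paths connecting them. The sketch below illustrates this with NumPy, using arbitrary toy dimensions and random vectors in place of the LSTM path encoder; the names and shapes are illustrative, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM, PATH_DIM = 50, 60  # toy dimensions, not the paper's settings

def encode_paths(path_vectors, counts):
    # Average of the (here: stand-in) LSTM-encoded path vectors,
    # weighted by how often each path connects the term pair in the corpus.
    counts = np.asarray(counts, dtype=float)
    return (counts[:, None] * path_vectors).sum(axis=0) / counts.sum()

def integrated_features(x_vec, y_vec, path_vectors, counts):
    # Integrated representation: [v_x ; averaged path encoding ; v_y]
    return np.concatenate([x_vec, encode_paths(path_vectors, counts), y_vec])

x_vec = rng.standard_normal(EMB_DIM)
y_vec = rng.standard_normal(EMB_DIM)
paths = rng.standard_normal((3, PATH_DIM))  # 3 paths between the term pair
feats = integrated_features(x_vec, y_vec, paths, counts=[5, 2, 1])
print(feats.shape)  # (160,)
```

The concatenated vector would then be fed to the classifier; in the real model the path vectors come from the LSTM rather than a random generator.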


Version 2:

Major features and improvements:

Bug fixes:

To reproduce the results reported in the paper, please use V1. The current version achieves similar results; the integrated model's performance on the randomly split dataset is: Precision: 0.918, Recall: 0.907, F1: 0.912.


Consider using our new project, LexNET! It supports classification of multiple semantic relations, and contains several model enhancements and detailed documentation.


Prerequisites:

Quick Start:

The repository contains the following directories:

To create a processed corpus, download a Wikipedia dump, and run:

bash create_resource_from_corpus.sh [wiki_dump_file] [resource_prefix]

Where resource_prefix is the file path and prefix of the corpus files (e.g. corpus/wiki), such that the corpus directory will eventually contain the wiki_*.db files created by this script.
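At the heart of corpus processing is extracting the dependency path connecting each term pair in a sentence. The actual script relies on a full dependency parser; the toy sketch below only shows the path-finding step, using breadth-first search over a hand-built dependency graph (the example sentence and edge labels are hypothetical).

```python
from collections import deque

def shortest_dependency_path(edges, source, target):
    """BFS over an undirected dependency graph, given as
    (head, dependent, label) triples; returns the token sequence
    connecting source to target, or None if they are unconnected."""
    graph = {}
    for head, dep, _label in edges:
        graph.setdefault(head, []).append(dep)
        graph.setdefault(dep, []).append(head)
    queue, parents = deque([source]), {source: None}
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:  # walk back to the source
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nbr in graph.get(node, []):
            if nbr not in parents:
                parents[nbr] = node
                queue.append(nbr)
    return None

# Toy parse of "parrot is a bird": path between the pair (parrot, bird)
edges = [("is", "parrot", "nsubj"), ("is", "bird", "attr"), ("bird", "a", "det")]
print(shortest_dependency_path(edges, "parrot", "bird"))  # ['parrot', 'is', 'bird']
```

In the real pipeline these paths (with their edge labels and directions) are what get stored in the wiki_*.db files and later fed to the LSTM.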

To train the integrated model, run:

python train_integrated.py [resource_prefix] [dataset_prefix] [model_prefix_file] [embeddings_file] [alpha] [word_dropout_rate]

Where:

Similarly, you can train the path-based model with train_path_based.py, or test either pre-trained model with test_integrated.py and test_path_based.py, respectively.
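The word_dropout_rate argument above controls a regularization step in which input words are randomly replaced by an unknown symbol during training, so the model cannot simply memorize specific lemmas. A minimal sketch of this idea (the function name and unknown token are illustrative, not taken from the repository):

```python
import random

def word_dropout(tokens, rate, unk="<unk>", seed=None):
    # Replace each token with the unknown symbol with probability `rate`.
    # Applied only at training time; the seed is just for reproducibility.
    rng = random.Random(seed)
    return [unk if rng.random() < rate else t for t in tokens]

print(word_dropout(["X", "is", "a", "type", "of", "Y"], rate=0.5, seed=3))
```

A rate of 0 leaves the input unchanged, while higher rates push the model to rely more on the path structure than on individual words.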