This code accompanies the paper On Homophony and Rényi Entropy (Pimentel et al., EMNLP 2021). It is a study of the pressures of homophony in language, analysing homophony through the lens of the Rényi collision entropy.
Download the CELEX data and place the raw LDC96L14.tar.gz file in the data/celex/raw/ directory.
You can then extract its data with the command:
$ make get_celex
To install the dependencies, run:
$ conda env create -f environment.yml
Activate the created conda environment with the command:
$ source activate.sh
Then install the appropriate version of PyTorch. The first command targets CUDA 10.1; the commented-out line is the CPU-only alternative:
$ conda install pytorch torchvision cudatoolkit=10.1 -c pytorch
# $ conda install pytorch torchvision cpuonly -c pytorch
To preprocess a language's data, run:
$ make get_data MONOMORPHEMIC=True LANGUAGE=<language>
where <language> can be one of: eng (English), deu (German), or nld (Dutch).
To train a language's phonotactic model, run:
$ make train MONOMORPHEMIC=True LANGUAGE=<language> MODEL=<model>
where <model> can be one of: lstm or ngram.
There are three commands to evaluate the trained phonotactic models. The first evaluates a model on the test set to get its cross-entropy:
$ make eval MONOMORPHEMIC=True LANGUAGE=<language> MODEL=<model>
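As a rough illustration of what a test-set cross-entropy measures, here is a minimal sketch. It is not the repository's evaluation code: the function name `cross_entropy_bits`, the use of natural-log word probabilities as input, and the choice between per-word and per-symbol normalisation are all assumptions made for this example.

```python
import math


def cross_entropy_bits(word_log_probs, total_symbols=None):
    """Average negative log2-probability of held-out words.

    word_log_probs: natural-log probabilities the model assigns to each
        test word (hypothetical input format).
    total_symbols: if given, normalise by this count (e.g. number of
        phones) instead of by the number of words.
    """
    if not word_log_probs:
        raise ValueError("need at least one test word")
    # Convert the summed natural-log probability into bits.
    total_bits = -sum(word_log_probs) / math.log(2)
    denominator = total_symbols if total_symbols is not None else len(word_log_probs)
    return total_bits / denominator


# Toy example: two test words with model probabilities 0.5 and 0.25.
toy_log_probs = [math.log(0.5), math.log(0.25)]
per_word = cross_entropy_bits(toy_log_probs)
per_symbol = cross_entropy_bits(toy_log_probs, total_symbols=6)
```

A lower value means the model assigns higher probability to unseen words from the language.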
The second analyses all words with probability above a threshold delta to approximate the model's Rényi entropy:
$ make get_renyi MONOMORPHEMIC=True LANGUAGE=<language> MODEL=<model>
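To see why a probability threshold is enough for an approximation, the sketch below bounds the Rényi collision entropy, H_2 = -log2(sum_w p(w)^2), using only the words with p(w) > delta. Every dropped word has p(w) <= delta, so its squared probability is at most delta * p(w), and the dropped terms together contribute at most delta times the probability mass that was not enumerated. This is an illustrative sketch, not the repository's implementation; the function name and the choice of bits (log base 2) are assumptions.

```python
import math


def renyi_collision_entropy_bounds(probs, delta):
    """Return (lower, upper) bounds on H_2 in bits.

    probs: probabilities of the enumerated words (hypothetical input).
    delta: only words with probability strictly above delta are used.
    """
    kept = [p for p in probs if p > delta]
    if not kept:
        raise ValueError("no word has probability above delta")
    # Lower bound on the collision probability: sum over kept words only.
    collision_lower = sum(p * p for p in kept)
    # Upper bound: each dropped word adds at most delta * p(w).
    missing_mass = max(0.0, 1.0 - sum(kept))
    collision_upper = collision_lower + delta * missing_mass
    # -log2 is decreasing, so the bounds swap roles.
    return -math.log2(collision_upper), -math.log2(collision_lower)


# Toy distribution over four word forms.
toy_probs = [0.5, 0.25, 0.125, 0.125]
exact_h2 = -math.log2(sum(p * p for p in toy_probs))
lower, upper = renyi_collision_entropy_bounds(toy_probs, delta=0.2)
```

The gap between the two bounds shrinks as delta decreases, at the cost of enumerating more words.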
Finally, the third samples artificial lexica from the language models to run the null hypothesis test:
$ make sample_renyi MONOMORPHEMIC=True LANGUAGE=<language> MODEL=<model>
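The general idea of a sampling-based null hypothesis test can be sketched as follows: draw many artificial lexica of the same size as the real one from a model distribution, compute a homophony statistic on each, and check how often the sampled value is at least as extreme as the observed one. The code below is a toy illustration under several assumptions: the statistic (fraction of word pairs sharing a form), the one-sided direction of the test, and the names `collision_rate` and `monte_carlo_p_value` are all hypothetical and may differ from what the repository does.

```python
import random
from collections import Counter


def collision_rate(lexicon):
    """Fraction of unordered word pairs in the lexicon that share a form."""
    n = len(lexicon)
    if n < 2:
        raise ValueError("need at least two words")
    counts = Counter(lexicon)
    colliding_pairs = sum(c * (c - 1) // 2 for c in counts.values())
    return colliding_pairs / (n * (n - 1) // 2)


def monte_carlo_p_value(observed, forms, probs, n_samples=1000, seed=0):
    """One-sided Monte Carlo p-value for the observed collision rate.

    forms/probs: a toy categorical distribution standing in for a trained
    phonotactic model (an assumption made for this sketch).
    """
    rng = random.Random(seed)
    observed_stat = collision_rate(observed)
    at_least_as_extreme = 0
    for _ in range(n_samples):
        # Sample an artificial lexicon with as many words as the real one.
        sampled = rng.choices(forms, weights=probs, k=len(observed))
        if collision_rate(sampled) >= observed_stat:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate strictly positive.
    return (at_least_as_extreme + 1) / (n_samples + 1)


# Toy example: an observed lexicon with no homophones at all.
toy_forms = ["a", "b", "c"]
toy_p = monte_carlo_p_value(["a", "b", "c"], toy_forms, [1 / 3] * 3, n_samples=50)
```

Testing in the opposite direction (less homophony than expected under the model) only requires flipping the comparison inside the loop.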
To analyse the models and print the results, run:
$ make analyse MONOMORPHEMIC=True LANGUAGE=<language>
If this code or the paper was useful to you, consider citing it:
@inproceedings{pimentel-etal-2021-homophony,
title = "On Homophony and Rényi Entropy",
author = "Pimentel, Tiago and
Meister, Clara and
Teufel, Simone and
Cotterell, Ryan",
booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
year = "2021",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/2109.13766",
}
To ask questions or report problems, please open an issue.