rycolab / homophony-as-renyi-entropy

This code accompanies the paper "On Homophony and Rényi Entropy" published in EMNLP 2021.
MIT License



This code accompanies the paper "On Homophony and Rényi Entropy" (Pimentel et al., EMNLP 2021). The paper studies the pressures of homophony in language, analysing homophony through the lens of the Rényi collision entropy.
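For readers unfamiliar with the quantity, the sketch below computes the Rényi entropy of order alpha for a discrete distribution, using the standard definition H_alpha(p) = log(sum_x p(x)^alpha) / (1 - alpha). At alpha = 2 this is the collision entropy, and exp(-H_2) is the probability that two independent draws coincide. The function name and the toy probabilities are illustrative assumptions and are not taken from this repository.

```python
import math


def renyi_entropy(probs, alpha=2.0):
    """Rényi entropy (in nats) of a discrete distribution.

    `probs` is an iterable of probabilities summing to one.
    alpha = 2 gives the collision entropy; alpha = 1 is handled
    as the Shannon limit.
    """
    probs = [p for p in probs if p > 0]
    if abs(alpha - 1.0) < 1e-12:
        # The limit alpha -> 1 recovers the Shannon entropy.
        return -sum(p * math.log(p) for p in probs)
    return math.log(sum(p ** alpha for p in probs)) / (1.0 - alpha)


# Toy distribution over four word forms (hypothetical values).
toy = [0.5, 0.25, 0.125, 0.125]
collision_entropy = renyi_entropy(toy, alpha=2.0)

# Probability that two independent draws yield the same form.
collision_probability = math.exp(-collision_entropy)
```

For a uniform distribution over n forms, the collision entropy equals log(n), the same as the Shannon entropy; for skewed distributions it is smaller.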


Download the CELEX data and place the raw LDC96L14.tar.gz file in the data/celex/raw/ directory. You can then extract the data with:

$ make get_celex


To install the dependencies, run:

$ conda env create -f environment.yml

Activate the newly created conda environment with:

$ source activate.sh

Finally, install the appropriate version of PyTorch. The first command below targets CUDA 10.1; the commented-out line is the CPU-only alternative:

$ conda install pytorch torchvision cudatoolkit=10.1 -c pytorch
# $ conda install pytorch torchvision cpuonly -c pytorch

Preprocess data

To preprocess a language's data, run:

$ make get_data MONOMORPHEMIC=True LANGUAGE=<language>

where <language> can be one of: eng (English), deu (German), or nld (Dutch).

Train models

To train a language's phonotactic model, run:

$ make train MONOMORPHEMIC=True LANGUAGE=<language> MODEL=<model>

where <model> can be either lstm or ngram.

Evaluate models

There are three commands to evaluate a trained phonotactic model. The first evaluates the model on the test set to get its cross-entropy:

$ make eval MONOMORPHEMIC=True LANGUAGE=<language> MODEL=<model>
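As a rough illustration of what a test-set cross-entropy measures, the sketch below averages the negative log-probability that a model assigns to held-out words. It assumes per-word normalisation and uses made-up probabilities; the repository may normalise differently (for example per phoneme), and the function name is hypothetical.

```python
import math


def cross_entropy(word_log_probs):
    """Average negative log-probability (nats per word) over held-out words."""
    if not word_log_probs:
        raise ValueError("need at least one held-out word")
    return -sum(word_log_probs) / len(word_log_probs)


# Hypothetical log-probabilities a model assigns to three test words.
held_out = [math.log(0.5), math.log(0.25), math.log(0.25)]
test_cross_entropy = cross_entropy(held_out)
```

A lower value means the model assigns higher probability to unseen words.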

The second analyses all words with probability above a threshold delta to approximate the model's Rényi entropy:

$ make get_renyi MONOMORPHEMIC=True LANGUAGE=<language> MODEL=<model>
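The idea of a thresholded approximation can be sketched as follows, assuming the collision entropy (order 2) is the target. Only words with probability above delta contribute to the sum of squared probabilities, so the truncated sum can only underestimate the true collision mass and the resulting entropy estimate is an upper bound. The toy word list, the function name, and the choice to return the number of kept words are assumptions for illustration; this is not the repository's implementation.

```python
import math


def truncated_collision_entropy(word_probs, delta):
    """Estimate -log(sum_w p(w)^2) using only words with p(w) > delta.

    Returns the estimate (in nats) and the number of words kept.
    """
    kept = [p for p in word_probs.values() if p > delta]
    collision_mass = sum(p * p for p in kept)
    if collision_mass == 0.0:
        raise ValueError("no word has probability above delta")
    return -math.log(collision_mass), len(kept)


# Hypothetical model distribution over five word forms.
toy_model = {"kat": 0.4, "tak": 0.3, "akt": 0.2, "kta": 0.06, "tka": 0.04}

exact, _ = truncated_collision_entropy(toy_model, delta=0.0)
approx, n_kept = truncated_collision_entropy(toy_model, delta=0.05)
```

Each dropped word has probability at most delta, so the collision mass that is missed is at most delta times the total probability of the dropped words. Lowering delta therefore tightens the estimate, at the cost of enumerating more words.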

Finally, the third samples artificial lexica from the language models to run the null hypothesis test:

$ make sample_renyi MONOMORPHEMIC=True LANGUAGE=<language> MODEL=<model>
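A sampling-based null hypothesis test of this kind can be sketched as below. The sketch assumes a toy categorical distribution in place of a trained phonotactic model, a test statistic that counts lexicon entries sharing their form with another entry, and a one-sided comparison. All three are illustrative assumptions, and every name is hypothetical; the statistic actually used by the repository may differ.

```python
import random
from collections import Counter


def homophony_count(lexicon):
    """Number of entries whose form is shared with at least one other entry."""
    counts = Counter(lexicon)
    return sum(c for c in counts.values() if c > 1)


def sample_lexicon(forms, probs, size, rng):
    """Draw `size` word forms independently from the model distribution."""
    return rng.choices(forms, weights=probs, k=size)


def empirical_p_value(observed, forms, probs, size, n_samples=1000, seed=0):
    """Share of sampled lexica with at least as much homophony as observed."""
    rng = random.Random(seed)
    as_extreme = 0
    for _ in range(n_samples):
        sampled = sample_lexicon(forms, probs, size, rng)
        if homophony_count(sampled) >= observed:
            as_extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (as_extreme + 1) / (n_samples + 1)


# Hypothetical model distribution and observed statistic.
forms = ["kat", "tak", "akt", "kta", "tka"]
probs = [0.4, 0.3, 0.2, 0.06, 0.04]
p_value = empirical_p_value(observed=3, forms=forms, probs=probs, size=4)
```

Fixing the seed makes the sampled lexica reproducible, and increasing n_samples reduces the Monte Carlo error of the estimated p-value.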

Analyse models

Finally, to analyse the models and print the results, run:

$ make analyse MONOMORPHEMIC=True LANGUAGE=<language>

Extra Information


If this code or the paper was useful to you, consider citing the paper:

    title = "On Homophony and Rényi Entropy",
    author = "Pimentel, Tiago and
    Meister, Clara and
    Teufel, Simone and
    Cotterell, Ryan",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    year = "2021",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/2109.13766",


To ask questions or report problems, please open an issue.