A pun generator based on the surprisal principle

Pun Generation with Surprise

This repo contains code and data for the paper Pun Generation with Surprise.

Requirements

Training

Word relatedness model

We approximate relatedness between a pair of words with a long-distance skip-gram model trained on BookCorpus sentences. The raw BookCorpus data is parsed by scripts/preprocess_raw_text.py; a sample of the expected format is in sample_data/bookcorpus/raw/train.txt.
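The "long-distance" part means that only context words a certain number of tokens away from the target count as positive pairs. A minimal sketch of that pairing step (the function name and details are illustrative, not the repo's API):

```python
def long_distance_pairs(tokens, min_dist=5, max_dist=10):
    """Yield (target, context) pairs whose positions are between
    min_dist and max_dist tokens apart, mirroring --min-dist/--max-dist."""
    pairs = []
    for i, target in enumerate(tokens):
        for j, context in enumerate(tokens):
            if min_dist <= abs(i - j) <= max_dist:
                pairs.append((target, context))
    return pairs

sent = "the old lighthouse keeper watched the storm roll in over the grey sea".split()
pairs = long_distance_pairs(sent)
# adjacent words like ("old", "lighthouse") are excluded;
# distant ones like ("keeper", "over") are kept
```

Training on such pairs pushes the embeddings toward topical relatedness rather than syntactic co-occurrence.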

Preprocess bookcorpus data:

python -m pungen.wordvec.preprocess --data-dir data/bookcorpus/skipgram \
    --corpus data/bookcorpus/raw/train.txt \
    --min-dist 5 --max-dist 10 --threshold 80 \
    --vocab data/bookcorpus/skipgram/dict.txt

Train skip-gram model:

python -m pungen.wordvec.train --weights --cuda --data data/bookcorpus/skipgram/train.bin \
    --save_dir models/bookcorpus/skipgram \
    --mb 3500 --epoch 15 \
    --vocab data/bookcorpus/skipgram/dict.txt
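Once the skip-gram model is trained, scoring relatedness between two words reduces to a similarity between their vectors. A self-contained sketch with toy embeddings standing in for the trained weights (the actual loading code lives in pungen.wordvec):

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy vectors standing in for trained skip-gram embeddings.
emb = {
    "doctor": [0.9, 0.1, 0.3],
    "nurse":  [0.8, 0.2, 0.35],
    "banana": [0.1, 0.9, 0.0],
}

def relatedness(w1, w2):
    return cosine(emb[w1], emb[w2])
```

With real embeddings, `relatedness("doctor", "nurse")` should come out well above `relatedness("doctor", "banana")`.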

Edit model

The edit model takes a word and a template (a sentence with a masked slot) and combines the two into a coherent sentence.

Preprocess data:

for split in train valid; do \
    PYTHONPATH=. python scripts/make_src_tgt_files.py -i data/bookcorpus/raw/$split.txt \
        -o data/bookcorpus/edit/$split --delete-frac 0.5 --window-size 2 --random-window-size; \
done
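The src/tgt construction can be sketched as: pick a word, delete it together with a window of neighbors, and emit the deleted words plus the masked template as the source, with the original sentence as the target. A hypothetical helper illustrating the idea (not the script's exact logic; token names like `<placeholder>` and `<sep>` are assumptions):

```python
import random

def make_src_tgt(tokens, placeholder="<placeholder>", seed=0):
    """Mask a random word plus a window around it, mimicking
    the template creation in scripts/make_src_tgt_files.py."""
    rng = random.Random(seed)
    i = rng.randrange(len(tokens))
    window = rng.randint(0, 2)  # --random-window-size draws a window per example
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    deleted = tokens[lo:hi]
    template = tokens[:lo] + [placeholder] + tokens[hi:]
    src = deleted + ["<sep>"] + template  # deleted words + masked sentence
    tgt = list(tokens)                    # target: reconstruct the original
    return src, tgt

src, tgt = make_src_tgt("the cat sat on the mat".split())
```

The model then learns to fill the placeholder so that the inserted word fits the surrounding template.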

python -m pungen.preprocess --source-lang src --target-lang tgt \
    --destdir data/bookcorpus/edit/bin/data --thresholdtgt 80 --thresholdsrc 80 \
    --validpref data/bookcorpus/edit/valid \
    --trainpref data/bookcorpus/edit/train \
    --workers 8

Training:

python -m pungen.train data/bookcorpus/edit/bin/data -a lstm \
    --source-lang src --target-lang tgt \
    --task edit --insert deleted --combine token \
    --criterion cross_entropy \
    --encoder lstm --decoder-attention True \
    --optimizer adagrad --lr 0.01 --lr-scheduler reduce_lr_on_plateau --lr-shrink 0.5 \
    --clip-norm 5 --max-epoch 50 --max-tokens 7000 --no-epoch-checkpoints \
    --save-dir models/bookcorpus/edit/deleted --no-progress-bar --log-interval 5000

Retriever

Build a sentence retriever over BookCorpus. The input file should contain one tokenized sentence per line.

python -m pungen.retriever --doc-file data/bookcorpus/raw/sent.tokenized.txt \
    --path models/bookcorpus/retriever.pkl --overwrite
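Conceptually the retriever maps a query word to corpus sentences containing it. A minimal inverted-index sketch (the repo's retriever builds and pickles its own index; this is only illustrative):

```python
from collections import defaultdict

class SentenceRetriever:
    """Inverted index from word -> ids of sentences containing it."""
    def __init__(self, sentences):
        self.sentences = [s.split() for s in sentences]
        self.index = defaultdict(set)
        for i, toks in enumerate(self.sentences):
            for tok in toks:
                self.index[tok].add(i)

    def retrieve(self, keyword, n=5):
        hits = sorted(self.index.get(keyword, ()))
        return [" ".join(self.sentences[i]) for i in hits[:n]]

corpus = [
    "the waiter brought the check",
    "she wrote a check to the landlord",
    "he packed his hat and gloves",
]
r = SentenceRetriever(corpus)
# r.retrieve("check") returns the two sentences mentioning "check"
```

For pun generation, retrieved sentences containing the alternative word serve as candidate contexts to edit.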

Analyze what makes a pun funny

Compute the correlation between local-global surprise scores and human funniness ratings. We provide our annotated dataset in data/funniness_annotation.
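The analysis reduces to correlating two lists of numbers, one surprise score and one funniness rating per pun. A pure-Python Pearson correlation as a sketch (the numbers below are made up for illustration):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-pun scores and ratings.
surprise = [0.1, 0.4, 0.35, 0.8]
funniness = [1.0, 2.5, 2.0, 4.0]
r = pearson(surprise, funniness)
```

A positive correlation supports the hypothesis that surprise predicts funniness.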

Generate puns

We generate puns given a pun word and its alternative word. Several generation methods are supported, selected via the system argument.
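Behind all methods is the surprisal principle: a pun word is funny when it is unlikely given its immediate (local) context but plausible given the whole (global) sentence. A toy scoring sketch with hand-set probabilities (the real system estimates these with language models):

```python
def surprise_score(p_local, p_global):
    """Local-global surprise: high when the pun word is improbable
    locally but probable globally, scored as a probability ratio."""
    return p_global / p_local

# Hypothetical probabilities of the pun word in two candidate sentences.
candidates = {
    "sentence_a": {"p_local": 0.001, "p_global": 0.05},
    "sentence_b": {"p_local": 0.04,  "p_global": 0.05},
}
best = max(candidates, key=lambda k: surprise_score(**candidates[k]))
# sentence_a wins: locally surprising yet globally coherent
```

Candidates that are merely incoherent score low globally, so the ratio filters them out.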

Reference

If you use the annotated SemEval pun dataset, please cite our paper:

@inproceedings{he2019pun,
    title={Pun Generation with Surprise},
    author={He He and Nanyun Peng and Percy Liang},
    booktitle={North American Chapter of the Association for Computational Linguistics (NAACL)},
    year={2019}
}