LM Syntactic Generalization

Getting the RNN language models

Download the colorlessgreenRNNs repository from Gulordava et al. (2018). Then download the full English vocabulary and the English language model.

The vocab should be saved at colorlessgreenRNNs/src/data/lm (you may have to make a new folder), and the model should be saved to colorlessgreenRNNs/src/models.

Our models are available at https://huggingface.co/sathvik-n/augmented-rnns.

Downloading files to go with GRNN

Make sure you are in the root directory (not one of the subdirectories) when running each of these downloads.

Vocabulary:

mkdir data/lm-data
cd data/lm-data
wget https://dl.fbaipublicfiles.com/colorless-green-rnns/training-data/English/vocab.txt

Pretrained English model:

mkdir models
cd models
wget https://dl.fbaipublicfiles.com/colorless-green-rnns/best-models/English/hidden650_batch128_dropout0.2_lr20.0.pt
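
Optionally, you can sanity-check the download by loading the checkpoint from Python. This is a minimal sketch with two assumptions: the colorlessgreenRNNs source (which defines the model class) lives at the path added to sys.path below, and your PyTorch version accepts the weights_only argument (drop it on older versions):

import sys
import torch

# The checkpoint is a fully pickled model object, so the module that defines it
# must be importable; this path is an assumption about the repo layout.
sys.path.insert(0, "colorlessgreenRNNs/src/language_models")

model = torch.load(
    "models/hidden650_batch128_dropout0.2_lr20.0.pt",
    map_location="cpu",       # load on CPU regardless of where it was trained
    weights_only=False,       # needed on recent PyTorch to unpickle a full model
)
model.eval()
print(model)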

CFGs and Sentence Generation

The file grammars.py contains text-based specifications for the different CFGs, based on the appendices from Lan et al.
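
For a feel of how sentences are generated from a CFG, here is a minimal, illustrative sketch using NLTK with a toy grammar; the actual grammars in grammars.py follow Lan et al.'s appendices and use the repo's own text-based format:

import nltk
from nltk.parse.generate import generate

# Toy grammar for illustration only; not one of the grammars in grammars.py.
toy_grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> 'the' N
VP -> V NP
N -> 'chef' | 'dish'
V -> 'knows' | 'prepared'
""")

# Enumerate up to 10 sentences licensed by the toy grammar.
for tokens in generate(toy_grammar, depth=6, n=10):
    print(" ".join(tokens))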

Data

Example sentences generated by each of the CFGs, formatted as JSON, are located in grammar_outputs/sentence_lists. 2x2 tuples containing each sentence type and surprisal effect for each CFG are located in grammar_outputs/tuples, also in JSON format. The results for the pretrained model are in grammar_outputs/experiment1/grnn, and the augmented models' results are in grammar_outputs/experiment2/grnn. Wilcox et al.'s stimuli are listed in data/wilcox_csv, and the corresponding outputs are in grammar_outputs/wilcox_replication.
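
To take a quick look at these outputs, something like the following works; the exact file names and JSON structure depend on what the generation scripts produced, so treat this as a sketch:

import glob
import json

# Report how many entries each sentence-list file contains.
for path in sorted(glob.glob("grammar_outputs/sentence_lists/*.json")):
    with open(path) as f:
        contents = json.load(f)
    print(path, "->", len(contents), "entries")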

Retraining the LMs

To augment the training data, run python augment_with_dependency.py --data_dir $DATA --dependency_name $DEPENDENCY --augmenting_data $CFG_DIR, where $DATA points to the LM's training data, already split into training and validation sets (the directory should contain train.txt and valid.txt). If you downloaded the data from the Gulordava et al. repo, this split has already been done for you. The script creates a folder named $DEPENDENCY containing the augmented training data. $CFG_DIR should be set to a CSV in grammar_outputs/revised_training. Set these environment variables appropriately before running the command.
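
For concreteness, here is a sketch of one invocation driven from Python; the argument values are illustrative assumptions, not the settings used in the paper:

import subprocess

data_dir = "colorlessgreenRNNs/src/data/lm"  # folder containing train.txt and valid.txt
dependency = "topicalization"                # name of the folder the script will create
cfg_csv = "grammar_outputs/revised_training/topicalization.csv"  # hypothetical CSV name

# Equivalent to the command above with $DATA, $DEPENDENCY, and $CFG_DIR set.
subprocess.run(
    [
        "python", "augment_with_dependency.py",
        "--data_dir", data_dir,
        "--dependency_name", dependency,
        "--augmenting_data", cfg_csv,
    ],
    check=True,
)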

There are scripts to retrain the model in a cluster environment; I modified retrain_grnn.sh to train RNNs on clefting and topicalization at the same time.

Inference for the Retrained LMs

To run inference with a retrained model, change the model path in surprisal.py from the pretrained RNN to the model you want to use, make a directory for the outputs in grammar_outputs, update cfg_sentence_generation.py accordingly, and then run it. To evaluate the model on Wilcox et al.'s stimuli, run run_wilcox_replication.py.
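
For reference, per-token surprisal from an RNN LM is the negative log probability of each word given its left context. The sketch below is not the repo's implementation: it assumes the model was loaded as shown earlier, that vocab is a plain word-to-index dict, and that the model follows the colorlessgreenRNNs interface (logits, hidden = model(input_ids, hidden) and model.init_hidden(batch_size)); surprisal.py may differ in the details.

import math
import torch

def sentence_surprisals(model, vocab, tokens, unk="<unk>"):
    """Return surprisal (-log2 p(w_t | w_<t)) for every token after the first."""
    ids = [vocab.get(w, vocab[unk]) for w in tokens]
    hidden = model.init_hidden(1)                  # batch size 1
    surprisals = []
    for prev, nxt in zip(ids[:-1], ids[1:]):
        inp = torch.tensor([[prev]])               # shape: (seq_len=1, batch=1)
        with torch.no_grad():
            logits, hidden = model(inp, hidden)
        log_probs = torch.log_softmax(logits.squeeze(), dim=-1)
        surprisals.append(-log_probs[nxt].item() / math.log(2))
    return surprisals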

Notebooks

Graphs for the paper are in pretrained_comparisons.ipynb. Analyses of the Wilcox et al. stimuli are based on comparison_wilcox.ipynb, which we used to verify that our surprisal computation was working correctly. Plots and statistical analyses for the pretrained RNN are in simple_cfg_analysis.ipynb; the same plots and measures for the retrained models are in retraining_analysis.ipynb.

If you use this implementation, please cite: Katherine Howitt, Sathvik Nair, Allison Dods, and Robert Melvin Hopkins (2024). Generalizations across Filler-Gap Dependencies in Neural Language Models. Conference on Computational Natural Language Learning (CoNLL 2024).