kolchinski / reddit-sarc

User modeling for sarcasm detection on Reddit corpus from Khodak et al. Published in EMNLP 2018.
MIT License

reddit-sarc

Sarcasm detection on the Reddit corpus from Khodak et al. (http://nlp.cs.princeton.edu/SARC/2.0/)

pull_all_data.sh provides commands to download, extract, and move all data to the locations expected by the code. pull_small_data.sh does the same, but only for a subset of the data.

logs/ contains the logs for all of the test runs reported in the paper, as well as the script logs/significance_tests.py, which was used to generate confidence intervals
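The method used by logs/significance_tests.py is not described in this README. As an illustration only, the sketch below shows one common way to get a confidence interval from a handful of per-run scores: a percentile bootstrap over the mean. The function name `bootstrap_ci`, its parameters, and the example scores are all hypothetical and are not taken from the repository.

```python
import random
import statistics


def bootstrap_ci(values, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`.

    Hypothetical helper for illustration; not the repository's script.
    """
    if not values:
        raise ValueError("values must be non-empty")
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    # Read off the alpha/2 and 1 - alpha/2 percentiles of the resampled means.
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Made-up accuracies from five runs, purely to show the call.
example_scores = [0.70, 0.72, 0.74, 0.71, 0.73]
low, high = bootstrap_ci(example_scores)
```

With only a few runs the interval is coarse, so a sketch like this is best read as a rough sanity check rather than a replacement for the script's own output.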

src/baselines.py contains code to replicate the baselines from Khodak et al.'s dataset paper

src/main.py is configured for a run to test whichever "spreadsheet cell index" model/dataset combination is passed on the command line, e.g. python main.py B2. Commented-out sections in main.py provide examples for other ways to use the code.
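To make the "spreadsheet cell index" idea concrete, here is a minimal sketch of how an argument such as B2 could be split into a column letter and a row number before looking up a model/dataset combination. The function `parse_cell_index` is hypothetical, and which axis corresponds to the model and which to the dataset is not stated here; the actual lookup lives in main.py and test_configs.py.

```python
import re


def parse_cell_index(cell):
    """Split a spreadsheet-style index such as 'B2' into ('B', 2).

    Hypothetical helper for illustration; not the repository's parser.
    """
    match = re.fullmatch(r"([A-Za-z]+)(\d+)", cell.strip())
    if match is None:
        raise ValueError(f"not a spreadsheet cell index: {cell!r}")
    # Normalise the column to upper case and convert the row to an integer.
    return match.group(1).upper(), int(match.group(2))


column, row = parse_cell_index("B2")
```

In a command-line script, the string passed to the helper would typically come from `sys.argv[1]`.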

src/rnn.py contains the actual RNN class, as well as the helper class used to train and evaluate the RNN, generate graphs, etc.

src/rnn_util.py contains numerous helper functions for data transformation, generating user representations, etc.

src/run_askreddit_tests.sh and run_tests.sh contain examples of how to run the final tests

src/test_configs.py contains hyperparameter configurations for all results reported in the paper

src/util.py contains generic helper functions for data reading and processing

sample_predictions/ contains sample predictions from the 3 user models on holdout sets for one run each on the balanced r/politics and balanced full datasets