Code for the paper "Author Profiling for Abuse Detection", published in the Proceedings of the 27th International Conference on Computational Linguistics (COLING) 2018.
If you use this code, please cite our paper:
@inproceedings{mishra-etal-2018-author,
    title = "Author Profiling for Abuse Detection",
    author = "Mishra, Pushkar and
      Del Tredici, Marco and
      Yannakoudakis, Helen and
      Shutova, Ekaterina",
    booktitle = "Proceedings of the 27th International Conference on Computational Linguistics",
    month = aug,
    year = "2018",
    address = "Santa Fe, New Mexico, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/C18-1093",
    pages = "1088--1098",
}
Python 3.5+ is required to run the code. Install the dependencies with pip install -r requirements.txt, followed by python -m nltk.downloader punkt.
The dataset for the code is provided in the _TwitterData/twitter_data_waseemhovy.csv file as a list of [tweet ID, annotation] pairs. Before running the code:
1. Use a Twitter API (_twitteraccess.py employs Tweepy) to retrieve the tweets for the given tweet IDs, and replace the dataset file with a file of the same name containing [tweet ID, tweet, annotation] triples.
2. Use the functions in _twitteraccess.py to retrieve the follower-following relationships amongst the authors of the tweets (specified in resources/authors.txt).
3. Once the relationships have been retrieved, use Node2vec (see resources/node2vec) to produce an embedding for each author, and store the embeddings in a file named authors.emb in the resources directory.
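The tweet-retrieval step (step 1 above) can be sketched as follows. This is an illustrative helper, not part of the repository: the hydrate function and its fetch_text callable are assumptions; in practice fetch_text would wrap a Tweepy call such as api.get_status.

```python
import csv

def hydrate(csv_in, csv_out, fetch_text):
    """Turn [tweet ID, annotation] pairs into [tweet ID, tweet, annotation] triples.

    fetch_text: a callable mapping a tweet ID to its text, e.g. via Tweepy:
        lambda tid: api.get_status(tid, tweet_mode='extended').full_text
    Tweets that can no longer be retrieved (deleted or protected accounts)
    are skipped, so the output may contain fewer rows than the input.
    """
    with open(csv_in, newline='') as f_in, open(csv_out, 'w', newline='') as f_out:
        reader, writer = csv.reader(f_in), csv.writer(f_out)
        for tweet_id, annotation in reader:
            try:
                text = fetch_text(tweet_id)
            except Exception:  # deleted tweet, protected account, rate limit, ...
                continue
            writer.writerow([tweet_id, text, annotation])
```

The resulting file should replace _TwitterData/twitter_data_waseemhovy.csv under the same name.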
To run the best method (LR + AUTH):
python twitter_model.py -c 16202 -m lna
To run the other methods:
python twitter_model.py -c 16202 -m a
python twitter_model.py -c 16202 -m ln
python twitter_model.py -c 16202 -m ws
python twitter_model.py -c 16202 -m hs
python twitter_model.py -c 16202 -m wsa
python twitter_model.py -c 16202 -m hsa
For the HS and WS based methods, adding the -ft flag to the command ensures that the pre-trained deep neural models from the Models directory are not used; instead, all training happens from scratch. This requires that the file of pre-trained GloVe embeddings is downloaded from http://nlp.stanford.edu/data/glove.twitter.27B.zip, unzipped and placed in the resources directory prior to execution.
An overview of the complete training-testing flow is as follows: in the 10-fold cross-validation, steps 3-7 are run 10 times (each time with a different fold of tweets held out as the test set), and the final precision, recall and F1 are computed by averaging the results across the 10 runs.
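The averaging over folds can be sketched as follows. This is an illustrative, stdlib-only sketch of the cross-validation loop described above, not the repository's implementation; the function names and the per-fold train_and_eval callable are assumptions.

```python
import random

def ten_fold_indices(n, seed=42):
    """Shuffle indices 0..n-1 and partition them into 10 disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::10] for i in range(10)]

def cross_validate(items, train_and_eval):
    """train_and_eval(train, test) -> (precision, recall, f1) for one fold.

    Runs 10 folds, each time holding out a different fold as the test set,
    and returns the precision, recall and F1 averaged across the 10 runs.
    """
    folds = ten_fold_indices(len(items))
    scores = []
    for k in range(10):
        test = [items[i] for i in folds[k]]
        train = [items[i] for j in range(10) if j != k for i in folds[j]]
        scores.append(train_and_eval(train, test))
    return tuple(sum(s[i] for s in scores) / 10 for i in range(3))
```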