This repo contains the annotated corpus and code for the paper "Extracting COVID-19 Events from Twitter".
If you use our corpus, please cite our paper:

```bibtex
@misc{zong2020extracting,
    title={Extracting COVID-19 Events from Twitter},
    author={Shi Zong and Ashutosh Baheti and Wei Xu and Alan Ritter},
    year={2020},
    eprint={2006.02567},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```
2020-11-13: To get access to our dataset, please send an e-mail to zong.56@osu.edu. Thanks!
In this work, we aim to extract 5 types of events from Twitter: (1) tested positive, (2) tested negative, (3) can not test, (4) death and (5) cure and prevention. The following table provides statistics for our current corpus.
Event Type | # of Annotated Tweets | # of Slots |
---|---|---|
Tested positive | 2500 | 9 |
Tested negative | 1200 | 8 |
Can not test | 1200 | 5 |
Death | 1300 | 6 |
Cure & prevention | 1300 | 3 |
All annotated tweets are stored in `.jsonl` files under the `data` folder. Our annotated corpus is released in the following format:
```
{'id': '1238504197319995397',
 'candidate_chunks_offsets':
    [[0, 19], [27, 52], [42, 65], [101, 112],
     [0, 9], [13, 19], [27, 36], [42, 52],
     [56, 65], [89, 91], [96, 112], [117, 121]],
 'annotation':
    {'part1.Response': ['yes'],
     'part2-age.Response': ['Not Specified'],
     'part2-close_contact.Response': ['Not Specified'],
     'part2-employer.Response': ['Not Specified'],
     'part2-gender.Response': ['Not Specified'],
     'part2-name.Response': [[101, 112], [0, 9]],
     'part2-recent_travel.Response': ['Not Specified'],
     'part2-relation.Response': ['Not Specified'],
     'part2-when.Response': [[13, 19]],
     'part2-where.Response': [[56, 65]]}
}
```
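As a quick orientation, below is a minimal sketch of reading one annotation file and resolving the slot annotations; the file name is a placeholder, and the `text` field is only present after the tweets have been downloaded and merged in with `load_data.py` (described below):

```python
import json

# Placeholder path: substitute one of the annotation files under the data folder.
with open('data/ANNOTATED_EVENT_FILE.jsonl') as f:
    for line in f:
        record = json.loads(line)
        text = record.get('text', '')  # tweet text, present after running load_data.py
        for slot, values in record['annotation'].items():
            # Slot values are either labels (e.g. 'yes', 'Not Specified')
            # or [start, end] character offsets into the tweet text.
            if values and isinstance(values[0], list):
                print(slot, [text[start:end] for start, end in values])
            else:
                print(slot, values)
```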
The annotation files reference tweets by ID only; the tweet text itself has to be downloaded. We use the `full_text` field of each tweet, retrieved in extended mode, as the tweet text:

```python
a_single_tweet = api.get_status(id='id_for_tweet', tweet_mode='extended')
tweet_text_we_use = a_single_tweet.full_text  # tweepy Status objects expose the text as an attribute
```
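For reference, here is a minimal sketch of how the `api` object used above can be constructed with Tweepy; the credential placeholders are yours to fill in, and the `download_data.py` script described next handles all of this for you:

```python
import tweepy

# Placeholders only: substitute your own Twitter API credentials.
auth = tweepy.OAuthHandler('your_API_key', 'your_API_secret_key')
auth.set_access_token('your_access_token', 'your_access_token_secret')

# wait_on_rate_limit tells tweepy to sleep through rate-limit windows instead of erroring out.
api = tweepy.API(auth, wait_on_rate_limit=True)
```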
We provide a script to download tweets using `tweepy`. Prepare your Twitter API keys and tokens, and then run

```
python download_data.py --API_key your_API_key \
                        --API_secret_key your_API_secret_key \
                        --access_token your_access_token \
                        --access_token_secret your_access_token_secret
```
Please allow the script to run for a while. The downloaded tweets will be stored under the `data` folder, named `downloaded_tweets.jsonl`.
We use the Twitter tagging tool for tokenization.
We suggest using the tagging tool in the following way: it reads in JSON-line files and directly appends a `tags` field to each line of the original file. Please make sure there is a `text` field for each line (this field is already added if you use our `download_data.py` script). Please use `python2` to run the tagging tool.
```
cat PATH_TO_downloaded_tweets.jsonl | python2 python/ner/extractEntities2_json.py --pos --chunk > PATH_TO_downloaded_tweets-tagging.jsonl
```
Once you get the tagging file, store it under the `data` folder, named `downloaded_tweets-tagging.jsonl`. Then run the following command:
```
python load_data.py
```
This script will add the tweet text and tags into the original annotations.
To predict the structured information (slots) within a tweet, we set up a binary classification task: given the tweet `t` and a candidate slot `s`, the classifier `f` has to predict whether the slot correctly answers the question about the tweet, i.e. `f(t, s) -> {0, 1}`.
We experiment with a Logistic Regression baseline and a BERT-based classifier:
- Logistic Regression baseline: marks the candidate slot `s` in the tweet `t` with a special symbol `<Q_TOKEN>` and then makes the binary prediction for each slot-filling task using word n-gram features (n = 1, 2, 3). Model code at `model/logistic_regression_baseline.py`.
- BERT-based classifier: encloses the candidate slot `s` in the tweet `t` inside special entity start and end markers, `<E>` and `</E>` respectively. The BERT hidden representation of the entity start marker `<E>` is used to predict the final label for each task. We also share the BERT model across the slot-filling tasks within each event type (since multiple slots within an event are related to each other). Model code at `model/multitask_bert_entity_classifier.py`. A toy illustration of the candidate-marking step is given below.
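The candidate-marking step can be illustrated with a small example. This is only a sketch with a hypothetical helper, not the repo's actual preprocessing code (which lives in the `model/` scripts):

```python
# Wrap a candidate chunk, given as character offsets, in entity markers.
def mark_candidate(text, start, end, open_tag='<E>', close_tag='</E>'):
    """Return the tweet text with the candidate slot enclosed in markers."""
    return text[:start] + open_tag + ' ' + text[start:end] + ' ' + close_tag + text[end:]

tweet = "Just got my results back, my mother tested positive for COVID-19."
print(mark_candidate(tweet, 26, 35))
# -> Just got my results back, <E> my mother </E> tested positive for COVID-19.
```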
To recreate all the Logistic Regression experiment results in the paper, run
```
python automate_logistic_regression_baseline_experiments.py
```
To recreate all the BERT classifier experiment results in the paper, run
```
python automate_multitask_bert_entity_classifier_experiments.py
```
Both `automate_...` scripts will first preprocess the data files, then train the classifiers (if they have not been trained already), and finally consolidate all the results into a single TSV file. For Logistic Regression the final results will be saved at `results/all_experiments_lr_baseline_results.tsv`, and for the BERT classifier at `results/all_experiments_multitask_bert_entity_classifier_fixed_results.tsv`.
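For a quick look at the consolidated results, the TSV files can be loaded with pandas (the exact column layout depends on the experiments, so this is only a sketch):

```python
import pandas as pd

# Load the consolidated Logistic Regression results and show the first rows.
results = pd.read_csv('results/all_experiments_lr_baseline_results.tsv', sep='\t')
print(results.head())
```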
The code depends on the following packages (note that `sklearn` is provided by the `scikit-learn` package on PyPI):

```
sklearn
scipy==1.4.1
transformers==2.9.0
tqdm
torch==1.5.0
```
We are organizing a shared task on COVID-19 event extraction from Twitter, using our annotated corpus, at the EMNLP 2020 Workshop on User-generated Text. The system description papers will be peer-reviewed and published as part of the EMNLP 2020 Workshop Proceedings (ACL Anthology).
Check the `shared_task` folder for the provided baseline models and evaluation scripts for the shared task.