Share dataset of Tweets about the novel coronavirus.

lschmelzeisen commented 4 years ago

I am currently using NASTY to retrieve all Tweets about the ongoing coronavirus outbreak. Since many others are presumably doing the same, I think it is best to concentrate crawling efforts in one place and then share the results publicly.

Therefore, I am documenting my current methodology here and am open to suggestions and criticism:

The main goal is finding as many on-topic Tweets (i.e., talking about the novel coronavirus) as possible while including as few off-topic Tweets as is achievable. That means striking a sensible balance between precision and recall.
To this end, search requests are used to find Tweets containing at least one of the keywords corona, coronavirus, covid, covid19, ncov, sars, wuhan that were authored after 1 Dec 2019 in either English or German.
- For this, I issue one search request per day (using the nasty search --daily feature). For example, a single request uses the query corona, the time range from 1 Dec 2019 to 2 Dec 2019, and both the TOP and LATEST --filters; the next request covers the following day, and so on. Based on initial experiments, this seems to yield more results and can easily be extended later, but more investigation of Twitter's search algorithm would be useful here.
- This will result in some off-topic matches (for example, Corona beer), but these should be negligible, on the assumption that on-topic Tweets have far outnumbered them recently (starting mid-January).
- December 2019 is included as a short period before the outbreak of the coronavirus, to provide a baseline for the frequency of off-topic Tweets.
- I assume that most people reading this are only interested in the English Tweets, but since I am retrieving German Tweets for a personal research project, I will include these anyway. Tweets will be separated by language, so non-English ones can easily be filtered out.
Additional ways to retrieve on-topic Tweets would be to manually identify Twitter users who mostly tweet about corona (we can't simply follow everyone who has tweeted about corona at some point, as that would presumably cost a lot of precision) or to retrieve replies to known on-topic Tweets (e.g., any Tweet matching the above search criteria). However, both were deemed too expensive for now.
One thing I may do in the future is look up influential hashtags for each week (e.g., #masks4all) and add search requests for these.
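
To make the per-day request scheme above more concrete, here is a minimal Python sketch that only enumerates the requests. It does not call NASTY at all: SearchRequest and daily_requests are hypothetical names, the language codes "en" and "de" are assumed, and treating TOP and LATEST as two separate requests is an assumption about how the filters are applied.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from itertools import product
from typing import Iterator

# Keywords, languages, and filters as listed in the methodology above.
KEYWORDS = ["corona", "coronavirus", "covid", "covid19", "ncov", "sars", "wuhan"]
LANGUAGES = ["en", "de"]  # assumed language codes for English and German
FILTERS = ["TOP", "LATEST"]


@dataclass(frozen=True)
class SearchRequest:
    """Hypothetical container for one request; not a NASTY class."""

    query: str
    since: date  # first day covered by the request
    until: date  # day after `since`, so each request spans one day
    language: str
    result_filter: str


def daily_requests(first_day: date, last_day: date) -> Iterator[SearchRequest]:
    """Yield one request per keyword, language, and filter for each day."""
    day = first_day
    while day <= last_day:
        next_day = day + timedelta(days=1)
        for query, language, result_filter in product(KEYWORDS, LANGUAGES, FILTERS):
            yield SearchRequest(query, day, next_day, language, result_filter)
        day = next_day


if __name__ == "__main__":
    requests = list(daily_requests(date(2019, 12, 1), date(2020, 4, 5)))
    print(f"{len(requests)} requests, first: {requests[0]}")
```

Splitting by day keeps each request small, so adding a keyword or extending the date range later only means generating the missing combinations.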

So far, I have crawled about 68.5 million English and 2.2 million German Tweets in the time span from 1 Dec 2019 to 5 Apr 2020 (about 34 GB of compressed JSON with metadata). I plan to continuously expand this collection over the coming months. I am not quite sure yet when I will be ready to share it or how I will do so (probably using NASTY's idify feature).
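
For illustration only, sharing Tweets as IDs separated by language could look roughly like the sketch below. This is not NASTY's idify implementation. It assumes one JSON object per line with the id_str and lang fields found in Twitter API Tweet objects; the actual crawl output may be laid out differently, and ids_by_language is a hypothetical helper.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, List


def ids_by_language(json_lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group Tweet IDs by language code, dropping all other Tweet content."""
    ids: Dict[str, List[str]] = defaultdict(list)
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        # Assumption: Tweets without a language fall under "und",
        # the code Twitter uses for an undetermined language.
        ids[tweet.get("lang", "und")].append(tweet["id_str"])
    return dict(ids)


if __name__ == "__main__":
    sample = [
        '{"id_str": "1", "lang": "en", "full_text": "example"}',
        '{"id_str": "2", "lang": "de", "full_text": "Beispiel"}',
    ]
    for language, tweet_ids in ids_by_language(sample).items():
        print(language, tweet_ids)
```

Keeping the IDs as strings avoids precision problems in tools that parse large integers as floating-point numbers.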

If you are interested in this dataset, please leave a comment here. Preferably also leave a very short summary of what you plan to do with it and what you think of the outlined methodology.

olgapapa commented 4 years ago

We are interested in the dataset of Tweets about the novel coronavirus. The dataset seems useful, and it is very important that you have tried to strike a sensible balance between precision and recall. We are working on tackling the challenge of misinformation and would like to use the dataset in our investigation. The lack of annotated data is a major challenge in our research. However, the Tweets might be used as weak labels, and an analysis of them could lead us to important conclusions about how coronavirus misinformation spreads. Could you please give us more information about when you will be ready to share the dataset and what format the Tweets will have?

lschmelzeisen commented 4 years ago

I have finally found the time to publish the Tweets I have retrieved up until now: https://github.com/lschmelzeisen/nasty-ncov-tweets

I realize that in the meantime quite a few other datasets have been released, but this one might still be interesting to you. I will look to expand it in the near future.

Let me know if you have any questions.

lschmelzeisen / nasty

Share dataset of Tweets about the novel coronavirus. #8