Closed by lschmelzeisen 4 years ago
We are interested in the dataset of Tweets about the novel coronavirus. The dataset seems useful, and it matters to us that you have tried to strike a sensible balance between precision and recall. We are working on the challenge of misinformation and would like to use the dataset in our investigation. The lack of annotated data is a major obstacle in our research. However, the Tweets might be usable as weak labels, and analyzing them could lead us to important conclusions about how coronavirus misinformation spreads. Could you please tell us more about when you will be ready to share the dataset and what format the Tweets will have?
I have finally found the time to publish the Tweets I have retrieved up until now: https://github.com/lschmelzeisen/nasty-ncov-tweets
I realize that in the meantime quite a few other datasets have been released, but this one might still be interesting to you. I plan to expand it in the near future.
Let me know if you have any questions.
I am currently in the process of using NASTY to retrieve all Tweets about the ongoing coronavirus. Since many others are presumably doing the same, in my view it is best to concentrate crawling efforts in one location and then share the results publicly.
Therefore, I am documenting my current methodology here and am open to suggestions/criticism:
- I search for Tweets containing any of the keywords corona, coronavirus, covid, covid19, ncov, sars, wuhan that were authored after 1 Dec 2019 in either English or German.
- Each keyword is searched one day at a time (using the nasty search --daily feature), i.e., a single search request would be with query corona, time range from 1 Dec 2019 to 2 Dec 2019, using both the TOP and LATEST --filters. The next request covers the following day, and so on. Based on initial experiments this seems to yield more results and can easily be expanded on later, but more investigation of Twitter's search algorithm would be useful here.

So far, I have crawled about 68.5 million English and 2.2 million German Tweets in the time span from 1 Dec 2019 to 5 Apr 2020 (about 34 GB compressed JSON with metadata). I plan to continuously expand this collection over the upcoming months. I am not quite sure when I will be ready to share this and how I will do so (probably using NASTY's idify feature).
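To make the per-day request scheme concrete, here is a minimal sketch that only enumerates the requests such a crawl would issue. It does not call NASTY. The `DailyRequest` class, the `daily_requests` helper, and the language codes `en`/`de` are hypothetical names chosen for illustration, and treating each filter and each language as a separate request is an assumption on my part, not something stated above.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from itertools import product
from typing import Iterator

# Keywords and filters as listed in the methodology above.
KEYWORDS = ["corona", "coronavirus", "covid", "covid19", "ncov", "sars", "wuhan"]
FILTERS = ["TOP", "LATEST"]
# Assumption: English and German expressed as ISO 639-1 codes.
LANGUAGES = ["en", "de"]


@dataclass(frozen=True)
class DailyRequest:
    """Hypothetical description of one search request (not a NASTY type)."""

    query: str
    since: date  # inclusive start of the one-day window
    until: date  # exclusive end of the one-day window
    filter_name: str
    lang: str


def daily_requests(start: date, end: date) -> Iterator[DailyRequest]:
    """Yield one request per keyword, filter, and language for each day in [start, end)."""
    day = start
    while day < end:
        next_day = day + timedelta(days=1)
        for query, filter_name, lang in product(KEYWORDS, FILTERS, LANGUAGES):
            yield DailyRequest(query, day, next_day, filter_name, lang)
        day = next_day


# Two days of requests: 1 Dec 2019 to 2 Dec 2019, then 2 Dec 2019 to 3 Dec 2019.
requests = list(daily_requests(date(2019, 12, 1), date(2019, 12, 3)))
print(len(requests))
print(requests[0])
```

Under these assumptions every day costs 7 × 2 × 2 = 28 requests, so extending the time span or adding a keyword later only appends new requests without invalidating the ones already crawled.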
If you are interested in this dataset, please leave a comment here. Preferably also leave a very short summary of what you plan to do with it and what you think of the outlined methodology.