tugrulz / CensoredTweets

For Reproducing the Paper "A Dataset of State-Censored Tweets"

Archive data link dead? #1

Open MokeEire opened 3 years ago

MokeEire commented 3 years ago

The readme links to two places for the archive data:

  1. https://archive.org/details/twitterstream&morf=year
  2. https://archive.org/download/archiveteam-twitter-stream-2018-05/twitter-2018-05-02.tar

The second points to a tar file that I could download, but the first leads to a blank page (I am unsure whether we are supposed to replace parts of the URL, such as year). In either case, how would one collect the archive data for other years?

tugrulz commented 3 years ago

Hi, thanks for the interest.

You can try this link: https://archive.org/details/twitterstream

To collect the archive data for other years, one needs to call wget for every available link. I do not have all of them at hand right now, so I will keep this issue open and compile and upload them later.
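In the meantime, the links for a given month can probably be listed programmatically instead of by hand. Below is a minimal sketch, assuming the Internet Archive metadata endpoint (https://archive.org/metadata/<identifier>) and the download URL layout seen in the README link. The function names and the ".tar" suffix filter are my own assumptions for illustration, and items for other months may use different identifiers or file formats.

```python
import json
import urllib.parse
import urllib.request


def fetch_item_metadata(identifier):
    """Fetch the JSON metadata for one archive.org item (needs network access)."""
    url = f"https://archive.org/metadata/{identifier}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def download_urls(identifier, metadata, suffix=".tar"):
    """Build one download URL per file whose name ends with `suffix`.

    Assumption: the metadata has a "files" list whose entries carry a "name",
    and files are served under https://archive.org/download/<identifier>/<name>.
    """
    urls = []
    for entry in metadata.get("files", []):
        name = entry.get("name", "")
        if name.endswith(suffix):
            quoted = urllib.parse.quote(name)
            urls.append(f"https://archive.org/download/{identifier}/{quoted}")
    return urls


if __name__ == "__main__":
    # Identifier taken from the README link; other months are assumed to
    # follow a similar naming scheme, which should be checked on archive.org.
    item = "archiveteam-twitter-stream-2018-05"
    for link in download_urls(item, fetch_item_metadata(item)):
        print(link)
```

The printed links can be saved to a text file and passed to wget with its -i option, one item at a time.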

In any case, collecting the whole archive will be quite costly (it is about 7 TB) and time-consuming. Unless you are interested in the tweets of suspended users, it might be easier to get the data from Twitter using the tweet IDs.

MokeEire commented 3 years ago

I would love to be able to get the data using IDs (presumably using the IDs in the tweets.csv file?). Currently trying to find a way to do just that with R.

tugrulz commented 3 years ago

Unfortunately I do not know R, but in theory all you need to do is load the IDs into a list and then feed that list to the statuses/lookup API endpoint. Any Twitter API wrapper should have a function that calls or wraps "statuses/lookup".
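As a rough illustration of the idea (in Python rather than R), here is a minimal sketch of the batching step. It assumes the v1.1 statuses/lookup endpoint, which takes a comma-separated id parameter of up to 100 IDs per request. The helper names chunked and lookup_params are hypothetical, and the actual HTTP call and authentication are left to whichever wrapper you use.

```python
def chunked(ids, size=100):
    """Yield consecutive batches of at most `size` IDs.

    100 is the per-request limit of statuses/lookup in the v1.1 API.
    """
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def lookup_params(batch):
    """Build the query parameters for one statuses/lookup request."""
    return {"id": ",".join(str(tweet_id) for tweet_id in batch)}


if __name__ == "__main__":
    # Placeholder IDs for illustration only; in practice these would be
    # loaded from the dataset's ID file.
    tweet_ids = list(range(1, 251))
    for batch in chunked(tweet_ids):
        params = lookup_params(batch)
        # Each `params` dict would be sent to statuses/lookup by the wrapper.
        print(len(batch), params["id"][:20])
```

Tweets that were deleted or belong to suspended accounts would simply be missing from the responses, which is why the archive is still needed for those.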