A small Python 3 script to download public Tweets from Twitter accounts into a format suitable for AI text generation tools (such as gpt-2-simple for finetuning GPT-2).
You can view examples of AI-generated tweets from datasets retrieved with this tool in the /examples folder.
First, clone this repository onto your system and install dependencies with the following commands:
git clone https://github.com/sdelgadoc/download-tweets-ai-text-gen-plus.git
cd download-tweets-ai-text-gen-plus
pip3 install -r requirements.txt
Previous versions of this code used scraping libraries to collect tweets. Since then, Twitter has made scraping harder while providing more robust tweet collection APIs. In response, we ported this code to run only with the Twitter API.
To continue the setup, create a Twitter app so you can obtain access to the Twitter API. Once you have created the app, generate access tokens and enter them into the section of the keys.py file shown below.
keys = {'consumer_key': "",
'consumer_secret': "",
'access_token': "",
'access_token_secret': ""}
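Before running the script, it can help to confirm that every credential has been filled in. The following is a minimal sketch of such a check; `missing_credentials` and `REQUIRED_FIELDS` are hypothetical names used for illustration and are not part of this repository.

```python
# Hypothetical helper, not part of the repository: report which
# credential fields in a keys dictionary are absent or still blank.
REQUIRED_FIELDS = ("consumer_key", "consumer_secret",
                   "access_token", "access_token_secret")


def missing_credentials(keys):
    """Return the required field names that are missing or empty."""
    return [name for name in REQUIRED_FIELDS
            if not (keys.get(name) or "").strip()]


# Example with two fields left empty (placeholder values only).
example_keys = {"consumer_key": "abc", "consumer_secret": "",
                "access_token": "xyz", "access_token_secret": " "}
print(missing_credentials(example_keys))
```

An empty list means all four fields have a value; it does not confirm that the values are valid tokens.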
Finally, go to the Twitter API's Dev environments page, generate a Dev environment for the Full Archive API, and enter the environment's name into the label section of the keys.py file shown below.
label = ""
The script is run via a command line interface. After cd-ing into the directory where the script is stored in a terminal, run:
python3 download_tweets.py <twitter_username> 100
For example, to download 100 tweets (sans retweets/replies/quote tweets) from Twitter user @santiagodc, run:
python3 download_tweets.py santiagodc 100
NOTE: The Twitter API's free tier has a collection limit of 5,000 tweets per month, so set a tweet limit to avoid hitting your limit too quickly.
The script can also download tweets from multiple usernames at one time. To do so, first create a text file (.txt) with the list of usernames. Then, run the script referencing the file name:
python3 download_tweets.py <twitter_usernames_file_name> 100
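As an illustration of what a usernames file might contain, here is a minimal sketch that parses one. It assumes one username per line, which the text above does not specify, and `parse_usernames` is a hypothetical helper; the script's own parsing may differ.

```python
def parse_usernames(text):
    """Split file contents into usernames.

    Assumption: one username per line. Blank lines are skipped, a
    leading "@" is dropped, and duplicates are removed in order.
    """
    usernames = []
    for line in text.splitlines():
        name = line.strip().lstrip("@")
        if name and name not in usernames:
            usernames.append(name)
    return usernames


# Sample contents of a usernames .txt file.
sample = "santiagodc\n@minimaxir\n\nsantiagodc\n"
print(parse_usernames(sample))
```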
The tweets will be downloaded to a single-column CSV titled <usernames>_tweets.csv.
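To inspect the downloaded tweets outside of a training tool, the CSV can be read with Python's standard csv module. This is a rough sketch; `read_tweets` is a hypothetical helper, and whether the file starts with a header row is an assumption you should check against your own output.

```python
import csv
import io


def read_tweets(csv_file, has_header=True):
    """Return the first column of each row as a list of tweet texts.

    Assumption: the file may begin with a header row; pass
    has_header=False if yours does not.
    """
    reader = csv.reader(csv_file)
    if has_header:
        next(reader, None)  # skip the header row if present
    return [row[0] for row in reader if row]


# In-memory stand-in for a downloaded CSV file.
sample_csv = io.StringIO('tweets\n"First tweet, with a comma"\nSecond tweet\n')
print(read_tweets(sample_csv))
```

For a real file, open it with `open(path, newline="", encoding="utf-8")` and pass the file object to the helper.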
The parameters you can pass to the command line interface (positionally or explicitly) are:
@ user tags in the tweet text [default: False]
# hashtags in the tweet text [default: False]
The sentiment parameter adds a sentiment category to the tweet text. This information allows the user to train and generate text with different sentiments by changing a parameter.
The output format using the 'simple' text format is the following:
[Sentiment category]
[Tweet text for the tweet that was collected]
The sentiment parameter accepts an integer that specifies the number of sentiment categories that are returned. The sentiment categories for the different possible parameters are the following:
The code supports collecting tweets in a format for training an AI that can reply to other tweets. The output format is based on the format used to train the Subreddit Simulator Reddit community.
The output format is the following:
****ARGUMENTS
ORIGINAL or REPLY: Whether the tweet is an original tweet or a reply
SENTIMENT: If the sentiment parameter is used, text describing the tweet text's sentiment
****PARENT
[Tweet text for the topmost tweet in a reply thread]
****IN_REPLY_TO
[Tweet text for the tweet that is being responded to]
****TWEET
[Tweet text for the tweet that was collected]
To collect tweets in this reply format, run the following command:
python3 download_tweets.py <twitter_username> None True False False False 3 reply
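To make the reply format above concrete, here is a minimal sketch that assembles one record from its parts. `format_reply_record` is a hypothetical helper, not the script's own code. The exact layout of the lines under ****ARGUMENTS and the sentiment label shown are assumptions.

```python
def format_reply_record(tweet, parent="", in_reply_to="", sentiment=None):
    """Build one training record using the section markers described above.

    Assumption: the ARGUMENTS section holds ORIGINAL or REPLY on one
    line, followed by the sentiment text on the next line when given.
    """
    kind = "REPLY" if in_reply_to else "ORIGINAL"
    lines = ["****ARGUMENTS", kind]
    if sentiment is not None:
        lines.append(sentiment)
    lines += ["****PARENT", parent,
              "****IN_REPLY_TO", in_reply_to,
              "****TWEET", tweet]
    return "\n".join(lines)


record = format_reply_record(
    tweet="Thanks, glad it helped!",
    parent="New blog post is up.",
    in_reply_to="Great write-up.",
    sentiment="POSITIVE",  # hypothetical label, for illustration only
)
print(record)
```

When generating replies, a prompt would end after the ****TWEET marker so that the model completes the tweet text.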
By specifying a date, the script will download tweets from the timeframe value to the present. By default, it will download every tweet from a given user (or users) starting from March 22nd, 2006, the day the first tweet ever was sent.
The timeframe parameter is precise: it lets you specify the year, month, day, hour, and minute to download tweets from, in that order. The format the timeframe parameter accepts is YYYYMMDDHHMM.
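The YYYYMMDDHHMM format maps directly onto Python's datetime format codes. The following sketch shows one way to validate and build such values; `parse_timeframe` and `TIMEFRAME_FORMAT` are hypothetical names, and the script may handle the value differently.

```python
from datetime import datetime

# YYYYMMDDHHMM expressed as strptime/strftime format codes.
TIMEFRAME_FORMAT = "%Y%m%d%H%M"


def parse_timeframe(value):
    """Parse a 12-digit YYYYMMDDHHMM string into a datetime.

    The length check matters: strptime accepts single-digit fields,
    so a shorter string could otherwise be silently misread.
    """
    if len(value) != 12 or not value.isdigit():
        raise ValueError("timeframe must be 12 digits: YYYYMMDDHHMM")
    return datetime.strptime(value, TIMEFRAME_FORMAT)


# The default start date mentioned above, at midnight.
print(parse_timeframe("200603220000"))

# Building a timeframe value from a datetime.
print(datetime(2021, 1, 15, 9, 30).strftime(TIMEFRAME_FORMAT))
```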
gpt-2-simple has a special case for single-column CSVs, where it will automatically process the text for best training and generation (i.e. by adding <|startoftext|> and <|endoftext|> to each tweet, allowing independent generation of tweets).
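The effect of those tokens can be illustrated with a short sketch. This is not gpt-2-simple's actual code, only an approximation of the idea; `wrap_tweets` is a hypothetical helper.

```python
START_TOKEN = "<|startoftext|>"
END_TOKEN = "<|endoftext|>"


def wrap_tweets(tweets):
    """Wrap each tweet in start/end tokens, one tweet per line.

    Illustration only: this approximates the kind of preprocessing
    described above and is not taken from gpt-2-simple.
    """
    return "\n".join(f"{START_TOKEN}{tweet}{END_TOKEN}" for tweet in tweets)


print(wrap_tweets(["First tweet", "Second tweet"]))
```

Because every tweet is delimited, the model can learn where a tweet begins and ends, which is what makes the prefix and truncate settings useful at generation time.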
You can use this Colaboratory notebook (optimized from the original notebook for this use case) to train the model on your downloaded tweets, and generate massive amounts of Tweets from it. Note that without a lot of data, the model might easily overfit; you may want to train for fewer steps (e.g. 500).
When generating, you'll always need to include certain parameters to decode the tweets, e.g.:
gpt2.generate(sess,
length=200,
temperature=0.7,
prefix='<|startoftext|>',
truncate='<|endoftext|>',
include_prefix=False
)
Santiago Delgado (@santiagodc) based on download-tweets-ai-text-gen by @minimaxir
MIT
This repo has no affiliation with Twitter Inc.