sdelgadoc / download-tweets-ai-text-gen-plus

Python script to download public Tweets from a given Twitter account into a format suitable for AI text generation
MIT License

download-tweets-ai-text-gen-plus

A small Python 3 script to download public Tweets from Twitter accounts into a format suitable for AI text generation tools (such as gpt-2-simple for finetuning GPT-2).

You can view examples of AI-generated tweets from datasets retrieved with this tool in the /examples folder.

Setup

First, clone this repository onto your system and install dependencies with the following commands:

git clone https://github.com/sdelgadoc/download-tweets-ai-text-gen-plus.git
cd download-tweets-ai-text-gen-plus
pip3 install -r requirements.txt

Previous versions of this code used scraping libraries to collect tweets. Since then, Twitter has made scraping harder while providing more robust tweet collection APIs. In response, we ported this code to run only with the Twitter API.

To continue the setup, create a Twitter app so you can obtain access to the Twitter API. Once you create an app, generate access tokens, and input them into the section of the keys.py file shown below.

keys = {'consumer_key': "",
        'consumer_secret': "",
        'access_token': "",
        'access_token_secret': ""}

Finally, go to the Twitter API's Dev environments page, generate a Dev environment for the Full Archive API, and input the environment's name into the label section of the keys.py file shown below.

label = ""
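Since an empty credential is an easy mistake to make, a quick check before running the script can help. The sketch below is not part of the repository: missing_credentials is a hypothetical helper that only assumes keys.py defines the keys dictionary and the label string shown above.

```python
# Hypothetical helper, not part of this repository: report which
# settings from keys.py are still empty strings.
REQUIRED_KEYS = ["consumer_key", "consumer_secret",
                 "access_token", "access_token_secret"]


def missing_credentials(keys, label):
    """Return the names of the settings that have not been filled in."""
    missing = [name for name in REQUIRED_KEYS if not keys.get(name)]
    if not label:
        missing.append("label")
    return missing
```

One possible use is to run `from keys import keys, label` in a Python shell from the repository directory and call `missing_credentials(keys, label)`; an empty list means every field has a value.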

Usage

The script runs from the command line. In a terminal, cd into the directory where the script is stored and run:

python3 download_tweets.py <twitter_username> 100

For example, to download 100 tweets (excluding retweets, replies, and quote tweets) from Twitter user @santiagodc, run:

python3 download_tweets.py santiagodc 100

NOTE: The Twitter API's free tier has a collection limit of 5,000 tweets per month, so set a tweet limit to avoid exhausting your quota too quickly.

The script can also download tweets from multiple usernames at once. To do so, first create a text file (.txt) with the list of usernames. Then, run the script, passing the file name instead of a username:

python3 download_tweets.py <twitter_usernames_file_name> 100

The tweets will be downloaded to a single-column CSV titled <usernames>_tweets.csv.
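The layout of the usernames file is not spelled out here, so the following is only a sketch of one plausible convention: one username per line, with an optional leading @ and blank lines ignored. The read_usernames function and the accounts.txt file name are hypothetical and not taken from the script.

```python
import os
import tempfile


def read_usernames(path):
    """Read usernames from a text file, assuming one username per line."""
    usernames = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            # Drop surrounding whitespace and an optional leading "@".
            name = line.strip().lstrip("@")
            if name:
                usernames.append(name)
    return usernames


# Small demonstration with a temporary file (hypothetical file name).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "accounts.txt")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write("santiagodc\n@minimaxir\n\n")
    demo_usernames = read_usernames(demo_path)
```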

The parameters you can pass to the command line interface (positionally or explicitly) are:

How does the sentiment functionality work

The sentiment parameter adds a sentiment category to the tweet text. This information allows the user to train and generate text with different sentiments by changing a parameter.

The output format using the 'simple' text format is the following:

[Sentiment category]
[Tweet text for the tweet that was collected]

The sentiment parameter accepts an integer that specifies the number of sentiment categories that are returned. The sentiment categories for the different possible parameters are the following:

How does the text_format functionality work

The code supports collecting tweets in a format for training an AI that can reply to other tweets. The output format is based on the format used to train the Subreddit Simulator Reddit community.

The output format is the following:

****ARGUMENTS
ORIGINAL or REPLY: Whether the tweet is an original tweet or a reply
SENTIMENT: If the sentiment parameter is used, text describing the tweet text's sentiment
****PARENT
[Tweet text for the topmost tweet in a reply thread]
****IN_REPLY_TO
[Tweet text for the tweet that is being responded to]
****TWEET
[Tweet text for the tweet that was collected]
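To make the layout concrete, here is a minimal sketch that assembles one record with the section markers listed above. It is an illustration rather than the script's own code: build_reply_record is a hypothetical name, and leaving the PARENT and IN_REPLY_TO sections empty for original tweets is an assumption.

```python
def build_reply_record(tweet, parent=None, in_reply_to=None, sentiment=None):
    """Assemble one training record using the ****SECTION markers."""
    # Assumption: a tweet counts as a reply when it answers another tweet.
    kind = "REPLY" if in_reply_to is not None else "ORIGINAL"
    lines = ["****ARGUMENTS", kind]
    if sentiment is not None:
        lines.append(sentiment)
    lines += [
        "****PARENT", parent or "",            # topmost tweet in the thread
        "****IN_REPLY_TO", in_reply_to or "",  # tweet being responded to
        "****TWEET", tweet,                    # the collected tweet itself
    ]
    return "\n".join(lines)


example = build_reply_record(
    tweet="Thanks, that fixed it!",
    parent="Has anyone seen this error before?",
    in_reply_to="Try reinstalling the dependencies.",
)
```

Printing `example` shows the four sections in order, which is the shape a model trained on this format would learn to continue after the ****TWEET marker.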

To collect tweets in this reply format, run the following command:

python3 download_tweets.py <twitter_username> None True False False False 3 reply

How does the timeframe functionality work

If you specify a date, the script downloads tweets from that date to the present. By default, it downloads every tweet from the given user (or users) starting from March 22nd, 2006, the day the first tweet was sent. The timeframe parameter is precise to the minute: it takes a year, month, day, hour, and minute, in that order, in the format YYYYMMDDHHMM.
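As an illustration of the YYYYMMDDHHMM layout, the sketch below converts such a string into a Python datetime using the standard library. The parse_timeframe name is hypothetical, and this is not necessarily how the script itself reads the value.

```python
from datetime import datetime


def parse_timeframe(value):
    """Convert a YYYYMMDDHHMM string into a datetime."""
    # Require exactly 12 digits so that short or padded inputs are rejected.
    if len(value) != 12 or not value.isdigit():
        raise ValueError("timeframe must have the form YYYYMMDDHHMM")
    return datetime.strptime(value, "%Y%m%d%H%M")


# March 22nd, 2006 at midnight, the default start date described above.
default_start = parse_timeframe("200603220000")
```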

How to Train an AI on the downloaded tweets

gpt-2-simple has a special case for single-column CSVs: it automatically processes the text for best training and generation, i.e. by adding <|startoftext|> and <|endoftext|> to each tweet, which allows tweets to be generated independently.
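The effect of that preprocessing can be pictured with a short sketch. This is not gpt-2-simple's implementation, only an illustration of the idea; wrap_tweets is a hypothetical function, and the presence of a header row in the CSV is an assumption.

```python
import csv
import io

START_TOKEN = "<|startoftext|>"
END_TOKEN = "<|endoftext|>"


def wrap_tweets(csv_text, has_header=True):
    """Wrap each row of a single-column CSV in start and end tokens."""
    reader = csv.reader(io.StringIO(csv_text))
    if has_header:
        next(reader, None)  # skip the header row (assumed to exist)
    # Blank lines produce empty rows, which are skipped.
    return [START_TOKEN + row[0] + END_TOKEN for row in reader if row]


sample_csv = 'tweets\nhello world\n"a, b"\n'
wrapped = wrap_tweets(sample_csv)
```

Because every tweet is delimited this way, the model can learn where a tweet begins and ends, which is what makes generating one tweet at a time possible.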

You can use this Colaboratory notebook (optimized from the original notebook for this use case) to train the model on your downloaded tweets, and generate massive amounts of Tweets from it. Note that without a lot of data, the model might easily overfit; you may want to train for fewer steps (e.g. 500).

When generating, you'll always need to include certain parameters to decode the tweets, e.g.:

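import gpt_2_simple as gpt2

# Load the finetuned model (by default from checkpoint/run1).
sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess)

# Generate text and decode it into individual tweets.
gpt2.generate(sess,
              length=200,
              temperature=0.7,
              prefix='<|startoftext|>',
              truncate='<|endoftext|>',
              include_prefix=False
              )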
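To see why those parameters matter, the sketch below mimics on a plain string what prefix, truncate, and include_prefix=False accomplish: the start token is dropped and everything from the first end token onward is cut off. The extract_tweet function is hypothetical and is not gpt-2-simple's code.

```python
def extract_tweet(raw, prefix="<|startoftext|>", truncate="<|endoftext|>"):
    """Return the first tweet contained in a raw generated string."""
    text = raw
    # Drop the start token, as include_prefix=False would.
    if text.startswith(prefix):
        text = text[len(prefix):]
    # Keep only the text before the first end token.
    end = text.find(truncate)
    if end != -1:
        text = text[:end]
    return text.strip()


raw_output = "<|startoftext|>hello world<|endoftext|><|startoftext|>next tweet"
first_tweet = extract_tweet(raw_output)
```

Without this kind of trimming, a single generation would run past the end of one tweet and into the start of the next.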

Helpful Notes

Maintainer

Santiago Delgado (@santiagodc) based on download-tweets-ai-text-gen by @minimaxir

License

MIT

Disclaimer

This repo has no affiliation with Twitter Inc.