sdelgadoc / download-tweets-ai-text-gen-plus

Python script to download public Tweets from a given Twitter account into a format suitable for AI text generation
MIT License

Skipping a block of tweets when using download_tweets.py #11

Open dafunction opened 3 years ago

dafunction commented 3 years ago

Hello!

Firstly, thanks for sharing the code for this tool! I'm using it as a first project to play with GPT-2 and machine learning.

This "issue" is actually more of a question. As you mentioned in the README, Twitter's free tier has a collection limit of 5,000 tweets. Since this project is for educational purposes only, I'd rather not pay for the premium tier, so I'm planning to wait until my collection limit resets next month and then collect more tweets from a particular user to train the model. In my case, I collected 100 tweets as a test, then collected 4,900 more after the test was successful.

Getting to the question: using the script as is, is it possible to skip the block of tweets I've already collected and, once my limit has reset, collect the next block within the new limit? Scanning over the script, there don't appear to be any parameters defined for this.

Perhaps I could bump the fromDate in lines 123-132 up to the date of the last tweet collected in the first block? There will probably be some overlap, but I'd guess that would work okay.

    if limit is not None:
        cursor = tweepy.Cursor(api.search_full_archive, 
                               environment_name=environment_name,
                               query = "from:" + username,
                               fromDate="200603220000").items(limit)
    else:
        cursor = tweepy.Cursor(api.search_full_archive,
                               environment_name=environment_name,
                               query = "from:" + username,
                               fromDate="200603220000").items()

Thanks in advance for your response.

sdelgadoc commented 3 years ago

I'm glad to hear that the code is working for you! As a developer, one always wonders if all those clones are leading to use, or to cursing when things don't work.

You are right that the code, as is, can't limit the collected tweets by date. You also correctly identified the part of the code that could be modified to limit collected tweets by date.

However, it will require adding another parameter, named toDate, to the tweepy.Cursor call. The code collects tweets from most recent to oldest, so if you change only the fromDate, it will still start collecting from the newest tweet rather than from where you left off.

To get the behavior you're looking for, you need to add the toDate parameter as shown below, and set it to the date of the last (oldest) tweet you collected, in the same YYYYMMDDHHmm format that fromDate uses.

    if limit is not None:
        cursor = tweepy.Cursor(api.search_full_archive,
                               environment_name=environment_name,
                               query = "from:" + username,
                               fromDate="200603220000",
                               toDate="[DATE_OF_LAST_TWEET]").items(limit)
    else:
        cursor = tweepy.Cursor(api.search_full_archive,
                               environment_name=environment_name,
                               query = "from:" + username,
                               fromDate="200603220000",
                               toDate="[DATE_OF_LAST_TWEET]").items()
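
If it helps, here is a rough sketch of how the [DATE_OF_LAST_TWEET] value could be produced from the creation time of the oldest tweet you already have. The helper name to_search_timestamp and the pad_minutes argument are made up for illustration and are not part of the script. It assumes the timestamp is in UTC and that, as I understand the API, toDate has minute granularity and excludes the minute you pass, which is why it pads by one minute and accepts a little overlap.

```python
from datetime import datetime, timedelta, timezone


def to_search_timestamp(created_at, pad_minutes=1):
    """Format a tweet's creation time as a YYYYMMDDHHmm string (UTC).

    Hypothetical helper, not part of download_tweets.py.
    """
    # Assumption: a datetime without timezone info is already in UTC.
    if created_at.tzinfo is None:
        created_at = created_at.replace(tzinfo=timezone.utc)
    # Pad forward so tweets posted in the same minute are not skipped;
    # this may re-collect a few tweets, which can be de-duplicated later.
    padded = created_at.astimezone(timezone.utc) + timedelta(minutes=pad_minutes)
    return padded.strftime("%Y%m%d%H%M")


# Example: creation time of the oldest tweet from the first block.
oldest_collected = datetime(2020, 5, 17, 14, 3, 59)
print(to_search_timestamp(oldest_collected))
```

The resulting string would then go in place of [DATE_OF_LAST_TWEET].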

Although the answer above should resolve your issue, if you're feeling generous, you could make the toDate value a parameter of the script by passing it up through the download_account_tweets and download_tweets functions. :-) If you do, please submit a pull request; I'd be happy to add it to the code base.
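
For anyone picking this up, one possible shape for that change is sketched below. It is only an illustration: build_search_kwargs and its to_date argument are hypothetical names, and the real function signatures in the script may call for something different. The idea is to build the keyword arguments once and include toDate only when a value was passed in, so the default behavior stays the same.

```python
def build_search_kwargs(username, environment_name, to_date=None,
                        from_date="200603220000"):
    """Assemble keyword arguments for the full-archive search call.

    Hypothetical helper; to_date is a YYYYMMDDHHmm string or None.
    """
    kwargs = {
        "environment_name": environment_name,
        "query": "from:" + username,
        "fromDate": from_date,
    }
    # Only send toDate when the caller asked for an upper bound.
    if to_date is not None:
        kwargs["toDate"] = to_date
    return kwargs


# Intended use inside the script (requires tweepy and an authenticated api):
#   cursor = tweepy.Cursor(api.search_full_archive,
#                          **build_search_kwargs(username, environment_name, to_date))
#   tweets = cursor.items(limit) if limit is not None else cursor.items()
print(build_search_kwargs("some_user", "dev", to_date="202005171404"))
```

This also collapses the two nearly identical branches into one call, with the limit handled separately.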