minimaxir / download-tweets-ai-text-gen

Python script to download public Tweets from a given Twitter account into a format suitable for AI text generation.
MIT License
221 stars, 41 forks

Difficult time downloading all tweets #17

Open dmccaffrey12 opened 4 years ago

dmccaffrey12 commented 4 years ago

Downloading tweets always seems to get stuck somewhere. Even an account that had 620 tweets seemed to stop each time at 45%. But no other issue beyond that. All the tweets that get downloaded show up appropriately in the .csv.

Oldest Tweet: 2014-07-04 13:38:55: 38%|███████████████▌ | 14300/37538 [29:16<43:22, 8.93it/s]
Expecting value: line 1 column 1 (char 0) [x] run.Feed
[!] if get this error but you know for sure that more tweets exist, please open an issue and we will investigate it!
Expecting value: line 1 column 1 (char 0) [x] run.Feed
[!] if get this error but you know for sure that more tweets exist, please open an issue and we will investigate it!
Expecting value: line 1 column 1 (char 0) [x] run.Feed
[!] if get this error but you know for sure that more tweets exist, please open an issue and we will investigate it!
Expecting value: line 1 column 1 (char 0) [x] run.Feed
[!] if get this error but you know for sure that more tweets exist, please open an issue and we will investigate it!
Oldest Tweet: 2014-07-04 13:38:55: 38%|███████████████▌ | 14300/37538 [29:40<48:12, 8.03it/s]

dmccaffrey12 commented 4 years ago

Part of me wonders if it has to do with CPU processing power and/or internet connection.

itasli commented 4 years ago

Yup I have the same problem

sdelgadoc commented 4 years ago

Although I don't have hard data behind this, I believe the issue you are seeing is due to Twitter's throttling against scraping. I have gotten these errors when trying to get a large number of tweets, or when running two instances of the script at the same time.

Two ways to reduce these errors are:

  1. Wait at least 24 hours before you run the script again. The Twitter throttling doesn't seem to be permanent, and pausing for some time appears to reduce it.
  2. Increase the sleep time on failure in the code to reduce how often you're hitting Twitter. In line 153 of the code, change sleep(15.0) to sleep(60.0).

Let me know if that helps.
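
For illustration only, here is a minimal sketch of the "sleep longer on failure" idea, written as a generic retry helper. It is not the script's actual code: `fetch_with_backoff`, its parameters, and the doubling delay are all assumptions made for this example, and `fetch` stands in for whatever call actually requests a page of tweets.

```python
import time
from typing import Callable, Optional


def fetch_with_backoff(
    fetch: Callable[[], str],
    max_retries: int = 5,
    base_sleep: float = 15.0,
    max_sleep: float = 240.0,
    sleep_fn: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call fetch(); after each failure wait, doubling the wait up to max_sleep.

    Hypothetical helper: `fetch` is a placeholder for the real request.
    Returns None if every attempt fails.
    """
    delay = base_sleep
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception as exc:
            print(f"Attempt {attempt + 1} failed ({exc})")
            if attempt == max_retries - 1:
                break  # no point sleeping after the final attempt
            sleep_fn(delay)
            delay = min(delay * 2, max_sleep)
    return None
```

Starting at 15 seconds and doubling means a short hiccup costs little time, while repeated failures back off toward the longer waits suggested above. Whether this actually gets past Twitter's throttling is untested.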

flseaui commented 4 years ago

I'm having this same issue. I've been trying to download one account's tweets for a few days now and have had no luck. Most times I only get to 7220 tweets before it errors out, but at one point I got up to 38k tweets. I've tried increasing the sleep-on-failure time to 60.0, then to 120.0, and it hasn't helped. I've also tried using a VPN in case it was an IP limit or something, but that didn't help either.

Any ideas?

sdelgadoc commented 4 years ago

Two quick answers to your questions. First, 38K tweets should land you in the 3MB+ file size, which should be enough to build a decent model. So, if you still have that file, you can start using it for training.

Second, I've had the script take many hours to load tweets, but I've never had it error out. What error are you getting?

flseaui commented 4 years ago

I don't have that file anymore and haven't got anything close to it since then. I get this error a few times at exactly 7220 tweets then it ends. It has occasionally gotten past this and gone for a few thousand more tweets but not frequently. I've tested on multiple accounts and gotten the same results.

image

sdelgadoc commented 4 years ago

Thanks for the additional detail. The errors you are getting are most likely due to Twitter's throttling. If you are getting 7,220 tweets consistently even after increasing the sleep time on failure, it is likely that is as many as you are going to get.

A work-around to get more tweets is to pull from multiple accounts that are similar to the one you originally wanted to target. This script has been updated with the ability to automate pulling from multiple accounts by passing the name of a text file (.txt) with Twitter accounts as the username.
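
As a rough sketch of how a username argument could double as a list of accounts, the snippet below treats any argument ending in .txt as a file of handles. This is an assumption-laden illustration, not the script's real implementation: `resolve_usernames` is a hypothetical name, and the one-handle-per-line file layout is assumed rather than confirmed.

```python
from pathlib import Path
from typing import List


def resolve_usernames(username: str) -> List[str]:
    """Return account names from a single handle or from a .txt file.

    Assumed file layout (not confirmed by the script): one handle per
    line, with or without a leading "@"; blank lines are ignored.
    """
    if username.lower().endswith(".txt"):
        lines = Path(username).read_text(encoding="utf-8").splitlines()
        return [line.strip().lstrip("@") for line in lines if line.strip()]
    return [username.lstrip("@")]
```

With a file accounts.txt containing a few similar accounts, the returned list could then be looped over, downloading each account in turn.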

flseaui commented 4 years ago

Yeah, that's what I thought too, but for whatever reason my second attempt this morning got to 82%. I checked, and I think that number is off, because it got every tweet. Regardless, it's more than enough.

ezeugorobot commented 4 years ago

I'm having the same problem. I get stuck at 680 tweets, even though I increased the sleep time.

sdelgadoc commented 4 years ago

Only collecting 680 tweets seems to be on the low side. Let me know what account you're trying to collect and I can test on my side.

ezeugorobot commented 4 years ago

@naval is another account where I don't get many tweets. It stops at 960. https://github.com/sdelgadoc/download-tweets-ai-text-gen-plus/blob/master/download_tweets.py

sdelgadoc commented 4 years ago

I wasn't able to reproduce the issue per the output below. I was able to download 7,000+ tweets at about 3 tweets/second, which is not bad.

ubuntu:~/environment/download-tweets-ai-text-gen-plus (master) $ python3 download_tweets.py naval
Retrieving tweets for @naval...
Oldest Tweet: 2019-05-27 07:52:04: : 7040it [34:26, 3.37it/s]

I'll walk you through my steps.

  1. Created a new AWS instance
  2. Cloned the download-tweets-ai-text-gen-plus repo, which is similar to this one but has more functionality and bug fixes; that shouldn't make a difference for this issue
  3. Upgraded pip3 with pip3 install --upgrade pip
  4. Installed requirements with pip3 install -r requirements.txt
  5. Ran the script with python3 download_tweets.py naval

Want to follow the steps above and let me know if you still see the issue?

ezeugorobot commented 4 years ago

Thanks, this worked well. I was able to get a lot of tweets for Naval. I think the issue with some of the other accounts I tried was that the account was once public, then went private, then became public again. The tweets posted while the account was private seem to be where it always stops, even though the account is now public.

Is it possible to start grabbing tweets from a specific date, instead of the earliest date? That way I can just start getting tweets from before the account went private.

sdelgadoc commented 4 years ago

Unfortunately, both scripts only support scraping tweets starting today.