twintproject / twint

An advanced Twitter scraping & OSINT tool written in Python that doesn't use Twitter's API, allowing you to scrape a user's followers, following, Tweets and more while evading most API limitations.
MIT License
15.68k stars, 2.71k forks

Scraping Stopping at between 8k and 12k tweets #67

Closed bsodxp closed 6 years ago

bsodxp commented 6 years ago

First of all, thank you for such a great tool!

Description of Issue

I am scraping the entire timeline of a user w/ 150K+ tweets. The output runs fine for 8-12k tweets (usually about 30 days), and then stops. I have seen previous reports of scraping stopping, but I'm not sure if they are related. I have been adding the --until flag and working through approx. 30-60 days at a time. Is this a bug, user error, or a Twitter limitation? Is there an easy way to batch commands together and run in chunks w/ --since and --until boundaries?

OS Details

Ubuntu / Buscador OSINT

Command

python3 twint.py -u username -o timeline.csv --csv

Once it fails, I then add a --until flag: python3 twint.py -u username --until 2017-06-01 -o timeline.csv --csv
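
The chunked runs asked about above could be scripted roughly as follows. This is only a sketch, not part of twint: the helper names `date_windows` and `build_commands`, the 30-day window size, and the per-window output file names are all assumptions made for illustration. It uses only the flags that appear in this thread (-u, --since, --until, -o, --csv). Adjacent windows share a boundary date, so a few duplicate tweets around each boundary are possible, depending on how --until treats its end date.

```python
from datetime import date, timedelta


def date_windows(start, end, days=30):
    """Yield (since, until) pairs covering start..end in steps of `days`."""
    step = timedelta(days=days)
    since = start
    while since < end:
        until = min(since + step, end)  # clamp the final, shorter window
        yield since, until
        since = until


def build_commands(username, start, end, days=30):
    """Build one twint argument list per window (hypothetical file naming)."""
    commands = []
    for since, until in date_windows(start, end, days):
        out = f"timeline_{since.isoformat()}_{until.isoformat()}.csv"
        commands.append([
            "python3", "twint.py", "-u", username,
            "--since", since.isoformat(), "--until", until.isoformat(),
            "-o", out, "--csv",
        ])
    return commands


if __name__ == "__main__":
    for cmd in build_commands("username", date(2017, 1, 1), date(2017, 6, 1)):
        # Print the command; to execute it instead, pass `cmd` to
        # subprocess.run(cmd, check=True).
        print(" ".join(cmd))
```

Writing each window to its own file means a run that stops partway only costs one window, which can then be re-run on its own.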

Thank you!

pielco11 commented 6 years ago

Hmm, it seems that even with the mitigation applied for #66 adapted, twint filters out some tweets. So I removed it, and now it works. Unfortunately the problem is on Twitter's side; the mitigation (i.e. the date check) will be moved somewhere else.

This means that someone searching with since:2017-10-02 will also get some tweets from 2017-10-01 (not many).
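
Until the date check is back in place, the stray earlier tweets can be dropped after scraping. The snippet below is a minimal sketch of such a check, not twint's actual code: the tweet records are made-up dictionaries with a hypothetical `date` field, and `drop_before_since` is an illustrative name.

```python
from datetime import date


def drop_before_since(tweets, since):
    """Keep only tweets dated on or after `since`."""
    return [t for t in tweets if t["date"] >= since]


# Made-up sample records; real output would need its date column parsed first.
tweets = [
    {"id": 1, "date": date(2017, 10, 1)},  # returned although since is 10-02
    {"id": 2, "date": date(2017, 10, 2)},
    {"id": 3, "date": date(2017, 10, 3)},
]

kept = drop_before_since(tweets, date(2017, 10, 2))
print([t["id"] for t in kept])  # [2, 3]
```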

pielco11 commented 6 years ago

For a more detailed explanation of the problem see #66

pielco11 commented 6 years ago

@bsodxp now twint should work as expected

haccer commented 6 years ago

I am scraping an entire timeline of a user w/ 150K+ tweets. The output runs fine for 8-12k tweets (usually about 30 days), and then stops.

I'm starting to think this isn't an issue w/ the script, but more likely a connection issue on the user's end. I just scraped someone's feed w/ 150k tweets and had no issues.

pielco11 commented 6 years ago

There was a problem with the script when the difference between until and since was less than the default timedelta. There is also a problem on Twitter's side with the search: Twitter returns tweets from before the since date.

If the issue were with the connection, I'd expect timeout errors or something similar, but none were reported. I was able to reproduce the conditions under which the issue is expected to happen. Now it works as expected, at least on my end.

haccer commented 6 years ago

Well, I'm now starting to think that issues #60 & #48 are also connection issues (I think with the try/excepts in place, a timeout error wouldn't show up.)

I'm scraping #metoo right now, and was already able to scrape past a few months so far within a few minutes.... going to see how long this goes

pielco11 commented 6 years ago

Yes, you are right... maybe we should try handling connection issues (maybe ClientConnectionError) and see what happens
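
Handling connection errors explicitly, rather than letting a try/except swallow them, might look something like the sketch below. It is an assumption-laden illustration, not twint's code: `fetch_with_retry` and `flaky_fetch` are hypothetical names, and the demo uses Python's built-in `ConnectionError` so it runs without extra dependencies. If the requests go through aiohttp, which is where the `ClientConnectionError` mentioned above presumably comes from, that exception class could be passed via `retry_on` instead.

```python
import asyncio


async def fetch_with_retry(fetch, retries=3, delay=1.0,
                           retry_on=(ConnectionError,)):
    """Await fetch(); on a connection error, wait and try again."""
    for attempt in range(1, retries + 1):
        try:
            return await fetch()
        except retry_on as exc:
            if attempt == retries:
                raise  # surface the error instead of stopping silently
            print(f"connection problem ({exc!r}), retrying")
            await asyncio.sleep(delay * attempt)  # wait longer each time


# Demo: a fake request that fails twice, then succeeds.
calls = {"n": 0}


async def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated drop")
    return "page of tweets"


result = asyncio.run(fetch_with_retry(flaky_fetch, delay=0.01))
print(result)
```

Re-raising after the last attempt makes a dropped connection visible in the output, which would help tell connection issues apart from the date-window problem discussed earlier.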