taspinar / twitterscraper

Scrape Twitter for Tweets
MIT License

Loading same tweets over and over #149

Open cdsimmons opened 6 years ago

cdsimmons commented 6 years ago

I left the scraper running for the night, came back to 1.5GB of tweet data. I believe at least 99% of those tweets are duplicates.

This is my command line query:

twitterscraper 'Salesforce' --begindate 2017-10-24 --enddate 2017-11-24 --lang en --poolsize 1 --output tweets-CRM.json

This is me logging out the URLs:

INFO: queries: ['Salesforce since:2017-10-24 until:2017-11-24']
INFO: Querying Salesforce since:2017-10-24 until:2017-11-24
INFO: URL... https://twitter.com/search?f=tweets&vertical=default&q=Salesforce%20since%3A2017-10-24%20until%3A2017-11-24&l=en
INFO: URL... https://twitter.com/i/search/timeline?f=tweets&vertical=default&include_available_features=1&include_entities=1&reset_error_state=false&src=typd&max_position=TWEET-933840103071469573-933845625455734784&q=Salesforce%20since%3A2017-10-24%20until%3A2017-11-24&l=en
INFO: URL... https://twitter.com/i/search/timeline?f=tweets&vertical=default&include_available_features=1&include_entities=1&reset_error_state=false&src=typd&max_position=cm+55m-JXXIEJXDaFEbEvIFvX-JXXIvsFEsvssbXvbIv&q=Salesforce%20since%3A2017-10-24%20until%3A2017-11-24&l=en
INFO: URL... https://twitter.com/i/search/timeline?f=tweets&vertical=default&include_available_features=1&include_entities=1&reset_error_state=false&src=typd&max_position=cm+55m-JXXIvDaDXDbavFJsbX-JXXIvsFEsvssbXvbIv&q=Salesforce%20since%3A2017-10-24%20until%3A2017-11-24&l=en
INFO: URL... https://twitter.com/i/search/timeline?f=tweets&vertical=default&include_available_features=1&include_entities=1&reset_error_state=false&src=typd&max_position=cm+55m-JXXIvDaDXDbavFJsbX-JXXIvsFEsvssbXvbIv&q=Salesforce%20since%3A2017-10-24%20until%3A2017-11-24&l=en
INFO: URL... https://twitter.com/i/search/timeline?f=tweets&vertical=default&include_available_features=1&include_entities=1&reset_error_state=false&src=typd&max_position=cm+55m-JXXIvDaDXDbavFJsbX-JXXIvsFEsvssbXvbIv&q=Salesforce%20since%3A2017-10-24%20until%3A2017-11-24&l=en
INFO: URL... https://twitter.com/i/search/timeline?f=tweets&vertical=default&include_available_features=1&include_entities=1&reset_error_state=false&src=typd&max_position=cm+55m-JXXIvDaDXDbavFJsbX-JXXIvsFEsvssbXvbIv&q=Salesforce%20since%3A2017-10-24%20until%3A2017-11-24&l=en

You can see it starts off okay: the first request is fine, but at some point max_position stays the same for every subsequent request. I'm seeing this on GetOldTweets too (another Twitter scraper).

Has Twitter caught on and made automation more complicated perhaps?
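
For anyone hitting this before a fix lands, one way to avoid filling the disk is to stop paginating as soon as the cursor repeats. This is only a rough sketch and not twitterscraper's actual code: `paginate` and `fetch_page` are hypothetical names, and `fetch_page(pos)` is assumed to return a list of items plus the next max_position value.

```python
def paginate(fetch_page, first_pos=None, max_pages=1000):
    """Yield items page by page, stopping once the cursor stops advancing.

    fetch_page is a hypothetical callable: fetch_page(pos) -> (items, next_pos).
    """
    seen = set()
    pos = first_pos
    for _ in range(max_pages):
        items, next_pos = fetch_page(pos)
        yield from items
        # A missing or repeated cursor means further requests would only
        # return the same page again, so stop here.
        if next_pos is None or next_pos in seen:
            break
        seen.add(next_pos)
        pos = next_pos


# Fake pages that get stuck on cursor "p2", like the log above.
fake_pages = {
    None: (["tweet-a"], "p1"),
    "p1": (["tweet-b"], "p2"),
    "p2": (["tweet-c"], "p2"),
}
collected = list(paginate(lambda pos: fake_pages[pos]))
print(collected)
```

This does not fix the underlying problem, it only keeps a stuck run from writing the same page forever.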

haripadbharti1 commented 6 years ago

I am facing the same issue: 3.9 GB of data that amounts to a few KB of tweets repeated over and over.

Camilo-Lesmes commented 6 years ago

Still the same problem. I searched several times and could not find anything.

Yaolinwang commented 6 years ago

Same issue. I tried different queries and got 10 GB of data with ~20 unique tweets.

taspinar commented 6 years ago

I believe this issue has been fixed in version 0.9.3 with samirchar's fix https://github.com/taspinar/twitterscraper/pull/151

To test this, I printed the 'pos' argument (used to retrieve new tweets) to the screen with both version 0.9.0 and version 0.9.3:

With 0.9.0 it looks like:

(base) C:\Users\ataspinar\Documents>twitterscraper 'Salesforce' --begindate 2017-10-24 --enddate 2017-11-24 --lang en --poolsize 1 --output tweets-CRM.json
INFO: queries: ["'Salesforce' since:2017-10-24 until:2017-11-24"]
INFO: Querying 'Salesforce' since:2017-10-24 until:2017-11-24
TWEET-933840103071469573-933845625455734784
cm+55m-JXXIEJXDaFEbEvIFvX-JXXIvsFEsvssbXvbIv
cm+55m-JXXIvDaDXDbavFJsbX-JXXIvsFEsvssbXvbIv
cm+55m-JXXIvDaDXDbavFJsbX-JXXIvsFEsvssbXvbIv
cm+55m-JXXIvDaDXDbavFJsbX-JXXIvsFEsvssbXvbIv
cm+55m-JXXIvDaDXDbavFJsbX-JXXIvsFEsvssbXvbIv
cm+55m-JXXIvDaDXDbavFJsbX-JXXIvsFEsvssbXvbIv
cm+55m-JXXIvDaDXDbavFJsbX-JXXIvsFEsvssbXvbIv
INFO: Program interrupted by user. Returning tweets gathered so far...
INFO: Got 160 tweets for 'Salesforce'%20since%3A2017-10-24%20until%3A2017-11-24.
INFO: Program interrupted by user. Returning all tweets gathered so far.

and with version 0.9.3:

(base) C:\Users\ataspinar\Documents>twitterscraper 'Salesforce' --begindate 2017-10-24 --enddate 2017-11-24 --lang en --poolsize 1 --output tweets-CRM.json
INFO: queries: ["'Salesforce' since:2017-10-24 until:2017-11-24"]
INFO: Querying 'Salesforce' since:2017-10-24 until:2017-11-24
TWEET-933840103071469573-933845625455734784
cm%2B55m-JXXIEJXDaFEbEvIFvX-JXXIvsFEsvssbXvbIv
cm%2B55m-JXXIEXvsFJbXFDFJaI-JXXIvsFEsvssbXvbIv
cm%2B55m-JXXIabIDIEIFXXJDbX-JXXIvsFEsvssbXvbIv
cm%2B55m-JXXIaXaIbIDbsXaDaa-JXXIvsFEsvssbXvbIv
cm%2B55m-JXXIDJbsFbsXvIsIEv-JXXIvsFEsvssbXvbIv
cm%2B55m-JXXIDsIbIEXIFbJDvX-JXXIvsFEsvssbXvbIv
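
The visible difference between the two runs is that the '+' in the cursor is now percent-encoded as '%2B'. Here is a minimal sketch of why that could matter, using only Python's standard library. It assumes the server decodes the query string the way form decoding normally does, where a literal '+' is read as a space; how Twitter actually parses it is not confirmed here.

```python
from urllib.parse import quote, parse_qs

# Cursor value taken from the 0.9.0 output above.
pos = "cm+55m-JXXIvDaDXDbavFJsbX-JXXIvsFEsvssbXvbIv"

raw_query = "max_position=" + pos             # '+' sent as-is
encoded_query = "max_position=" + quote(pos)  # '+' sent as '%2B'

# Decode both the way a form-style query parser would.
raw_decoded = parse_qs(raw_query)["max_position"][0]
encoded_decoded = parse_qs(encoded_query)["max_position"][0]

print(repr(raw_decoded))      # the '+' comes back as a space
print(repr(encoded_decoded))  # the original cursor survives the round trip
```

Under that assumption the unencoded cursor reaches the server as a different string than the one it handed out, which would be consistent with the server returning the same page again.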