cdsimmons opened this issue 6 years ago
I am facing the same issue: 3.9 GB of data that amounts to only a few KB of unique content, repeated over and over.
Same problem here, and it continues. I have searched several times and could not find anything.
Same issue. I tried different queries and got 10 GB of data with only ~20 unique tweets.
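As a stopgap on an affected version, duplicates can at least be removed after the fact. Below is a minimal sketch, assuming the dump is a JSON array of tweet objects with an "id" field (the file names are illustrative):

```python
# Minimal sketch: deduplicate a twitterscraper JSON dump by tweet ID.
# Assumes the dump is a JSON array of tweet objects with an "id" field;
# file names are illustrative.
import json

with open("tweets-CRM.json", encoding="utf-8") as f:
    tweets = json.load(f)

seen, unique = set(), []
for tweet in tweets:
    if tweet["id"] not in seen:
        seen.add(tweet["id"])
        unique.append(tweet)

print(f"{len(tweets)} tweets in dump, {len(unique)} unique")

with open("tweets-CRM-deduped.json", "w", encoding="utf-8") as f:
    json.dump(unique, f)
```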
I believe this issue has been fixed in version 0.9.3 with samirchar's fix: https://github.com/taspinar/twitterscraper/pull/151
To test this, I printed the 'pos' argument (used to retrieve the next batch of tweets) to the screen with both version 0.9.0 and version 0.9.3.
With 0.9.0 it looks like:
(base) C:\Users\ataspinar\Documents>twitterscraper 'Salesforce' --begindate 2017-10-24 --enddate 2017-11-24 --lang en --poolsize 1 --output tweets-CRM.json
INFO: queries: ["'Salesforce' since:2017-10-24 until:2017-11-24"]
INFO: Querying 'Salesforce' since:2017-10-24 until:2017-11-24
TWEET-933840103071469573-933845625455734784
cm+55m-JXXIEJXDaFEbEvIFvX-JXXIvsFEsvssbXvbIv
cm+55m-JXXIvDaDXDbavFJsbX-JXXIvsFEsvssbXvbIv
cm+55m-JXXIvDaDXDbavFJsbX-JXXIvsFEsvssbXvbIv
cm+55m-JXXIvDaDXDbavFJsbX-JXXIvsFEsvssbXvbIv
cm+55m-JXXIvDaDXDbavFJsbX-JXXIvsFEsvssbXvbIv
cm+55m-JXXIvDaDXDbavFJsbX-JXXIvsFEsvssbXvbIv
cm+55m-JXXIvDaDXDbavFJsbX-JXXIvsFEsvssbXvbIv
INFO: Program interrupted by user. Returning tweets gathered so far...
INFO: Got 160 tweets for 'Salesforce'%20since%3A2017-10-24%20until%3A2017-11-24.
INFO: Program interrupted by user. Returning all tweets gathered so far.
and with version 0.9.3:
(base) C:\Users\ataspinar\Documents>twitterscraper 'Salesforce' --begindate 2017-10-24 --enddate 2017-11-24 --lang en --poolsize 1 --output tweets-CRM.json
INFO: queries: ["'Salesforce' since:2017-10-24 until:2017-11-24"]
INFO: Querying 'Salesforce' since:2017-10-24 until:2017-11-24
TWEET-933840103071469573-933845625455734784
cm%2B55m-JXXIEJXDaFEbEvIFvX-JXXIvsFEsvssbXvbIv
cm%2B55m-JXXIEXvsFJbXFDFJaI-JXXIvsFEsvssbXvbIv
cm%2B55m-JXXIabIDIEIFXXJDbX-JXXIvsFEsvssbXvbIv
cm%2B55m-JXXIaXaIbIDbsXaDaa-JXXIvsFEsvssbXvbIv
cm%2B55m-JXXIDJbsFbsXvIsIEv-JXXIvsFEsvssbXvbIv
cm%2B55m-JXXIDsIbIEXIFbJDvX-JXXIvsFEsvssbXvbIv
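Comparing the two logs, the visible difference is how the position token is encoded: 0.9.0 sends it with a literal '+' (cm+55m-...) and it never advances, while 0.9.3 sends it percent-encoded (cm%2B55m-...) and it advances normally. Since a literal '+' in a URL query string is decoded as a space, the unencoded token is presumably mangled server-side, and Twitter keeps answering with the same first page. A quick Python check of the encoding:

```python
# A literal '+' in a query string decodes as a space, so the raw 0.9.0
# token is likely corrupted server-side; percent-encoding preserves it,
# matching the token printed by 0.9.3 above.
from urllib.parse import quote, unquote_plus

pos = "cm+55m-JXXIEJXDaFEbEvIFvX-JXXIvsFEsvssbXvbIv"
print(quote(pos))         # cm%2B55m-JXXIEJXDaFEbEvIFvX-... (as sent by 0.9.3)
print(unquote_plus(pos))  # cm 55m-... (how a server decodes the raw token)
```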
I left the scraper running overnight and came back to 1.5 GB of tweet data. I believe at least 99% of those tweets are duplicates.
This is my command line query:
This is me logging out the URLs:
You can see it starts off okay: the first request is fine, but at some point the max_position remains the same for every subsequent request. I'm seeing this with GetOldTweets too (another Twitter scraper).
Has Twitter caught on and made automation more complicated perhaps?
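Whatever changed on Twitter's side, a scraper can at least defend itself client-side by treating a non-advancing position as the end of results instead of looping. A minimal sketch, where fetch_page is a hypothetical stand-in for a function that performs one search request and returns (tweets, next_position):

```python
# Minimal sketch of a defensive paging loop: stop when the position token
# stops advancing rather than re-downloading the same page forever.
# fetch_page is a hypothetical stand-in, not part of twitterscraper's API.
def scrape_with_guard(query, fetch_page, max_pages=1000):
    tweets, pos = [], None
    for _ in range(max_pages):
        batch, new_pos = fetch_page(query, pos)
        tweets.extend(batch)
        if not new_pos or new_pos == pos:
            # Same max_position as the previous request: bail out.
            print(f"Position stopped advancing at {pos!r}; stopping.")
            break
        pos = new_pos
    return tweets
```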