twintproject / twint

An advanced Twitter scraping & OSINT tool written in Python that doesn't use Twitter's API, allowing you to scrape a user's followers, following, Tweets and more while evading most API limitations.
MIT License

Missing tweets #474

Closed ghost closed 5 years ago

ghost commented 5 years ago

Shorter version (tl;dr)

twint returns far fewer tweets than the number shown on a user's Twitter profile page.

Example with @BouloGiletJaune:

Longer version

Description

On Fedora 29 (Linux), using the latest twint (1.2.3) with Python 3.7 in a brand-new virtual environment, the following command returns only about 60% of the available tweets:

$ twint --retweets -u BouloGiletJaune | wc -l
171

However, this particular Twitter user has posted 278+ tweets (https://twitter.com/BouloGiletJaune?lang=en). Tweets are missing for every Twitter account I have tried.

Using Profile_full=True

With the Profile() function and Profile_full=True, twint returns more tweets (272) but still not all 278, and it's slow as hell:

$ time twint --profile-full --retweets -u BouloGiletJaune | wc -l
CRITICAL:root:twint.feed:Mobile:list index out of range
CRITICAL:root:twint.feed:Mobile:list index out of range
272

real    3m9.178s
user    1m11.755s
sys 0m1.081s

Profile_full is not a solution

Missing 6 tweets doesn't seem like a big issue, but with an account that has many more tweets (35k), --profile-full still misses about 2,000 tweets, not to mention the hours it takes to complete. So it's definitely not a viable workaround.
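To put the numbers above side by side, here is a minimal sketch that computes the share of tweets returned in each run. The `coverage` helper and the `runs` table are hypothetical names used only for illustration, and the figures for the 35k account are the approximate ones quoted above.

```python
def coverage(returned: int, expected: int) -> float:
    """Return the share of expected tweets actually returned, in percent."""
    if expected <= 0:
        raise ValueError("expected must be positive")
    return 100.0 * returned / expected


# Figures reported in this issue; the last entry uses the approximate
# numbers quoted for the larger account (~35k tweets, ~2,000 missed).
runs = {
    "--retweets": (171, 278),
    "--profile-full --retweets": (272, 278),
    "--profile-full (35k account, approx.)": (35_000 - 2_000, 35_000),
}

for label, (returned, expected) in runs.items():
    missing = expected - returned
    print(f"{label}: {coverage(returned, expected):.1f}% returned, {missing} missing")
```

The relative loss with --profile-full is small on the short timeline, but in absolute terms it grows with the size of the account.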

The --all option doesn't seem to work

Note: the --all command-line option is supposed to return *all* tweets associated with a user, but it doesn't seem to work:

(virtualenv) $ twint --all BouloGiletJaune
[-] Error: Please use at least -u, -s, -g or --near.

Your help is much appreciated

So, can you please let me know what I'm doing wrong or if you spot a problem? Maybe it's some known limitation?

Thank you.

Technical details

Installation

I've installed twint using this command from within the python3 virtual environment:

(virtualenv) $ pip3 install -e 'git+https://github.com/twintproject/twint.git@origin/master#egg=twint'

Bug signature

The number of tweets returned is too small:

$ time twint --retweets -u BouloGiletJaune | wc -l
171

real    0m15.453s
user    0m8.933s
sys 0m0.447s

pip version

(virtualenv) $ pip --version
pip 19.1.1 from /home/user/git/project/virtualenv/lib/python3.7/site-packages/pip (python 3.7)

python version

Fedora's stock version of Python is used. As with any virtualenv, the binaries are automatically copied into the virtual environment.

(virtualenv) $ python -VV
Python 3.7.3 (default, May 11 2019, 00:45:16) 
[GCC 8.3.1 20190223 (Red Hat 8.3.1-2)]

pip packages installed

Only twint has a local path, because it was installed using git (see above), which is the recommended way to install the latest version.

(virtualenv) $ pip list installed
Package         Version Location                        
--------------- ------- --------------------------------
aiodns          2.0.0   
aiohttp         3.5.4   
aiohttp-socks   0.2.2   
async-timeout   3.0.1   
attrs           19.1.0  
beautifulsoup4  4.7.1   
cchardet        2.1.4   
cffi            1.12.3  
chardet         3.0.4   
elasticsearch   7.0.2   
fake-useragent  0.1.11  
geographiclib   1.49    
geopy           1.20.0  
idna            2.8     
multidict       4.5.2   
numpy           1.16.4  
pandas          0.24.2  
pip             19.1.1  
pycares         3.0.0   
pycparser       2.19    
PySocks         1.7.0   
python-dateutil 2.8.0   
pytz            2019.1  
schedule        0.6.0   
setuptools      41.0.1  
six             1.12.0  
soupsieve       1.9.2   
twint           1.2.3   /home/user/git/project/virtualenv/src/twint
urllib3         1.25.3  
wheel           0.33.4  
yarl            1.3.0   

twint package details

(virtualenv) $ pip show twint
Name: twint
Version: 1.2.3
Summary: An advanced Twitter scraping & OSINT tool.
Home-page: https://github.com/twintproject/twint
Author: Cody Zacharias
Author-email: codyzacharias@pm.me
License: MIT
Location: /home/user/git/project/virtualenv/src/twint
Requires: aiohttp, aiodns, beautifulsoup4, cchardet, elasticsearch, pysocks, pandas, aiohttp-socks, schedule, geopy, fake-useragent
Required-by: 

The machine

(virtualenv) $ uname -a
Linux work 5.1.11-200.fc29.x86_64 #1 SMP Mon Jun 17 19:30:44 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

The O/S

(virtualenv) $ lsb_release
LSB Version:    :core-4.1-amd64:core-4.1-noarch

SELinux

(virtualenv) $ getenforce
Enforcing

pielco11 commented 5 years ago

First, thank you so much for taking the time to write a properly documented issue.

Limits with --retweets

I tried twint --retweets -u BouloGiletJaune and got 176 tweets, still fewer than expected. While this might sound like Twint is doing something wrong, the point is that Twitter stops Twint before it reaches the beginning of the timeline. This can be verified in a browser, as shown in this screenshot:

[screenshot]

This is the last tweet returned by Twint (at least in my experience):

[screenshot]

Profile_full = True

This option takes a lot of time by design. It should be used only if the account is shadow banned (which means that you can't find its tweets via the search bar).

All options

There is an argument-checking error in the Twint code that I'm going to fix very soon. Please bear in mind that the All option might take a long time, since it returns tweets sent by the user, tweets sent to the user, and tweets that mention the user: https://github.com/twintproject/twint/blob/ad27650fbc0bf8c3f2c78449088a5ede7239f53a/twint/url.py#L100-L101

If you use Twint as a module, everything will work as expected.
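To illustrate why the All option is slower, here is a minimal sketch of the kind of search query it implies: one that matches tweets from, to, or mentioning a user. The function name `build_all_query` is hypothetical, and the exact string Twint builds at the linked lines may differ; only the standard Twitter search operators `from:`, `to:` and `@` are used.

```python
def build_all_query(username: str) -> str:
    """Build a search query matching tweets from, to, or mentioning a user.

    Assumption: this mirrors the idea behind Twint's All option; the exact
    query string Twint generates may differ.
    """
    user = username.strip().lstrip("@")
    if not user:
        raise ValueError("username must not be empty")
    # Three clauses joined with OR, so the result set is a superset of
    # the user's own timeline.
    return f"from:{user} OR to:{user} OR @{user}"


print(build_all_query("BouloGiletJaune"))
```

Because the query covers three sets of tweets instead of one, it can return considerably more results than a plain user search.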

Conclusions

There are limitations with --retweets and --profile-full imposed by Twitter, which we can't handle or work around.

There's an argument-checking error in the Twint code, which only affects you if you use Twint via the CLI.

ghost commented 5 years ago

@pielco11 thank you for taking the time to formulate a prompt and structured response.

While this might sound like Twint is doing something wrong, the point is that Twitter stops Twint before reaching the beginning of the timeline. This can be verified via browser.

In the browser, the last tweet returned is indeed the same one that twint grabs:

1069924606788685824 2018-12-04 12:01:19 BST <BouloGiletJaune> Un moratoire et quelques mesurettes...

In your response you're hinting that Twitter stops before reaching the beginning of the timeline. Maybe there are more tweets before that one? So I checked using the python-twitter library, which taps into the Twitter API directly (but is limited to the last 3,200 tweets), and I got 274 tweets. So it is indeed as you wrote: Twitter is limiting scraping.
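To find out which tweets are missing rather than just how many, the tweet IDs from both sources can be compared. Below is a minimal sketch, assuming twint's output lines have the shape of the sample line above (ID, date, time, timezone, `<username>`, text). The helper names `tweet_ids` and `missing_ids` are hypothetical, and the second reference ID in the usage example is a made-up placeholder, not a real tweet.

```python
import re

# Assumed line shape, based on the sample line quoted above:
#   <id> <YYYY-MM-DD> <HH:MM:SS> <tz> <username> text...
LINE_RE = re.compile(r"^(\d+) \d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} \S+ <([^>]+)> ")


def tweet_ids(lines):
    """Collect the unique tweet IDs from twint-style output lines.

    Lines that do not match the assumed shape are skipped.
    """
    ids = set()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            ids.add(int(match.group(1)))
    return ids


def missing_ids(scraped_lines, reference_ids):
    """Return the reference IDs that are absent from the scraped output."""
    return sorted(set(reference_ids) - tweet_ids(scraped_lines))


scraped = [
    "1069924606788685824 2018-12-04 12:01:19 BST <BouloGiletJaune> Un moratoire et quelques mesurettes...",
]
# The second ID is a made-up placeholder standing in for an ID obtained via the API.
reference = [1069924606788685824, 1069000000000000000]
print(missing_ids(scraped, reference))
```

Comparing IDs instead of line counts also avoids miscounting if the output contains duplicate lines.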