taspinar / twitterscraper

Scrape Twitter for Tweets
MIT License
2.39k stars 581 forks

WIP: Use Selenium to Enable Javascript / Real-Browser Scraping + Misc Fixes #302

Open lapp0 opened 4 years ago

lapp0 commented 4 years ago

A variety of issues have recently arisen due to Twitter disabling their "Legacy" API, breaking twitterscraper:

To fix this, I re-implemented query.py using Selenium, allowing twitterscraper to programmatically control a background (headless) Firefox instance.

Additionally, I refactored query.py (now query_js.py) to be a bit cleaner.

Based on my testing, this branch can successfully download tweets from user pages, and via query strings.

How to run

Please test this change so I can fix any bugs!

  1. Clone the repo and pull this branch.
  2. Install the selenium dependencies (geckodriver and Firefox): https://selenium-python.readthedocs.io/installation.html
  3. Enter the twitterscraper directory and run python3 setup.py install
  4. Run your query.
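A minimal command-line sketch of those steps (the fork URL, branch name, and OS package names are assumptions; adjust for your platform):

```shell
# 1) clone the repo and check out this PR's branch (fork URL and branch name assumed)
git clone https://github.com/lapp0/twitterscraper.git
cd twitterscraper
git checkout selenium

# 2) install the selenium dependencies: Firefox plus a matching geckodriver
#    (Debian/Ubuntu package names assumed; on macOS: brew install firefox geckodriver)
sudo apt-get install -y firefox firefox-geckodriver

# 3) install twitterscraper from the checkout
python3 setup.py install

# 4) run a query through the new JavaScript path (-j)
twitterscraper Trump -j --limit 10 --output=tweets.json
```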

If you have any bugs, please paste your command and full output in this thread!

Improvements

Notes

Problems

AllanSCosta commented 4 years ago

Oh, that's amazing! Do multiple proxies also work with geckodriver? I had tested with Chrome and couldn't get it to work.

lapp0 commented 4 years ago

@AllanSCosta a new driver is created for each process in the pool, and each driver is initiated with a unique proxy.

This uses FirefoxDriver, but I think ChromeDriver would work for this too.
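A stdlib-only sketch of that per-worker pairing (the helper names here are made up; the real branch wires each proxy into a selenium-wire FirefoxDriver instead of the stub below):

```python
from itertools import cycle
from multiprocessing import Pool

def assign_proxies(urls, proxies):
    # Pair each page-scrape task with a proxy, reusing proxies
    # round-robin when there are more tasks than proxies.
    return list(zip(urls, cycle(proxies)))

def scrape(task):
    url, proxy = task
    # In the real branch this starts a headless FirefoxDriver configured
    # with `proxy` and drives it to `url`; stubbed out for illustration.
    return f"{url} via {proxy}"

if __name__ == '__main__':
    tasks = assign_proxies(
        ['https://twitter.com/a', 'https://twitter.com/b', 'https://twitter.com/c'],
        ['1.2.3.4:8080', '5.6.7.8:3128'])
    with Pool(2) as pool:
        print(pool.map(scrape, tasks))
```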

AllanSCosta commented 4 years ago

Beautiful, thanks!!

@lapp0, if you don't mind me asking, why was your previous usage of UserAgent dropped? I just did a quick run on it, and it seemed fine.

Thanks!

lapp0 commented 4 years ago

@AllanSCosta users were having trouble due to twitter dropping their legacy endpoints, see the linked issues.

hakanyusufoglu commented 4 years ago

@AllanSCosta users were having trouble due to twitter dropping their legacy endpoints, see the linked issues. Thank you,

I get an error like this: selenium.common.exceptions.WebDriverException: Message: 'geckodriver' executable needs to be in PATH.

Which file do i need to edit?

yiw0104 commented 4 years ago

I got error like this: raise exception_class(message, screen, stacktrace) selenium.common.exceptions.SessionNotCreatedException: Message: Unable to find a matching set of capabilities

yiw0104 commented 4 years ago

I got error like this: raise exception_class(message, screen, stacktrace) selenium.common.exceptions.SessionNotCreatedException: Message: Unable to find a matching set of capabilities

Problem solved. I forgot to get Firefox installed...😂

AllanSCosta commented 4 years ago

I get an error like this: selenium.common.exceptions.WebDriverException: Message: 'geckodriver' executable needs to be in PATH.

Which file do i need to edit?

You need to install Geckodriver. If it's a mac, brew install geckodriver should suffice.

lapp0 commented 4 years ago

Oh oops, you're right! I just pushed those changes in misc fixes, reverted!

lapp0 commented 4 years ago

Fun side note: if you want to see the browsers in action (or, if there's an issue, see what's going wrong), make the browser visible by setting driver.headless = False here: https://github.com/taspinar/twitterscraper/pull/302/files#diff-83a91a4e1920f0a97f5f9b7c5eabefc5R48

Make sure you limit the size of your pool to 1 though!

AllanSCosta commented 4 years ago

Hi @lapp0, I'm still debugging some stuff here. For some reason, the response is proper (200) and I do manage to get data, but in query_single_page the array relevant_requests ends up always empty. For testing I'm running tweets = get_user_data('realDonaldTrump').

[edit] Specifically, it seems that isinstance(r.response.body, dict) is always false in query_single_page

lapp0 commented 4 years ago

@AllanSCosta I could not reproduce. I'm able to get 1300 of trumps tweets.

Could you try again with latest changes, and set headless = False, and tell me if you see any errors on the twitter page itself? (Also add -j to your command)

lapp0 commented 4 years ago

As an aside, it appears that scrolling down on twitter stops after 1300 tweets on realDonaldTrump's page. I'll investigate how to continue scrolling.

Edit: It appears the non-js query.py only gets 621 tweets, so this may just be a fundamental limitation in twitter.

lapp0 commented 4 years ago

https://github.com/taspinar/twitterscraper/pull/304/files appears to fix the main issue. I am going to make js optional here so we can have a backup if/when #304's solution fails.

yiw0104 commented 4 years ago

I ran the code tweets = get_user_data('realDonaldTrump') and got 0 tweets. I also tried tweets = get_query_data("BTS", poolsize = 1, lang = 'english') and got nothing as well.

lapp0 commented 4 years ago

@AllanSCosta @pumpkinw can you please 1) add driver.get_screenshot("foo.png") to this line https://github.com/taspinar/twitterscraper/pull/302/files#diff-83a91a4e1920f0a97f5f9b7c5eabefc5R126

AllanSCosta commented 4 years ago

@lapp0

The screenshot correctly depicts Trump's twitter (as if I had manually opened the browser and accessed it). Here are the versions:

geckodriver 0.26.0 Firefox 77.0.1 (64-bit) OS and version macOS Mojave 10.14.5 Selenium 3.141.0

lapp0 commented 4 years ago

thanks @AllanSCosta

Are you using selenium-wire==1.1.2? It appears I'm using a dated version (0.7.0), as I was able to reproduce this problem by upgrading to 1.1.2.

AllanSCosta commented 4 years ago

I'm using seleniumwire version 1.1.2 indeed :). To clarify, it is properly accessing the page; it's only the parsing of the request results that is failing, as of now. I'm happy to help restructure it for the latest version of seleniumwire if you think that's the direction to go :)

lapp0 commented 4 years ago

Please try now, I have pegged selenium-wire to 1.0.1

AllanSCosta commented 4 years ago

It works now, thanks!! Was the only thing you changed the version of seleniumwire?

yiw0104 commented 4 years ago

Please try now, I have pegged selenium-wire to 1.0.1

Thanks! It works for me now!

lapp0 commented 4 years ago

@AllanSCosta Yes, version >=1.0.2 of selenium-wire doesn't do conversion from gzip bytes -> python object.
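The behavior difference can be sketched with the standard library: with selenium-wire 1.0.x the body arrives already parsed, while newer versions hand back raw (possibly gzipped) bytes that the caller must decode itself. The helper name and the payload shape below are illustrative:

```python
import gzip
import json

def decode_body(raw: bytes) -> dict:
    # Newer selenium-wire versions return the raw response bytes, so
    # decompress (if gzip-compressed) and JSON-parse manually.
    try:
        raw = gzip.decompress(raw)
    except OSError:
        pass  # body was not gzip-compressed
    return json.loads(raw.decode('utf-8'))

wire_body = gzip.compress(json.dumps({'tweets': {}}).encode())
print(decode_body(wire_body))  # → {'tweets': {}}
```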

barabelha commented 4 years ago

Thank you so much, but I'm kinda lost? I'm new to this and I can't seem to pull your branch from my github desktop. I've installed gecko and selenium, but I didn't understand exactly what I have to do to run the query with your changes. Sorry if it's too much trouble!

lapp0 commented 4 years ago

thanks for testing @barabelha ! To run with my changes, you must add the --javascript argument.

To use my branch, add my fork as a remote with git remote add lapp0 https://github.com/lapp0/twitterscraper.git, then git fetch lapp0 and git checkout lapp0/selenium

oluwatimio commented 4 years ago

Are these changes in the master branch now? I would like to use this on my app with pip install. I know there was an issue with twitter scraping from June 1 (their old site was deprecated) so using selenium fixes that. Does the master branch now work?

lapp0 commented 4 years ago

@bamboozooled #304 is in origin/master which fixes the legacy issue for now. It isn't on pip though. For now you need to git clone and python3 setup.py install

This PR isn't in master either, it's still open.

@taspinar what are the procedures to get #304 (currently in origin/master) to pypi? Do we just need a new version tag on git and github automagically does the work? I think #304 is an important change to get to pypi since it fixes the program.

oluwatimio commented 4 years ago

Thanks a lot @lapp0 !

Michelpayan commented 4 years ago

Hi @lapp0 , I am new at using github so I wanted to know if you could give me more details of how to run "clone the repo, pull this branch" because I'm getting the same problems of getting 0 tweets when using the twitterscraper. Thank you!!

lapp0 commented 4 years ago

@Michelpayan

  1. Install git
  2. run the commands
    git clone https://github.com/lapp0/twitterscraper.git
    git checkout origin/selenium

    change directories into twitterscraper, then run python3 setup.py install along with the other install instructions in the post.

Let me know if you have any questions.

smuotoe commented 4 years ago

Hi @lapp0

I set this up just as instructed. It works for a bit, but I keep getting this error:

Traceback (most recent call last):
  File "C:\Users\Administrator\Documents\twitterscraper\js\venv\Scripts\twitterscraper-script.py", line 33, in <module>
    sys.exit(load_entry_point('twitterscraper==1.4.0', 'console_scripts', 'twitterscraper')())
  File "C:\Users\Administrator\Documents\twitterscraper\js\venv\lib\site-packages\twitterscraper-1.4.0-py3.7.egg\twitterscraper\main.py", line 118, in main
  File "C:\Users\Administrator\Documents\twitterscraper\js\venv\lib\site-packages\twitterscraper-1.4.0-py3.7.egg\twitterscraper\query_js.py", line 165, in get_query_data
  File "C:\Users\Administrator\Documents\twitterscraper\js\venv\lib\site-packages\twitterscraper-1.4.0-py3.7.egg\twitterscraper\query_js.py", line 186, in retrieve_data_from_urls
  File "C:\Users\Administrator\Documents\twitterscraper\js\venv\lib\site-packages\billiard-3.6.3.0-py3.7.egg\billiard\pool.py", line 1969, in next
    raise Exception(value)
Exception: Traceback (most recent call last):
  File "C:\Users\Administrator\Documents\twitterscraper\js\venv\lib\site-packages\billiard-3.6.3.0-py3.7.egg\billiard\pool.py", line 362, in workloop
    result = (True, prepare_result(fun(*args, **kwargs)))
  File "C:\Users\Administrator\Documents\twitterscraper\js\venv\lib\site-packages\twitterscraper-1.4.0-py3.7.egg\twitterscraper\query_js.py", line 79, in query_single_page
    driver = get_driver(proxy)
  File "C:\Users\Administrator\Documents\twitterscraper\js\venv\lib\site-packages\twitterscraper-1.4.0-py3.7.egg\twitterscraper\query_js.py", line 40, in get_driver
    profile = webdriver.FirefoxProfile()
  File "C:\Users\Administrator\Documents\twitterscraper\js\venv\lib\site-packages\selenium-4.0.0a6.post2-py3.7.egg\selenium\webdriver\firefox\firefox_profile.py", line 59, in __init__
    WEBDRIVER_PREFERENCES)) as default_prefs:
FileNotFoundError: [Errno 2] No such file or directory:
'C:\\Users\\Administrator\\Documents\\twitterscraper\\js\\venv\\lib\\site-packages\\selenium-4.0.0a6.post2-py3.7.egg\\selenium\\webdriver\\firefox\\webdriver_prefs.json'

I have installed both firefox and geckodriver.

I used this on the command line: twitterscraper realDonaldTrump -j -c -o output.csv -ow
And this in a Python IDE: js = get_user_data('realDonaldTrump')

Same issue.

Please let me know what I am doing wrong.

lapp0 commented 4 years ago

@smuotoe Not sure, it seems some users have had problems in the past with this https://github.com/seleniumhq/selenium-google-code-issue-archive/issues/6808

However, you are on the latest unstable version; perhaps it has the same bug. Could you try version 3.141.59?
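If pip pulled in the 4.0 alpha, pinning back to the versions reported working in this thread is one way out (the exact pins here are a sketch, not tested against this branch):

```shell
pip uninstall -y selenium
pip install 'selenium==3.141.59' 'selenium-wire==1.0.1'
```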

lukaspistelak commented 4 years ago

i got this error:

Can't load /home/ml/.rnd into RNG 140700360672320:error:2406F079:random number generator:RAND_load_file:Cannot open file:../crypto/rand/randfile.c:98:Filename=/home/ml/.rnd

and after couple of min.:

Exception happened during processing of request from ('127.0.0.1', 50790)
Traceback (most recent call last):
  File "/usr/lib/python3.6/socketserver.py", line 654, in process_request_thread
    self.finish_request(request, client_address)
  File "/usr/lib/python3.6/socketserver.py", line 364, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/usr/local/lib/python3.6/dist-packages/selenium_wire-1.0.1-py3.6.egg/seleniumwire/proxy/proxy2.py", line 65, in __init__
    super().__init__(*args, **kwargs)
  File "/usr/lib/python3.6/socketserver.py", line 724, in __init__
    self.handle()
  File "/usr/lib/python3.6/http/server.py", line 420, in handle
    self.handle_one_request()
  File "/usr/lib/python3.6/http/server.py", line 406, in handle_one_request
    method()
  File "/usr/local/lib/python3.6/dist-packages/selenium_wire-1.0.1-py3.6.egg/seleniumwire/proxy/handler.py", line 127, in do_GET
    super().do_GET()
  File "/usr/local/lib/python3.6/dist-packages/selenium_wire-1.0.1-py3.6.egg/seleniumwire/proxy/proxy2.py", line 224, in do_GET
    self.wfile.write(res_body)
  File "/usr/lib/python3.6/socket.py", line 604, in write
    return self._sock.send(b)
  File "/usr/lib/python3.6/ssl.py", line 944, in send
    return self._sslobj.write(data)
  File "/usr/lib/python3.6/ssl.py", line 642, in write
    return self._sslobj.write(data)
ConnectionResetError: [Errno 104] Connection reset by peer

EDIT:

looks like sudo helps, but the connection reset by peer error is still there.

smuotoe commented 4 years ago

@smuotoe Not sure, it seems some users have had problems in the past with this SeleniumHQ/selenium-google-code-issue-archive#6808

However you are on latest unstable. Perhaps the same mistake was made there? Could you try version 3.141.59?

Thanks, it seems the latest version has a bug. It works with a lower version.

taspinar commented 3 years ago

Hi @lapp0 Thanks for this update to twitterscraper and your help in other areas. MR #304 has already been merged and is available in version 1.5.0. twitterscraper is now at version 1.6.0 which also includes other updates.

Let me test this branch tonight and see if we can make it available as version 2.0.0

taspinar commented 3 years ago

Hi @lapp0 I tried to follow the procedures described on this page as closely as possible. I implemented the fix for dateranges in a local version. Whenever I use the -j argument, I consistently get fewer tweets than the indicated limit (zero tweets, or a few tens of tweets).

I have installed Firefox 78.0.2 on Mac, GeckoDriver 0.26.0 Selenium version 3.141.0, MacOS Catalina 10.15.5

PS: Selenium does not open a Firefox browser.

lapp0 commented 3 years ago

@taspinar I have a test case that I'll push to this branch. Still working on figuring out why this version has fewer tweets.

PS: Selenium does not open a Firefox browser.

It should? It explicitly uses geckodriver in the code. It is, however, headless. You may change opt.headless = True to False if you want to see how it's behaving with a GUI Firefox.

lapp0 commented 3 years ago

Remaining work:

LinqLover commented 3 years ago

Hi @lapp0, has there been any progress on this so far? Does your version still work after the recent changes on the Twitter website (see #336, #339, #343, #344, #337, twintproject/twint#604, Mottl/GetOldTweets3#98)? If yes, your work would be very valuable for us from @Museum-Barberini-gGmbH.

JiGGie145 commented 3 years ago

I can confirm that this method works quite well. I wonder whether using helium would solve the geckodriver installation issue, as helium ships with common webdrivers.

See helium https://github.com/mherrmann/selenium-python-helium

lapp0 commented 3 years ago

@LinqLover this branch works in that it retrieves tweets using a headless browser successfully, emulating a real user scrolling. However, it doesn't convert the results to a Tweet object. I've been quite busy lately and haven't had a chance to address this (additionally, another user came up with a fix making this branch less urgent for the time being).

Additionally, the API returns tweets that aren't in the given daterange. When you search for a tweet in a given daterange, you might see tweets that reply to a tweet outside of the daterange, and this script includes those tweets in the result set.

So to get this branch merged, we need to get the extra-tweets fix and the dict-to-Tweet conversion in.
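A stdlib-only sketch of the client-side daterange filter (the field format follows the sample payloads in this thread; the helper is illustrative, not the branch's actual code):

```python
from datetime import date, datetime

def within_daterange(tweet_item, begindate, enddate):
    # The endpoint can return replies whose parents fall outside the
    # requested range, so re-check each tweet's created_at client-side.
    ts = datetime.strptime(tweet_item['created_at'],
                           '%a %b %d %H:%M:%S %z %Y')
    return begindate <= ts.date() < enddate

item = {'created_at': 'Sat Nov 11 02:48:31 +0000 2017'}
print(within_daterange(item, date(2017, 11, 1), date(2017, 12, 1)))  # → True
```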

We can shove the data we have now into Tweet objects, but this API returns so much more, e.g.

{'created_at': 'Sat Nov 11 02:48:31 +0000 2017', 'id': 929179001645658112, 'id_str': '929179001645658112', 'full_text': '@RutgersMBB starts its hoop season in style, annihilating CCNY 94-38. Issa Thiam paced RU with 19 &amp; 11, &amp; DeShawn Freeman chipped in 16. #Rutgers continues with alphabet soup on Sunday, taking on CCSU. @OTB_SBNation @TheChopNation https://t.co/XHz9XBybXN', 'truncated': False, 'display_text_range': [0, 238], 'entities': {'hashtags': [{'text': 'Rutgers', 'indices': [145, 153]}], 'symbols': [], 'user_mentions': [{'screen_name': 'RutgersMBB', 'name': 'Rutgers Basketball 🏀', 'id': 902030382, 'id_str': '902030382', 'indices': [0, 11]}, {'screen_name': 'OTB_SBNation', 'name': 'On the Banks', 'id': 1639427150, 'id_str': '1639427150', 'indices': [210, 223]}, {'screen_name': 'TheChopNation', 'name': '#Rutgers Chop Nation', 'id': 1201342780456472577, 'id_str': '1201342780456472577', 'indices': [224, 238]}], 'urls': [], 'media': [{'id': 929178995580657665, 'id_str': '929178995580657665', 'indices': [239, 262], 'media_url': 'http://pbs.twimg.com/media/DOUbX5xW4AEyfBQ.jpg', 'media_url_https': 'https://pbs.twimg.com/media/DOUbX5xW4AEyfBQ.jpg', 'url': 'https://t.co/XHz9XBybXN', 'display_url': 'pic.twitter.com/XHz9XBybXN', 'expanded_url': 'https://twitter.com/MoreSportsNow/status/929179001645658112/photo/1', 'type': 'photo', 'original_info': {'width': 1024, 'height': 794, 'focus_rects': [{'x': 0, 'y': 0, 'h': 573, 'w': 1024}, {'x': 140, 'y': 0, 'h': 794, 'w': 794}, {'x': 189, 'y': 0, 'h': 794, 'w': 696}, {'x': 339, 'y': 0, 'h': 794, 'w': 397}]}, 'sizes': {'thumb': {'w': 150, 'h': 150, 'resize': 'crop'}, 'large': {'w': 1024, 'h': 794, 'resize': 'fit'}, 'medium': {'w': 1024, 'h': 794, 'resize': 'fit'}, 'small': {'w': 680, 'h': 527, 'resize': 'fit'}}}]}, 'extended_entities': {'media': [{'id': 929178995580657665, 'id_str': '929178995580657665', 'indices': [239, 262], 'media_url': 'http://pbs.twimg.com/media/DOUbX5xW4AEyfBQ.jpg', 'media_url_https': 
'https://pbs.twimg.com/media/DOUbX5xW4AEyfBQ.jpg', 'url': 'https://t.co/XHz9XBybXN', 'display_url': 'pic.twitter.com/XHz9XBybXN', 'expanded_url': 'https://twitter.com/MoreSportsNow/status/929179001645658112/photo/1', 'type': 'photo', 'original_info': {'width': 1024, 'height': 794, 'focus_rects': [{'x': 0, 'y': 0, 'h': 573, 'w': 1024}, {'x': 140, 'y': 0, 'h': 794, 'w': 794}, {'x': 189, 'y': 0, 'h': 794, 'w': 696}, {'x': 339, 'y': 0, 'h': 794, 'w': 397}]}, 'sizes': {'thumb': {'w': 150, 'h': 150, 'resize': 'crop'}, 'large': {'w': 1024, 'h': 794, 'resize': 'fit'}, 'medium': {'w': 1024, 'h': 794, 'resize': 'fit'}, 'small': {'w': 680, 'h': 527, 'resize': 'fit'}}, 'media_key': '3_929178995580657665', 'ext_alt_text': None, 'ext_media_color': {'palette': [{'rgb': {'red': 255, 'green': 255, 'blue': 255}, 'percentage': 41.66}, {'rgb': {'red': 214, 'green': 16, 'blue': 17}, 'percentage': 32.85}, {'rgb': {'red': 21, 'green': 10, 'blue': 8}, 'percentage': 15.73}, {'rgb': {'red': 109, 'green': 12, 'blue': 10}, 'percentage': 2.15}, {'rgb': {'red': 64, 'green': 6, 'blue': 6}, 'percentage': 0.63}]}, 'ext_media_availability': {'status': 'available'}, 'ext': {'mediaStats': {'r': 'Missing', 'ttl': -1}}}]}, 'source': '<a href="http://twitter.com/#!/download/ipad" rel="nofollow">Twitter for iPad</a>', 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': 902030382, 'in_reply_to_user_id_str': '902030382', 'in_reply_to_screen_name': 'RutgersMBB', 'user_id': 547045771, 'user_id_str': '547045771', 'geo': None, 'coordinates': None, 'place': None, 'contributors': None, 'is_quote_status': False, 'retweet_count': 0, 'favorite_count': 0, 'reply_count': 0, 'quote_count': 0, 'conversation_id': 929179001645658112, 'conversation_id_str': '929179001645658112', 'favorited': False, 'retweeted': False, 'possibly_sensitive': False, 'possibly_sensitive_editable': True, 'lang': 'en', 'supplemental_language': None}

lapp0 commented 3 years ago

@taspinar what are these #hack? items supposed to be?

        tweets.append(Tweet(
            screen_name=user['screen_name'],
            username=user['name'],
            user_id=tweet_item['user_id'],
            tweet_id=tid,
            tweet_url=f'https://twitter.com/{user["screen_name"]}/status/{tid}',  # hack?
            timestamp=timestamp_of_tweet(tweet_item),  # hack?
            timestamp_epochs=timestamp_of_tweet(tweet_item),  # hack?
            text=tweet_item['full_text'],
            text_html=None,  # hack?
            links=tweet_item['entities']['urls'],
            hashtags=tweet_item['entities']['hashtags'],
            has_media=None,  # hack?
            img_urls=None,  # hack?
            parent_tweet_id=tweet_item['in_reply_to_status_id'],
            reply_to_users=tweet_item['in_reply_to_user_id'],  # hack?
        ))

This is what data['tweets'].items()[0] looks like:

('932325048706387969', {'created_at': 'Sun Nov 19 19:09:48 +0000 2017', 'id': 932325048706387969, 'id_str': '932325048706387969', 'full_text': 'Sonic Burger with alphabet soup for dipping', 'truncated': False, 'display_text_range': [0, 43], 'entities': {'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': []}, 'source': '<a href="http://www.empty-handed.com/" rel="nofollow">Secret Menu Item Generator</a>', 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user_id': 775339094603927552, 'user_id_str': '775339094603927552', 'geo': None, 'coordinates': None, 'place': None, 'contributors': None, 'is_quote_status': False, 'retweet_count': 0, 'favorite_count': 0, 'reply_count': 0, 'quote_count': 0, 'conversation_id': 932325048706387969, 'conversation_id_str': '932325048706387969', 'favorited': False, 'retweeted': False, 'lang': 'en', 'supplemental_language': None})

I fixed the daterange issue and Tweet issue (partially) and pushed.
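For the dict-to-Tweet conversion, the unambiguous fields can be pulled straight from an item like the sample above (a sketch; the field names mirror the constructor call earlier in the thread, and the helper name is made up):

```python
def tweet_fields(tid, item):
    # Map a raw API item onto the Tweet fields that need no guessing;
    # the '# hack?' fields from the constructor above are left out.
    return {
        'tweet_id': tid,
        'user_id': item['user_id'],
        'text': item['full_text'],
        'links': item['entities']['urls'],
        'hashtags': [h['text'] for h in item['entities']['hashtags']],
        'parent_tweet_id': item['in_reply_to_status_id'],
    }

item = {'full_text': 'Sonic Burger with alphabet soup for dipping',
        'user_id': 775339094603927552,
        'entities': {'hashtags': [], 'urls': []},
        'in_reply_to_status_id': None}
print(tweet_fields('932325048706387969', item)['text'])
# → Sonic Burger with alphabet soup for dipping
```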

lapp0 commented 3 years ago

@JiGGie145 I'm not familiar with helium, but it looks quite useful given the issues users in this thread have been having. Would you like to submit your own pull request targeting this one that includes helium?

christiangfv commented 3 years ago

Hi @lapp0, your work is beautiful, but it doesn't work for me :c

Mozilla Firefox 80.0.1 on MAC geckodriverVersion': '0.27.0' selenium: 3.141.0 selenium-wire: 1.0.1 macOS 10.15.6

my code is:

    import twitterscraper
    tweets = twitterscraper.get_user_data('realDonaldTrump', poolsize=1)
    print(tweets)

and when Firefox opens, it shows the message in the attached image. Please help. Could it have to do with the proxy?

lapp0 commented 3 years ago

Thanks @christiangfv

On the surface that appears to be an issue with one or many proxies.

  1. Have you tried running it a second time?
  2. Have you tried running twitterscraper with proxies disabled?

Also what error are you getting in the terminal? (Not browser) This could help us handle broken proxies if a broken proxy is the culprit.

christiangfv commented 3 years ago

Thanks for your attention @lapp0

  1. Yes, I have tried several times, even restarting the machine.
  2. how i can disable proxies?

In my terminal:

INFO: Using proxy 143.255.52.102:31158
INFO: Scraping tweets from https://twitter.com/search?f=live&vertical=default&q=from:realDonaldTrump since:2020-09-19 until:2020-09-21&l=
INFO: Using proxy 194.156.229.160:80
INFO: Got 0 data (0 new).
INFO: Scraping tweets from https://twitter.com/search?f=live&vertical=default&q=filter:nativeretweets from:realDonaldTrump since:2020-09-18 until:2020-09-19&l=
INFO: Using proxy 114.7.193.214:8080
INFO: Got 0 data (0 new).

lapp0 commented 3 years ago

@christiangfv try --disableproxy

lapp0 commented 3 years ago

You may be experiencing this issue https://github.com/wkeeling/selenium-wire/issues/55#issuecomment-511182605

lapp0 commented 3 years ago

I'm upgrading selenium-wire, hopefully this resolves the issue.

christiangfv commented 3 years ago

Neither --disableproxy nor -dp works. I tried terminal commands, for example: twitterscraper Trump --disableproxy --limit 10 --output=tweets.json. And when I modify the get_driver() function, the same thing happens (see attached image).