Status: Open. lapp0 opened this pull request 4 years ago.
Oh, that's amazing! Do multiple proxies also work with geckodriver? I had tested with Chrome and couldn't get it to work.
@AllanSCosta a new driver is created for each process in the pool, and each driver is initiated with a unique proxy.
This uses FirefoxDriver, but I think ChromeDriver would work for this too.
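A minimal sketch of that per-worker proxy wiring (the `seleniumwire_options` proxy-dict shape follows selenium-wire's documentation; the generator and names here are illustrative, not the exact code in this PR):

```python
from itertools import cycle

def make_proxy_options(proxies):
    """Yield one selenium-wire options dict per pool worker, cycling
    through the proxy list so each new driver gets its own proxy."""
    for proxy in cycle(proxies):
        yield {
            "proxy": {
                "http": f"http://{proxy}",
                "https": f"https://{proxy}",
            }
        }

# Each pool worker would draw one options dict and pass it to
# webdriver.Firefox(seleniumwire_options=...) when creating its driver.
opts = make_proxy_options(["1.2.3.4:8080", "5.6.7.8:3128"])
first = next(opts)
second = next(opts)
```

selenium-wire supports Chrome as well, so the same options dict should apply to `webdriver.Chrome`.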
Beautiful, thanks!!
@lapp0, if you don't mind me asking, why was your previous usage of `UserAgent` dropped? I just did a quick run on it, and it seemed fine.
Thanks!
@AllanSCosta users were having trouble due to twitter dropping their legacy endpoints, see the linked issues.
I get an error like this: `selenium.common.exceptions.WebDriverException: Message: 'geckodriver' executable needs to be in PATH.` Which file do I need to edit?
I got an error like this:

```
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.SessionNotCreatedException: Message: Unable to find a matching set of capabilities
```
Problem solved. I forgot to install Firefox... 😂
> I get an error like this: `selenium.common.exceptions.WebDriverException: Message: 'geckodriver' executable needs to be in PATH.` Which file do I need to edit?
You need to install geckodriver. If it's a Mac, `brew install geckodriver` should suffice.
Oh oops, you're right! I just pushed those changes in misc fixes, reverted!
Fun side note: if you want to see the browsers in action (or, if there's an issue, see what's going wrong), allow the browser to be visible by setting `driver.headless = False` here: https://github.com/taspinar/twitterscraper/pull/302/files#diff-83a91a4e1920f0a97f5f9b7c5eabefc5R48

Make sure you limit the size of your pool to 1, though!
Hi @lapp0, I'm still debugging some stuff here. For some reason, the response is fine (200) and I do manage to get data, but in `query_single_page` the array `relevant_requests` always ends up empty. For testing I'm running `tweets = get_user_data('realDonaldTrump')`.

[edit] Specifically, it seems that `isinstance(r.response.body, dict)` is always false in `query_single_page`.
@AllanSCosta I could not reproduce; I'm able to get 1300 of Trump's tweets. Could you try again with the latest changes, set `headless = False`, and tell me if you see any errors on the Twitter page itself? (Also add `-j` to your command.)

As an aside, it appears that scrolling down on Twitter stops after 1300 tweets on realDonaldTrump's page. I'll investigate how to continue scrolling.
Edit: It appears the non-js `query.py` only gets 621 tweets, so this may just be a fundamental limitation in Twitter.
https://github.com/taspinar/twitterscraper/pull/304/files appears to fix the main issue. I am going to make js optional here so we can have a backup if/when #304's solution fails.
I ran the code `tweets = get_user_data('realDonaldTrump')` and got 0 tweets. I also tried `tweets = get_query_data("BTS", poolsize=1, lang='english')` and got nothing as well.
@AllanSCosta @pumpkinw can you please add `driver.save_screenshot("foo.png")` at this line: https://github.com/taspinar/twitterscraper/pull/302/files#diff-83a91a4e1920f0a97f5f9b7c5eabefc5R126
@lapp0
The screenshot correctly depicts Trump's twitter (as if I had manually opened the browser and accessed it). Here are the versions:

- geckodriver 0.26.0
- Firefox 77.0.1 (64-bit)
- OS: macOS Mojave 10.14.5
- Selenium 3.141.0
thanks @AllanSCosta
Are you using `selenium-wire==1.1.2`? It appears I'm using a dated version (0.7.0), as I was able to reproduce this problem by upgrading to 1.1.2.
I'm using `seleniumwire` version 1.1.2 indeed :). To clarify, it is properly accessing the page; it's only the parsing of the request results that is failing, as of now. I'm happy to help restructure it for the latest version of `seleniumwire` if that's the direction you think is the way to go :)
Please try now, I have pegged selenium-wire to 1.0.1
It works now, thanks!! Was the version of seleniumwire the only thing you changed?
> Please try now, I have pegged selenium-wire to 1.0.1
Thanks! It works for me now!
@AllanSCosta Yes, versions >=1.0.2 of `selenium-wire` don't do the conversion from gzip bytes to a Python object.
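For reference, the conversion the newer versions dropped can be done by hand; a sketch (assuming a gzip `Content-Encoding` and a JSON body, which is what these endpoints return; helper name is mine):

```python
import gzip
import json

def decode_response_body(body, content_encoding="gzip"):
    """Turn a raw, possibly gzip-compressed response body (bytes) into
    a Python object, as selenium-wire <= 1.0.1 did automatically."""
    if content_encoding == "gzip":
        body = gzip.decompress(body)
    return json.loads(body)

# Simulate what selenium-wire >= 1.0.2 hands back: compressed JSON bytes.
raw = gzip.compress(json.dumps({"globalObjects": {"tweets": {}}}).encode())
data = decode_response_body(raw)
```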
Thank you so much, but I'm kinda lost. I'm new to this and I can't seem to pull your branch from my GitHub Desktop. I've installed geckodriver and selenium, but I didn't understand exactly what I have to do to run the query with your changes. Sorry if it's too much trouble!
thanks for testing @barabelha! To run with my changes, you must add the `--javascript` argument.

To use my branch you must `git remote add lapp0 https://github.com/lapp0/twitterscraper.git`, then `git fetch lapp0` and `git checkout lapp0/selenium`.
Are these changes in the master branch now? I would like to use this on my app with pip install. I know there was an issue with twitter scraping from June 1 (their old site was deprecated) so using selenium fixes that. Does the master branch now work?
@bamboozooled #304 is in `origin/master`, which fixes the legacy issue for now. It isn't on pip, though. For now you need to `git clone` the repo and run `python3 setup.py install`.
This PR isn't in master either, it's still open.
@taspinar what are the procedures to get #304 (currently in `origin/master`) to PyPI? Do we just need a new version tag on git and GitHub automagically does the work? I think #304 is an important change to get to PyPI, since it fixes the program.
Thanks a lot @lapp0 !
Hi @lapp0 , I am new at using github so I wanted to know if you could give me more details of how to run "clone the repo, pull this branch" because I'm getting the same problems of getting 0 tweets when using the twitterscraper. Thank you!!
@Michelpayan

```shell
git clone https://github.com/lapp0/twitterscraper.git
cd twitterscraper
git checkout origin/selenium
python3 setup.py install
```

along with the other install instructions in the post. Let me know if you have any questions.
Hi @lapp0, I set this up just as instructed. It works for a bit, but I keep getting this error:
```
Traceback (most recent call last):
  File "C:\Users\Administrator\Documents\twitterscraper\js\venv\Scripts\twitterscraper-script.py", line 33, in <module>
    sys.exit(load_entry_point('twitterscraper==1.4.0', 'console_scripts', 'twitterscraper')())
  File "C:\Users\Administrator\Documents\twitterscraper\js\venv\lib\site-packages\twitterscraper-1.4.0-py3.7.egg\twitterscraper\main.py", line 118, in main
  File "C:\Users\Administrator\Documents\twitterscraper\js\venv\lib\site-packages\twitterscraper-1.4.0-py3.7.egg\twitterscraper\query_js.py", line 165, in get_query_data
  File "C:\Users\Administrator\Documents\twitterscraper\js\venv\lib\site-packages\twitterscraper-1.4.0-py3.7.egg\twitterscraper\query_js.py", line 186, in retrieve_data_from_urls
  File "C:\Users\Administrator\Documents\twitterscraper\js\venv\lib\site-packages\billiard-3.6.3.0-py3.7.egg\billiard\pool.py", line 1969, in next
    raise Exception(value)
Exception: Traceback (most recent call last):
  File "C:\Users\Administrator\Documents\twitterscraper\js\venv\lib\site-packages\billiard-3.6.3.0-py3.7.egg\billiard\pool.py", line 362, in workloop
    result = (True, prepare_result(fun(*args, **kwargs)))
  File "C:\Users\Administrator\Documents\twitterscraper\js\venv\lib\site-packages\twitterscraper-1.4.0-py3.7.egg\twitterscraper\query_js.py", line 79, in query_single_page
    driver = get_driver(proxy)
  File "C:\Users\Administrator\Documents\twitterscraper\js\venv\lib\site-packages\twitterscraper-1.4.0-py3.7.egg\twitterscraper\query_js.py", line 40, in get_driver
    profile = webdriver.FirefoxProfile()
  File "C:\Users\Administrator\Documents\twitterscraper\js\venv\lib\site-packages\selenium-4.0.0a6.post2-py3.7.egg\selenium\webdriver\firefox\firefox_profile.py", line 59, in __init__
    WEBDRIVER_PREFERENCES)) as default_prefs:
FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\Administrator\\Documents\\twitterscraper\\js\\venv\\lib\\site-packages\\selenium-4.0.0a6.post2-py3.7.egg\\selenium\\webdriver\\firefox\\webdriver_prefs.json'
```
I have installed both firefox and geckodriver.
I used this on the command line: `twitterscraper realDonaldTrump -j -c -o output.csv -ow`

And this in a Python IDE: `js = get_user_data('realDonaldTrump')`

Same issue. Please let me know what I am doing wrong.
@smuotoe Not sure; it seems some users have had problems with this in the past: https://github.com/seleniumhq/selenium-google-code-issue-archive/issues/6808

However, you are on the latest unstable release. Perhaps the same mistake was made there? Could you try version 3.141.59?
I got this error:

```
Can't load /home/ml/.rnd into RNG
140700360672320:error:2406F079:random number generator:RAND_load_file:Cannot open file:../crypto/rand/randfile.c:98:Filename=/home/ml/.rnd
```

and after a couple of minutes:

```
Exception happened during processing of request from ('127.0.0.1', 50790)
Traceback (most recent call last):
  File "/usr/lib/python3.6/socketserver.py", line 654, in process_request_thread
    self.finish_request(request, client_address)
  File "/usr/lib/python3.6/socketserver.py", line 364, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/usr/local/lib/python3.6/dist-packages/selenium_wire-1.0.1-py3.6.egg/seleniumwire/proxy/proxy2.py", line 65, in __init__
    super().__init__(*args, **kwargs)
  File "/usr/lib/python3.6/socketserver.py", line 724, in __init__
    self.handle()
  File "/usr/lib/python3.6/http/server.py", line 420, in handle
    self.handle_one_request()
  File "/usr/lib/python3.6/http/server.py", line 406, in handle_one_request
    method()
  File "/usr/local/lib/python3.6/dist-packages/selenium_wire-1.0.1-py3.6.egg/seleniumwire/proxy/handler.py", line 127, in do_GET
    super().do_GET()
  File "/usr/local/lib/python3.6/dist-packages/selenium_wire-1.0.1-py3.6.egg/seleniumwire/proxy/proxy2.py", line 224, in do_GET
    self.wfile.write(res_body)
  File "/usr/lib/python3.6/socket.py", line 604, in write
    return self._sock.send(b)
  File "/usr/lib/python3.6/ssl.py", line 944, in send
    return self._sslobj.write(data)
  File "/usr/lib/python3.6/ssl.py", line 642, in write
    return self._sslobj.write(data)
ConnectionResetError: [Errno 104] Connection reset by peer
```
EDIT: looks like sudo helps, but there is still a "connection reset by peer" error here.
> @smuotoe Not sure; it seems some users have had problems with this in the past: SeleniumHQ/selenium-google-code-issue-archive#6808
> However, you are on the latest unstable release. Perhaps the same mistake was made there? Could you try version 3.141.59?
Thanks, it seems the latest version has a bug. It works with a lower version.
Hi @lapp0 Thanks for this update to twitterscraper and your help in other areas. MR #304 has already been merged and is available in version 1.5.0. twitterscraper is now at version 1.6.0 which also includes other updates.
Let me test his branch tonight and see if we can make it available as version 2.0.0
Hi @lapp0,

I tried to follow the procedures described on this page as much as possible. I implemented the fix for `dateranges` in a local version.

Whenever I use the `-j` argument, I consistently get a lower number of tweets than the indicated limit (either zero or a few dozen tweets).

I have installed: Firefox 78.0.2 on Mac, geckodriver 0.26.0, Selenium 3.141.0, macOS Catalina 10.15.5.

PS: Selenium does not open a Firefox browser.
@taspinar I have a test case that I'll push to this branch. Still working on figuring out why this version has fewer tweets.
> PS: Selenium does not open a Firefox browser.

It should? It explicitly uses geckodriver in the code. It is, however, headless. You may change `opt.headless = True` to `False` if you want to see how it's behaving with a GUI Firefox.
Remaining work: convert the raw results into `Tweet` objects.
Hi @lapp0, has there been any progress on this so far? Does your version still work after the recent changes on the Twitter website (see #336, #339, #343, #344, #337, twintproject/twint#604, Mottl/GetOldTweets3#98)? If yes, your work would be very valuable for us from @Museum-Barberini-gGmbH.
I can confirm that this method works quite well. I am wondering whether using helium would solve the geckodriver installation issue, as helium ships with common webdrivers. See helium: https://github.com/mherrmann/selenium-python-helium
@LinqLover this branch works in that it successfully retrieves tweets using a headless browser, emulating a real user scrolling. However, it doesn't convert the results to `Tweet` objects. I've been quite busy lately and haven't had a chance to address this (additionally, another user came up with a fix, making this branch less urgent for the time being).
Additionally, the API returns tweets that aren't in the given daterange: when you search for tweets in a given daterange, you might see tweets that reply to a tweet outside of it, and this script includes those tweets in the result set.
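One way to drop those out-of-range context tweets would be a post-filter on `created_at`; a sketch (the helper name and the half-open [begindate, enddate) convention are mine, the timestamp format is Twitter's):

```python
from datetime import date, datetime

TWITTER_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def within_daterange(tweet, begindate, enddate):
    """Keep only tweets whose created_at falls inside the requested
    [begindate, enddate) window, dropping out-of-range context replies."""
    created = datetime.strptime(tweet["created_at"], TWITTER_FORMAT)
    return begindate <= created.date() < enddate

tweet = {"created_at": "Sat Nov 11 02:48:31 +0000 2017"}
in_range = within_daterange(tweet, date(2017, 11, 1), date(2017, 12, 1))
```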
So to get this branch merged, we need to fix the extra tweets and get the dict-to-`Tweet` conversion fixes in.
We can shove the data we have now into `Tweet` objects, but this API returns so much more, e.g.:
```python
{'created_at': 'Sat Nov 11 02:48:31 +0000 2017', 'id': 929179001645658112, 'id_str': '929179001645658112', 'full_text': '@RutgersMBB starts its hoop season in style, annihilating CCNY 94-38. Issa Thiam paced RU with 19 & 11, & DeShawn Freeman chipped in 16. #Rutgers continues with alphabet soup on Sunday, taking on CCSU. @OTB_SBNation @TheChopNation https://t.co/XHz9XBybXN', 'truncated': False, 'display_text_range': [0, 238], 'entities': {'hashtags': [{'text': 'Rutgers', 'indices': [145, 153]}], 'symbols': [], 'user_mentions': [{'screen_name': 'RutgersMBB', 'name': 'Rutgers Basketball 🏀', 'id': 902030382, 'id_str': '902030382', 'indices': [0, 11]}, {'screen_name': 'OTB_SBNation', 'name': 'On the Banks', 'id': 1639427150, 'id_str': '1639427150', 'indices': [210, 223]}, {'screen_name': 'TheChopNation', 'name': '#Rutgers Chop Nation', 'id': 1201342780456472577, 'id_str': '1201342780456472577', 'indices': [224, 238]}], 'urls': [], 'media': [{'id': 929178995580657665, 'id_str': '929178995580657665', 'indices': [239, 262], 'media_url': 'http://pbs.twimg.com/media/DOUbX5xW4AEyfBQ.jpg', 'media_url_https': 'https://pbs.twimg.com/media/DOUbX5xW4AEyfBQ.jpg', 'url': 'https://t.co/XHz9XBybXN', 'display_url': 'pic.twitter.com/XHz9XBybXN', 'expanded_url': 'https://twitter.com/MoreSportsNow/status/929179001645658112/photo/1', 'type': 'photo', 'original_info': {'width': 1024, 'height': 794, 'focus_rects': [{'x': 0, 'y': 0, 'h': 573, 'w': 1024}, {'x': 140, 'y': 0, 'h': 794, 'w': 794}, {'x': 189, 'y': 0, 'h': 794, 'w': 696}, {'x': 339, 'y': 0, 'h': 794, 'w': 397}]}, 'sizes': {'thumb': {'w': 150, 'h': 150, 'resize': 'crop'}, 'large': {'w': 1024, 'h': 794, 'resize': 'fit'}, 'medium': {'w': 1024, 'h': 794, 'resize': 'fit'}, 'small': {'w': 680, 'h': 527, 'resize': 'fit'}}}]}, 'extended_entities': {'media': [{'id': 929178995580657665, 'id_str': '929178995580657665', 'indices': [239, 262], 'media_url': 'http://pbs.twimg.com/media/DOUbX5xW4AEyfBQ.jpg', 'media_url_https': 'https://pbs.twimg.com/media/DOUbX5xW4AEyfBQ.jpg', 'url': 'https://t.co/XHz9XBybXN', 'display_url': 'pic.twitter.com/XHz9XBybXN', 'expanded_url': 'https://twitter.com/MoreSportsNow/status/929179001645658112/photo/1', 'type': 'photo', 'original_info': {'width': 1024, 'height': 794, 'focus_rects': [{'x': 0, 'y': 0, 'h': 573, 'w': 1024}, {'x': 140, 'y': 0, 'h': 794, 'w': 794}, {'x': 189, 'y': 0, 'h': 794, 'w': 696}, {'x': 339, 'y': 0, 'h': 794, 'w': 397}]}, 'sizes': {'thumb': {'w': 150, 'h': 150, 'resize': 'crop'}, 'large': {'w': 1024, 'h': 794, 'resize': 'fit'}, 'medium': {'w': 1024, 'h': 794, 'resize': 'fit'}, 'small': {'w': 680, 'h': 527, 'resize': 'fit'}}, 'media_key': '3_929178995580657665', 'ext_alt_text': None, 'ext_media_color': {'palette': [{'rgb': {'red': 255, 'green': 255, 'blue': 255}, 'percentage': 41.66}, {'rgb': {'red': 214, 'green': 16, 'blue': 17}, 'percentage': 32.85}, {'rgb': {'red': 21, 'green': 10, 'blue': 8}, 'percentage': 15.73}, {'rgb': {'red': 109, 'green': 12, 'blue': 10}, 'percentage': 2.15}, {'rgb': {'red': 64, 'green': 6, 'blue': 6}, 'percentage': 0.63}]}, 'ext_media_availability': {'status': 'available'}, 'ext': {'mediaStats': {'r': 'Missing', 'ttl': -1}}}]}, 'source': '<a href="http://twitter.com/#!/download/ipad" rel="nofollow">Twitter for iPad</a>', 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': 902030382, 'in_reply_to_user_id_str': '902030382', 'in_reply_to_screen_name': 'RutgersMBB', 'user_id': 547045771, 'user_id_str': '547045771', 'geo': None, 'coordinates': None, 'place': None, 'contributors': None, 'is_quote_status': False, 'retweet_count': 0, 'favorite_count': 0, 'reply_count': 0, 'quote_count': 0, 'conversation_id': 929179001645658112, 'conversation_id_str': '929179001645658112', 'favorited': False, 'retweeted': False, 'possibly_sensitive': False, 'possibly_sensitive_editable': True, 'lang': 'en', 'supplemental_language': None}
```
@taspinar what are these `# hack?` items supposed to be?
```python
tweets.append(Tweet(
    screen_name=user['screen_name'],
    username=user['name'],
    user_id=tweet_item['user_id'],
    tweet_id=tid,
    tweet_url=f'https://twitter.com/{user["screen_name"]}/status/{tid}',  # hack?
    timestamp=timestamp_of_tweet(tweet_item),  # hack?
    timestamp_epochs=timestamp_of_tweet(tweet_item),  # hack?
    text=tweet_item['full_text'],
    text_html=None,  # hack?
    links=tweet_item['entities']['urls'],
    hashtags=tweet_item['entities']['hashtags'],
    has_media=None,  # hack?
    img_urls=None,  # hack?
    parent_tweet_id=tweet_item['in_reply_to_status_id'],
    reply_to_users=tweet_item['in_reply_to_user_id'],  # hack?
))
```
This is what `list(data['tweets'].items())[0]` looks like:
```python
('932325048706387969', {'created_at': 'Sun Nov 19 19:09:48 +0000 2017', 'id': 932325048706387969, 'id_str': '932325048706387969', 'full_text': 'Sonic Burger with alphabet soup for dipping', 'truncated': False, 'display_text_range': [0, 43], 'entities': {'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': []}, 'source': '<a href="http://www.empty-handed.com/" rel="nofollow">Secret Menu Item Generator</a>', 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user_id': 775339094603927552, 'user_id_str': '775339094603927552', 'geo': None, 'coordinates': None, 'place': None, 'contributors': None, 'is_quote_status': False, 'retweet_count': 0, 'favorite_count': 0, 'reply_count': 0, 'quote_count': 0, 'conversation_id': 932325048706387969, 'conversation_id_str': '932325048706387969', 'favorited': False, 'retweeted': False, 'lang': 'en', 'supplemental_language': None})
```
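Those raw `created_at` strings can be parsed into the `timestamp` / `timestamp_epochs` fields the `Tweet` constructor expects; a minimal sketch (the helper name is hypothetical, not from the branch):

```python
from datetime import datetime

def timestamps_of_tweet(tweet_item):
    """Derive an aware datetime and an epoch-seconds integer from the
    'created_at' field of a raw tweet dict."""
    ts = datetime.strptime(tweet_item["created_at"], "%a %b %d %H:%M:%S %z %Y")
    return ts, int(ts.timestamp())

item = ('932325048706387969', {'created_at': 'Sun Nov 19 19:09:48 +0000 2017',
                               'full_text': 'Sonic Burger with alphabet soup for dipping'})
tid, tweet_item = item
ts, epoch = timestamps_of_tweet(tweet_item)
```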
I fixed the daterange issue and Tweet issue (partially) and pushed.
@JiGGie145 I'm not familiar with helium, but it looks quite useful given the issues users in this thread have been having. Would you like to submit your own pull request targeting this one that includes helium?
Hi @lapp0, your work is beautiful, but it doesn't work for me :c

Firefox 80.0.1 on Mac, geckodriver 0.27.0, selenium 3.141.0, selenium-wire 1.0.1, macOS 10.15.6.

My code is:

```python
import twitterscraper

tweets = twitterscraper.get_user_data('realDonaldTrump', poolsize=1)
print(tweets)
```

and when the Firefox browser opens, the message is the following (screenshot not included). Please help. Could it have to do with the proxy?
Thanks @christiangfv. On the surface that appears to be an issue with one or more proxies.

1) Have you tried running it a second time?
2) Have you tried running twitterscraper with proxies disabled?

Also, what error are you getting in the terminal (not the browser)? This could help us handle broken proxies if a broken proxy is the culprit.
Thanks for your attention @lapp0. In my terminal:

```
INFO: Using proxy 143.255.52.102:31158
INFO: Scraping tweets from https://twitter.com/search?f=live&vertical=default&q=from:realDonaldTrump since:2020-09-19 until:2020-09-21&l=
INFO: Using proxy 194.156.229.160:80
INFO: Got 0 data (0 new).
INFO: Scraping tweets from https://twitter.com/search?f=live&vertical=default&q=filter:nativeretweets from:realDonaldTrump since:2020-09-18 until:2020-09-19&l=
INFO: Using proxy 114.7.193.214:8080
INFO: Got 0 data (0 new).
```
@christiangfv try `--disableproxy`. You may be experiencing this issue: https://github.com/wkeeling/selenium-wire/issues/55#issuecomment-511182605

I'm upgrading `selenium-wire`; hopefully this resolves the issue.
Not working with `--disableproxy` nor `-dp`. I tried terminal commands, for example: `twitterscraper Trump --disableproxy --limit 10 --output=tweets.json`. And when I modify the `get_driver()` function, the same thing happens.
A variety of issues have recently arisen due to Twitter disabling their "Legacy" API, breaking twitterscraper.

To fix this, I re-implemented `query.py` using Selenium, allowing twitterscraper to programmatically control a background (headless) Firefox instance. Additionally, I refactored `query.py` (now `query_js.py`) to be a bit cleaner.

Based on my testing, this branch can successfully download tweets from user pages and via query strings.
### How to run

Please test this change so I can fix any bugs!

1) clone the repo, pull this branch
2) install the selenium dependencies (geckodriver and Firefox): https://selenium-python.readthedocs.io/installation.html
3) enter the twitterscraper directory and run `python3 setup.py install`
4) run your query

If you have any bugs, please paste your command and full output in this thread!
### Improvements

- Adds `get_query_data` (all tweets / metadata from a specific query) and `get_user_data` (all tweets / metadata on a user's page).
- `--user` wouldn't get all of a user's tweets and retweets due to a limitation in Twitter's scrollback for a given user. Now a workaround enables retrieving tweets and retweets for a specific user via a custom search: `f'filter:nativeretweets from:{from_user}'`
- `query_user_info` broken
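The per-user workaround above amounts to a small query builder; a sketch (the function name is mine, not from the branch):

```python
def user_query(from_user):
    """Build the search query that returns both a user's tweets and
    their native retweets, working around Twitter's limited
    profile-page scrollback."""
    return f"filter:nativeretweets from:{from_user}"

q = user_query("realDonaldTrump")
```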
### Notes

- `pos` was removed: the browser is now used to store `pos` state implicitly
- `--javascript` and `-j` now decide whether to use `query.py` or `query_js.py`
### Problems

- ~~`limit` no longer works, though this should be relatively easy to fix if sufficiently desired~~ (limit has now been implemented)
- `query_user_info` and `query_user_page` haven't been converted to use selenium, so they don't work right now. However, this data is returned as part of the metadata mentioned in the second Improvements bullet
- The selenium dependencies (geckodriver, Firefox) can't be installed via `pip install`. However, use of Docker can alleviate this.