taspinar / twitterscraper

Scrape Twitter for Tweets
MIT License

WIP: Use Selenium to Enable Javascript / Real-Browser Scraping + Misc Fixes #302

Open lapp0 opened 4 years ago

lapp0 commented 4 years ago

A variety of issues have recently arisen due to Twitter disabling their "Legacy" API, breaking twitterscraper:

To fix this, I re-implemented query.py using Selenium, allowing twitterscraper to programmatically control a background (headless) Firefox instance.

Additionally, I refactored query.py (now query_js.py) to be a bit cleaner.
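For anyone unfamiliar with Selenium, the core of the approach is a headless Firefox session driven from Python. A minimal sketch (illustrative only, not the exact code in query_js.py):

from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.headless = True  # run Firefox without a visible window

driver = webdriver.Firefox(options=options)
driver.set_page_load_timeout(30)
try:
    # Twitter now renders tweets with javascript, which is why a real
    # browser is needed instead of plain HTTP requests.
    driver.get('https://twitter.com/search?f=live&q=foo%20bar')
    html = driver.page_source
finally:
    driver.quit()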

Based on my testing, this branch can successfully download tweets from user pages, and via query strings.

How to run

Please test this change so I can fix any bugs!

1) Clone the repo and pull this branch.
2) Install the Selenium dependencies (geckodriver and Firefox): https://selenium-python.readthedocs.io/installation.html
3) Enter the twitterscraper directory and run python3 setup.py install
4) Run your query (a Python sketch follows below).
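For step 4, queries can also be run directly from Python against the new module; a minimal sketch, with the parameter names taken from the get_query_data example later in this thread:

from datetime import date
from twitterscraper import query_js

# query_js drives headless Firefox in the background (see the description above)
tweets = query_js.get_query_data(
    query='foo bar',
    begindate=date(2020, 1, 1),
    enddate=date(2020, 1, 2),
    poolsize=5,
    lang='en',
)
print('retrieved %d tweets' % len(tweets))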

If you have any bugs, please paste your command and full output in this thread!

Improvements

Notes

Problems

webcoderz commented 3 years ago

https://stackoverflow.com/questions/34222412/load-chrome-extension-using-selenium Something like that should work, using Selenium natively to do it.

lapp0 commented 3 years ago

@webcoderz thanks for researching!

lapp0 commented 3 years ago

I just pushed a commit with a lot of improvements:

  • proxies now work with selenium-wire
  • we find the sqrt(N) fastest proxies and use them (see the sketch below). Scraping is observably faster.
  • code is refactored, cleaned up, and has better error handling / retrying parameters. Is less prone to failure.

Remaining work:

  • [ ] the main problem: rate limiting by twitter. Perhaps we need a more extensive proxy list? Perhaps different browser profiles might help?
  • [ ] write more test cases
  • [ ] use a larger proxy list
  • [ ] cache the proxy priority list so we aren't speedtesting 100s of proxies each run
  • [ ] handle misc edge cases
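A rough sketch of the sqrt(N) proxy-selection idea (illustrative only; rank_proxies and the timing approach are stand-ins for whatever query_js.py actually does):

import math
import time
import requests

def rank_proxies(proxies, test_url='https://twitter.com', timeout=10):
    # time a request through each proxy and keep the sqrt(N) fastest
    timings = []
    for proxy in proxies:
        start = time.time()
        try:
            requests.head(test_url, proxies={'http': proxy, 'https': proxy}, timeout=timeout)
        except requests.RequestException:
            continue  # unusable proxy, skip it
        timings.append((time.time() - start, proxy))
    timings.sort()
    keep = max(1, int(math.sqrt(len(proxies))))
    return [proxy for _, proxy in timings[:keep]]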

zhicheng0501 commented 3 years ago

I just pushed a commit with a lot of improvements

  • proxies now work with selenium-wire
  • we find the sqrt(N) fastest proxies and use them. Scraping is observably faster.
  • code is refactored, cleaned up, and has better error handling / retrying parameters. Is less prone to failure.

Remaining work:

  • [ ] the main problem: rate limiting by twitter. Perhaps we need a more extensive proxy list? Perhaps different browser profiles might help?
  • [ ] write more test cases
  • [ ] use a larger proxy list
  • [ ] cache the proxy priority list so we aren't speedtesting 100s of proxies each run
  • [ ] handle misc edge cases

You are so great! Hope you will fix the problem and make twitterscraper run more smoothly soon.

lapp0 commented 3 years ago

Thanks! Your work on docker has been great too!

smuotoe commented 3 years ago

Exception Message: Timeout loading page after 10000ms while requesting "https://twitter.com/search?f=live&vertical=default&q=realDonaldTrump since:2020-01-31 until:2020-02-01"
Traceback (most recent call last):
  File "C:\Users\tempSomto\OneDrive\twitterscraper-lappo\query_js.py", line 64, in retrieve_twitter_response_data
    driver.get(url)
  File "C:\Users\tempSomto\OneDrive\twitterscraper-lappo\venv\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 333, in get
    self.execute(Command.GET, {'url': url})
  File "C:\Users\tempSomto\OneDrive\twitterscraper-lappo\venv\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "C:\Users\tempSomto\OneDrive\twitterscraper-lappo\venv\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message: Timeout loading page after 10000ms

Hi @lapp0 great work on this thus far. I tested the latest commit and it seems there is something wrong (see above error log).

Also, I noticed the TimeoutException that was imported in query_js was not used in the file.

lapp0 commented 3 years ago

@smuotoe ya, I set a strict timeout because some proxies are pretty slow, so we want to restart with a new proxy if it takes too long loading twitter. I changed it to 30 seconds though.

Probably should have some logic to catch that specific error so the log is cleaner though.
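Something along these lines would keep the log clean (a sketch only; the helper and retry policy are illustrative, not what's in query_js.py):

from selenium.common.exceptions import TimeoutException

def get_with_retries(driver, url, retries=3):
    # retry a slow page load instead of letting the traceback bubble up
    for attempt in range(retries):
        try:
            driver.set_page_load_timeout(30)
            driver.get(url)
            return driver.page_source
        except TimeoutException:
            # proxy too slow or twitter throttling; try again, ideally with a new proxy
            continue
    raise TimeoutException('page did not load after %d attempts: %s' % (retries, url))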

webcoderz commented 3 years ago

What's the current status of the examples, @lapp0? I haven't been able to get anything back with the example in the unit test.

LinqLover commented 3 years ago

Hi, any updates on this PR? Can it already be used in production, or is the rate-limit problem too severe?

I just did some research on proxy lists and I think this package sounds quite promising: https://pypi.org/project/proxyscrape/ Usage:

import proxyscrape

def get_proxies():
    collector = proxyscrape.create_collector('default', 'http')
    return collector.get_proxies()

(It returns about 1000 proxies for me.) Could this help?

edmangog commented 3 years ago

Hi, I am new to this. I tried to follow the comments in this branch, but I still get 0 tweets. Code:

from twitterscraper import query_js
import datetime as dt
import pandas as pd

if __name__ == '__main__':
    tweets = query_js.get_query_data(begindate=dt.date(2019, 11, 11), enddate=dt.date(2019, 11, 12), poolsize=5, lang='en',
                            query='foo bar')
    df = pd.DataFrame(t.__dict__ for t in tweets)
    print(tweets)
    print(df)

console.txt

In the console, three errors occurred during the run:
1. TypeError: the JSON object must be str, bytes or bytearray, not list
2. selenium.common.exceptions.TimeoutException: Message: Timeout loading page after 30000ms
3. ConnectionAbortedError: [WinError 10053] An established connection was aborted by the software in your host machine

What can I do about this?

LinqLover commented 3 years ago

I'm having the same problem as @edmangog:

>>> import twitterscraper as ts
>>> ts.query_tweets('obama', limit=10)
[]

If I enable logging, I see lots of INFOs about "Got 0 tweets" but no warnings or errors.

Anyone having an idea why? Did they just ban Selenium bots, too? :-(

lapp0 commented 3 years ago

Sorry for the delayed response, I've been quite busy with professional work lately.

Errors

@edmangog Your error is likely due to twitter throttling and/or proxy slowness. It tried for 30 seconds to get tweets and failed, resulting in a cascade of additional errors.

@LinqLover Your error is because you're using the old interface. Thanks for the link though, the proxy lists linked in the documentation for that project will be worth experimenting with.

@webcoderz Yes, unfortunately twitter's rate limiting appears to be breaking the test.

Core Rate Limiting Problem

Some recent experimentation has indicated two things:

Since we're all using the same proxy list, our bandwidth is collectively limited. I can retrieve significantly more tweets without being throttled on my local IP than with the shared proxy list; however, I am eventually throttled locally as well.

Further experimentation may find ways to stretch the usability of these proxies. I'm not even sure what the exact rate-limits are per IP, and knowing that will be valuable. Regardless, a single proxy will always hit its limits, as will a collection of proxies used by a collection of users.

Solution and Implementation

I think the only solution here is to use "personal" proxy servers. This would practically make paid cloud services a requirement for twitterscraper, which may be a necessary evil.

As I mentioned, I am quite busy with my professional work, but I will dedicate some time in late November to stabilizing this branch and making it compatible with a "custom proxy list". Additionally, I will need to write instructions for ad-hoc proxy generation.
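One possible shape for the "custom proxy list" support (purely a sketch; the file name and helper functions below are hypothetical, not an interface that exists on this branch):

def load_custom_proxies(path='proxies.txt'):
    # 'proxies.txt' is a hypothetical user-supplied file, one proxy per line,
    # e.g. http://user:pass@1.2.3.4:8080
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

def seleniumwire_options_for(proxy):
    # selenium-wire accepts a proxy mapping of this shape in its options dict
    return {'proxy': {'http': proxy, 'https': proxy, 'no_proxy': 'localhost,127.0.0.1'}}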

Thanks for your patience.

webcoderz commented 3 years ago

Try a Tor proxy. We fixed twint's Tor proxy and the latency isn't too bad! https://github.com/twintproject/twint/issues/913 has everything for reference if you choose to go that way.

lapp0 commented 3 years ago

thanks for the reference @webcoderz

Could you clarify the difference between these two projects? Is there some feature in twitterscraper not present in twint?

webcoderz commented 3 years ago

It's set up a little differently, but it seems Twitter can somewhat detect twint, since the last couple of UI changes completely broke it, whereas full browser scraping can't really be detected because, if done correctly, it's indiscernible from actual traffic. (At least that's what I think, anyway.)

lapp0 commented 3 years ago

Hello, sorry I haven't updated this in a while.

I've tried to make this work, but unfortunately the only workable solution I've found is with a large number of unused proxies. If someone knows of a way to generate a large number of proxies cheaply, perhaps some kind of proxy as a service, please let me know.

zhicheng0501 commented 3 years ago

How about twitterscraper only handling the data grabbing, and users buying proxies themselves? Would that be more reliable than free proxies?


lapp0 commented 3 years ago

Right, twitterscraper should absolutely be agnostic to the source of the proxies. It's just that I'm not aware of a service I could use to test it out.

webcoderz commented 3 years ago

Try it with Tor. I'm not sure about the latency aspect, but Tor is surefire to work.


lapp0 commented 3 years ago

Unfortunately, with real-browser scraping, Tor is extremely slow (30 seconds to load a new chunk of tweets vs. 2 seconds for a proxy). This may be alleviated by blocking certain twitter requests, though.

Either way (proxy or Tor), a Firefox profile which blocks unnecessary twitter requests would be helpful.
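For reference, a Firefox profile that routes through a local Tor SOCKS proxy and skips image loading looks roughly like this (a sketch; the preferences are standard Firefox settings rather than anything specific to this branch):

from selenium import webdriver
from selenium.webdriver.firefox.options import Options

profile = webdriver.FirefoxProfile()
# route all traffic through the local Tor SOCKS proxy (default port 9050)
profile.set_preference('network.proxy.type', 1)
profile.set_preference('network.proxy.socks', '127.0.0.1')
profile.set_preference('network.proxy.socks_port', 9050)
profile.set_preference('network.proxy.socks_remote_dns', True)
# skip images to cut down on the slow Tor round trips
profile.set_preference('permissions.default.image', 2)

options = Options()
options.headless = True
driver = webdriver.Firefox(firefox_profile=profile, options=options)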