scrapy-plugins / scrapy-splash

Scrapy+Splash for JavaScript integration
BSD 3-Clause "New" or "Revised" License

Can't use Scrapy Splash with Firewall and Crawlera on #151

Closed vionemc closed 4 years ago

vionemc commented 6 years ago

So previously my server got hacked because I left the Splash port open to the public. I now use a firewall on my server, and scrapy-splash stopped working with this error:


2017-11-22 22:12:45 [scrapy.core.scraper] ERROR: Error downloading <GET https://example.com via http://52.230.25.109:8050/execute>
Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/twisted/internet/defer.py", line 1299, in _inlineCallbacks
    result = g.send(result)
  File "/usr/lib/python2.7/site-packages/scrapy/core/downloader/middleware.py", line 53, in process_response
    spider=spider)
  File "/usr/lib/python2.7/site-packages/scrapy_splash/middleware.py", line 387, in process_response
    response = self._change_response_class(request, response)
  File "/usr/lib/python2.7/site-packages/scrapy_splash/middleware.py", line 402, in _change_response_class
    response = response.replace(cls=respcls, request=request)
  File "/usr/lib/python2.7/site-packages/scrapy/http/response/text.py", line 50, in replace
    return Response.replace(self, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/scrapy/http/response/__init__.py", line 79, in replace
    return cls(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/scrapy_splash/response.py", line 33, in __init__
    super(_SplashResponseMixin, self).__init__(url, *args, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'encoding'

I also use Crawlera, so I think that's one factor. But do I really need Crawlera? How can I keep my crawler working while still having enough security? Thanks
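One way to keep Splash usable while firewalled, assuming Splash listens on its default port 8050 and the spider runs on the same machine, is to allow only local connections to that port. An illustrative iptables sketch (not from the thread; adapt to whatever firewall tooling you use):

```shell
# Accept Splash traffic only from the machine itself (Splash's default port is 8050)
iptables -A INPUT -p tcp --dport 8050 -s 127.0.0.1 -j ACCEPT
# Drop everything else aimed at the Splash port
iptables -A INPUT -p tcp --dport 8050 -j DROP
```

The spider's `SPLASH_URL` setting would then point at `http://127.0.0.1:8050` instead of the public address.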

vionemc commented 6 years ago

Sorry, it was because of the proxy; that's why.

vionemc commented 6 years ago

Hmmm, on second thought, I'd also change the title a little bit.

vionemc commented 6 years ago

I have implemented this code: https://github.com/scrapinghub/sample-projects/tree/master/splash_crawlera_example

Even though I have integrated Scrapy Splash and Crawlera, it still can't get past the firewall I set up on my server for security reasons. I want my server to accept requests only from itself, even though the crawler uses multiple IPs from Crawlera.
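For reference, the linked sample project works by having Splash itself route its outgoing requests through Crawlera via a Lua script, so the Splash port never needs to be reachable from Crawlera's IPs — only the spider talks to Splash, locally. A minimal sketch of that idea (the Lua body, the `crawlera_splash_args` helper, and the `proxy.crawlera.com:8010` values are illustrative, adapted from the linked example, not from this thread):

```python
# Sketch: Splash runs the Lua script and proxies every request it makes
# through Crawlera, so port 8050 can stay firewalled to localhost.

LUA_CRAWLERA = """
function main(splash)
    -- route every request Splash makes through the Crawlera proxy
    splash:on_request(function(request)
        request:set_proxy{
            host = "proxy.crawlera.com",
            port = 8010,
            username = splash.args.crawlera_user,
            password = "",
        }
    end)
    splash:go(splash.args.url)
    return splash:html()
end
"""

def crawlera_splash_args(apikey, url):
    """Build the args dict for a SplashRequest (hypothetical helper)."""
    return {
        "lua_source": LUA_CRAWLERA,
        "crawlera_user": apikey,
        "url": url,
    }
```

A spider would then yield something like `SplashRequest(url, self.parse, endpoint='execute', args=crawlera_splash_args(APIKEY, url))`.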

vionemc commented 6 years ago

#73 can be an alternative solution to using a firewall.

vionemc commented 6 years ago

https://splash.readthedocs.io/en/latest/api.html#proxy-profiles

This can be a solution
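Per those docs, Splash can load proxy settings from an `.ini` profile in a directory passed via `--proxy-profiles-path`, and a request then selects the profile by filename with the `proxy` argument. A sketch of such a profile (the host, port, and username values are placeholders, not from this thread):

```ini
[proxy]
; required
host=proxy.crawlera.com
port=8010
; optional
username=<your Crawlera API key>
password=
type=HTTP
```

With this saved as e.g. `crawlera.ini` in the profiles directory, a Splash request would add `proxy=crawlera` to its arguments.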

Gallaecio commented 5 years ago

@vionemc If that solves it, could you close this issue?