Open hustshawn opened 7 years ago
I think it could be related to the dupefilter used by crawling.distributed_scheduler.DistributedScheduler - this dupefilter uses the request_fingerprint function, which doesn't work correctly for Splash requests. The default dupefilter doesn't take request.meta values into account, while requests to Splash may differ only in request.meta until they are fixed by a downloader middleware.
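To illustrate the point above, here is a minimal sketch (not Scrapy's actual implementation) of why a fingerprint built only from method, URL, and body collapses two Splash requests that differ only in meta:

```python
import hashlib

def url_only_fingerprint(method, url, body=b""):
    """Simplified stand-in for a dupefilter fingerprint that ignores meta."""
    h = hashlib.sha1()
    for part in (method.encode(), url.encode(), body):
        h.update(part)
    return h.hexdigest()

# Two "requests" to the same URL, differing only in Splash args kept in meta:
meta_a = {"splash": {"args": {"wait": 0.5}}}
meta_b = {"splash": {"args": {"wait": 5.0}}}

fp_a = url_only_fingerprint("GET", "https://example.com")
fp_b = url_only_fingerprint("GET", "https://example.com")

# meta never entered the hash, so both fingerprints are identical and the
# second request is dropped as a duplicate.
print(fp_a == fp_b)  # True
```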
Facing the same issue.
See also: https://github.com/istresearch/scrapy-cluster/issues/94. I'm not sure how it can be solved in scrapy-splash itself.
so the scrapy-splash can't work with scrapy-cluster now?
Yes, it can't. Currently one has to fork & fix scrapy-cluster to make them work together. An alternative way is to use the Splash HTTP API directly, as shown at https://github.com/scrapy-plugins/scrapy-splash#why-not-use-the-splash-http-api-directly; I'm not completely sure, but it would likely work with scrapy-cluster.
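"Using the Splash HTTP API directly" means building the request to Splash's own endpoint yourself instead of going through the scrapy-splash middleware. A minimal sketch, assuming Splash runs at localhost:8050 (no request is actually sent here):

```python
from urllib.parse import urlencode, urljoin

SPLASH_BASE = "http://localhost:8050"  # assumed Splash location

def splash_render_url(target_url, wait=0.5):
    """Build a render.html GET URL for the Splash HTTP API."""
    qs = urlencode({"url": target_url, "wait": wait})
    return urljoin(SPLASH_BASE, "/render.html") + "?" + qs

print(splash_render_url("https://example.com"))
# http://localhost:8050/render.html?url=https%3A%2F%2Fexample.com&wait=0.5
```

A spider would then yield a plain scrapy.Request to this URL, bypassing the scrapy-splash middleware (and hence its dupefilter problems) entirely.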
Thanks to @kmike
Do you happen to know where the problem is?
@wenxzhen I'm not a scrapy-cluster user myself, but the results of a brief look are in this comment: https://github.com/scrapy-plugins/scrapy-splash/issues/101#issuecomment-274729809
Thanks to @kmike. After some investigation, I found that Python doesn't make it easy to serialize and deserialize class instances. Therefore, I turned to another way:
Now it works
@wenxzhen Could you please share some core code with us or send a PR to this repo?
@hustshawn the basic idea is to not use the scrapy-splash machinery at all, but to make use of the functionality of scrapy-cluster + scrapy.
The following is mainly a PoC, without optimization.
python kafka_monitor.py feed '{"url": "https://www.test.com", "appid":"testapp", "crawlid":"09876abc", "useragent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36", "attrs": {"splash": "1"}, "spiderid": "test"}'
the "splash": "1" attribute tells the crawler that the request needs to go directly to Splash via the HTTP API
splash_meta = request.meta[self.splash_meta_name]
args = splash_meta.setdefault('args', {})
splash_url = urljoin(self.splash_base_url, self.default_endpoint)
args.setdefault('splash_url', splash_url)
# only the POST API to Splash is supported for now
args.setdefault('http_method', 'POST')
body = json.dumps({"url": request.meta['url'], "wait": 5, "timeout": 10}, sort_keys=True)
args.setdefault('body', body)
headers = Headers({'Content-Type': 'application/json'})
args.setdefault('headers', headers)
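The snippet above fills request.meta['splash']['args'] in place inside a middleware. A self-contained sketch of the same defaulting logic (the values of splash_base_url and default_endpoint are assumptions mirroring the snippet, and a plain dict stands in for the Scrapy Headers object):

```python
import json
from urllib.parse import urljoin

splash_base_url = "http://localhost:8050"   # assumed Splash location
default_endpoint = "render.html"            # assumed endpoint

# meta as fed into the crawler by scrapy-cluster
request_meta = {"url": "https://www.test.com", "splash": {}}

splash_meta = request_meta["splash"]
args = splash_meta.setdefault("args", {})
args.setdefault("splash_url", urljoin(splash_base_url, default_endpoint))
args.setdefault("http_method", "POST")      # only the POST API is handled here
args.setdefault("body", json.dumps(
    {"url": request_meta["url"], "wait": 5, "timeout": 10}, sort_keys=True))
args.setdefault("headers", {"Content-Type": "application/json"})

print(args["splash_url"])  # http://localhost:8050/render.html
```

Using setdefault throughout means values already present in meta (e.g. set per-request by a spider) win over these defaults.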
def download_request(self, request, spider):
    """Return a deferred for the HTTP download."""
    agent = ScrapyAgent(
        contextFactory=self._contextFactory, pool=self._pool,
        maxsize=getattr(spider, 'download_maxsize', self._default_maxsize),
        warnsize=getattr(spider, 'download_warnsize', self._default_warnsize))
    if "splash" in request.meta:
        # this request should be forwarded to Splash
        splash_args = request.meta['splash']['args']
        new_splash_request = request.replace(
            url=splash_args['splash_url'],
            method=splash_args['http_method'],
            body=splash_args['body'],
            headers=splash_args['headers'],
            priority=request.priority,
        )
        return agent.download_request(new_splash_request)
    else:
        return agent.download_request(request)
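The effect of that request.replace() call can be shown with a plain-dict stand-in (in Scrapy, Request.replace returns a copy with the given fields overridden; the dict version below is only illustrative):

```python
import json

def forward_to_splash(request):
    """Rewrite a request (modeled as a dict) so it targets Splash instead
    of the original site, mirroring the request.replace() call above."""
    if "splash" not in request.get("meta", {}):
        return request  # plain request: download as-is
    args = request["meta"]["splash"]["args"]
    forwarded = dict(request)
    forwarded.update(
        url=args["splash_url"],
        method=args["http_method"],
        body=args["body"],
        headers=args["headers"],
    )
    return forwarded

req = {
    "url": "https://www.test.com",
    "method": "GET",
    "priority": 0,
    "meta": {"splash": {"args": {
        "splash_url": "http://localhost:8050/render.html",
        "http_method": "POST",
        "body": json.dumps({"url": "https://www.test.com"}),
        "headers": {"Content-Type": "application/json"},
    }}},
}
out = forward_to_splash(req)
print(out["url"], out["method"])  # http://localhost:8050/render.html POST
```

Note that the original target URL survives inside the POST body, which is how Splash learns which page to render.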
Got your idea. Thanks a lot. @wenxzhen
Could you please submit a PR with this code? Parsing JS is a really useful feature.
We need to ask @kmike whether the 'basic' solution is acceptable or not. If yes, we can start the PR work.
@wenxzhen did you create a download_handler middleware to implement your solution or did you modify the HTTP11DownloadHandler directly?
I needed to do both, as I also need to bypass the proxy for Splash.
@wenxzhen did you solve it? I also need proxy and Splash together.
@FinnFrotscher check the code snippets above, hope it can help.
It seems like https://github.com/scrapy/scrapy/issues/900 could be a good first step towards fixing this.
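Until customizable request fingerprints land, a Splash-aware fingerprint has to fold the relevant meta into the hash itself. A minimal stdlib sketch of the idea (not the actual code of scrapy_splash.SplashAwareDupeFilter):

```python
import hashlib
import json

def splash_aware_fingerprint(method, url, body=b"", splash_args=None):
    """Fingerprint that also hashes the Splash args kept in request.meta."""
    h = hashlib.sha1()
    for part in (method.encode(), url.encode(), body):
        h.update(part)
    if splash_args:
        # sort_keys makes the serialization (and thus the hash) deterministic
        h.update(json.dumps(splash_args, sort_keys=True).encode())
    return h.hexdigest()

fast = splash_aware_fingerprint("GET", "https://example.com",
                                splash_args={"wait": 0.5})
slow = splash_aware_fingerprint("GET", "https://example.com",
                                splash_args={"wait": 5.0})
print(fast != slow)  # True: no longer treated as duplicates
```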
In a single-node scrapy project, the settings below, as your documentation indicates, work well. But if I integrate with scrapy-cluster using those settings, a request made with SplashRequest may not successfully reach splash, so splash will not respond. splash itself works fine when I access it directly with a URL constructed from the render.html endpoint. Does anyone know what's going wrong with the settings?
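For reference, the single-node settings documented in the scrapy-splash README look like the following (SPLASH_URL assumed to point at a local instance):

```python
# settings.py — standard scrapy-splash setup from the README
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
```

The DUPEFILTER_CLASS line is the crux of this thread: scrapy-cluster replaces the scheduler (and its dupefilter) with its own, so this setting is effectively ignored there.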