dijadev opened this issue 7 years ago
I also have this issue.
NORMAL REQUEST - it follows the rules with follow=True:
yield scrapy.Request(url, callback=self.parse, dont_filter=True, errback=self.errback_httpbin)
USING SPLASH - it only visits the first URL:
yield scrapy.Request(url, callback=self.parse, dont_filter=True, errback=self.errback_httpbin, meta={'splash': {'endpoint': 'render.html', 'args': {'wait': 0.5}}})
Has someone found a solution?
I have not, unfortunately.
I have the same problem, any solution?
Negative.
+1 over here. Encountering the same issue as described by @wattsin.
I also got the same issue here today and found that CrawlSpider does a response type check in the _requests_to_follow function:
def _requests_to_follow(self, response):
if not isinstance(response, HtmlResponse):
return
...
However, the responses generated by Splash are SplashTextResponse or SplashJsonResponse, so that check means Splash responses never produce any requests to follow.
+1
+1
@dwj1324
I tried to debug my spider with PyCharm and set a breakpoint at if not isinstance(response, HtmlResponse):. That code was never reached when SplashRequest was used instead of scrapy.Request.
What worked for me is to add this to the callback parsing function:
def parse_item(self, response):
"""Parse response into item also create new requests."""
page = RescrapItem()
...
yield page
if isinstance(response, (HtmlResponse, SplashTextResponse)):
seen = set()
for n, rule in enumerate(self._rules):
links = [lnk for lnk in rule.link_extractor.extract_links(response)
if lnk not in seen]
if links and rule.process_links:
links = rule.process_links(links)
for link in links:
seen.add(link)
r = SplashRequest(url=link.url, callback=self._response_downloaded,
args=SPLASH_RENDER_ARGS)
r.meta.update(rule=rule, link_text=link.text)
yield rule.process_request(r)
+1, any update for this issue?
@hieu-n I used the code you pasted here and changed the SplashRequest to a plain Request since I need to use headers, but it doesn't work; the spider still only crawls the first-depth content. Any suggestion will be appreciated.
@NingLu I haven't touched Scrapy for a while. In your case, what I would do is set a few breakpoints and step through your code and Scrapy's code. Good luck!
+1 any updates here?
Hello everyone! As @dwj1324 said, CrawlSpider does a response type check in the _requests_to_follow function, so I've just overridden that function to avoid discarding SplashJsonResponse(s).
Hope this helps!
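A minimal sketch of such an override, mirroring the versions posted later in this thread (class and attribute names here are placeholders; on Scrapy >= 1.7 the last line would also need to pass response, as discussed further down):
from scrapy.http import HtmlResponse
from scrapy.spiders import CrawlSpider
from scrapy_splash import SplashJsonResponse, SplashTextResponse

class MySplashCrawlSpider(CrawlSpider):
    name = 'splash_crawl'  # placeholder
    # rules, start_urls, etc. go here

    def _requests_to_follow(self, response):
        # Accept Splash responses in addition to plain HtmlResponse
        if not isinstance(response, (HtmlResponse, SplashJsonResponse, SplashTextResponse)):
            return
        seen = set()
        for n, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(response)
                     if lnk not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                r = self._build_request(n, link)
                yield rule.process_request(r)  # Scrapy >= 1.7: rule.process_request(r, response)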
Having the same issue. I have overridden _requests_to_follow as stated by @dwj1324 and @dijadev.
As soon as I start using splash by adding the following code to my spider:
def start_requests(self):
for url in self.start_urls:
print('->', url)
yield SplashRequest(url, self.parse_item, args={'wait': 0.5})
it does not call _requests_to_follow anymore. Scrapy follows links again when I comment that function out.
Hi, I have found a workaround which works for me:
Instead of using a scrapy request:
yield scrapy.Request(page_url, self.parse_page)
simply prefix the URL with the Splash render.html endpoint:
yield scrapy.Request("http://localhost:8050/render.html?url=" + page_url, self.parse_page)
The localhost port may depend on how you built the Splash Docker container.
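If you go this route, it is safer to percent-encode the page URL before splicing it into the render.html query string, otherwise any query string in the target URL gets mixed up with Splash's own parameters. A small sketch, assuming Splash listens on localhost:8050:
from urllib.parse import quote

SPLASH_RENDER = 'http://localhost:8050/render.html?url='

def splash_url(page_url):
    # Percent-encode the target URL so its query string and fragment survive
    return SPLASH_RENDER + quote(page_url, safe='')

# usage inside a spider callback:
#   yield scrapy.Request(splash_url(page_url), self.parse_page)
Note that with this raw approach, response.url in the callback is the Splash endpoint rather than the original page URL, which is one reason the scrapy_splash middleware is usually preferable.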
@VictorXunS this is not working for me, could you share all your CrawlSpider code?
Also had problems combining CrawlSpider with SplashRequests and Crawlera. Overriding the _requests_to_follow function by removing the whole isinstance condition worked for me. Thanks @dijadev and @hieu-n for the suggestions.
def _requests_to_follow(self, response):
seen = set()
for n, rule in enumerate(self._rules):
links = [lnk for lnk in rule.link_extractor.extract_links(response) if lnk not in seen]
if links and rule.process_links:
links = rule.process_links(links)
for link in links:
seen.add(link)
r = self._build_request(n, link)
yield rule.process_request(r)
def _build_request(self, rule, link):
r = Request(url=link.url, callback=self._response_downloaded)
r.meta.update(rule=rule, link_text=link.text)
return r
I am no expert, but Scrapy has its own duplicate filter, doesn't it? (You are filtering with "if lnk not in seen".) See http://doc.scrapy.org/en/latest/topics/link-extractors.html (class scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor).
Hi @Nick-Verdegem, thank you for sharing. My CrawlSpider is still not working with your solution; do you use start_requests?
So I encountered this issue and solved it by overriding the type check as suggested:
def _requests_to_follow(self, response):
if not isinstance(response, (HtmlResponse, SplashTextResponse)):
return
....
But you also have to avoid using SplashRequest in your process_request method to create the new Splash requests; just add splash to your scrapy.Request's meta instead. The scrapy.Request returned by _requests_to_follow carries attributes in its meta (such as the index of the rule that generated it) that CrawlSpider relies on for its logic, so you don't want to build a completely different request with SplashRequest in your request wrapper. Just add splash to the already-built request, like so:
def use_splash(self, request):
request.meta.update(splash={
'args': {
'wait': 1,
},
'endpoint': 'render.html',
})
return request
and add it to your Rule:
process_request="use_splash"
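For example, wired into a Rule (the restrict_xpaths pattern below is just a placeholder):
rules = (
    Rule(LinkExtractor(restrict_xpaths='//div[@class="pagination"]//a'),  # placeholder selector
         callback='parse_item',
         process_request='use_splash',
         follow=True),
)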
_requests_to_follow will apply process_request to every built request. That's what worked for my CrawlSpider.
Hope that helps!
I use scrapy-splash together with scrapy-redis. A RedisCrawlSpider can work if you rewrite the following:
def start_requests(self):
for url in self.start_urls:
yield SplashRequest(url=url, callback=self.parse_m, endpoint='execute', dont_filter=True, args={
'url': url, 'wait': 5, 'lua_source': default_script
})
def _requests_to_follow(self, response):
if not isinstance(response, (HtmlResponse, SplashJsonResponse, SplashTextResponse)):
return
seen = set()
for n, rule in enumerate(self._rules):
links = [lnk for lnk in rule.link_extractor.extract_links(response)
if lnk not in seen]
if links and rule.process_links:
links = rule.process_links(links)
for link in links:
seen.add(link)
r = self._build_request(n, link)
yield rule.process_request(r)
def _build_request(self, rule, link):
# parameter 'meta' is required !!!!!
r = SplashRequest(url=link.url, callback=self._response_downloaded, meta={'rule': rule, 'link_text': link.text},
args={'wait': 5, 'url': link.url, 'lua_source': default_script})
# Maybe you can delete it here.
r.meta.update(rule=rule, link_text=link.text)
return r
Adjust the parameters to your own needs.
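For context, this snippet assumes surroundings roughly like the following; the class name, redis_key and default_script shown here are placeholders, not taken from the original comment:
from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from scrapy_redis.spiders import RedisCrawlSpider
from scrapy_splash import SplashRequest, SplashJsonResponse, SplashTextResponse

# A minimal Lua script standing in for the author's default_script
default_script = """
function main(splash, args)
    assert(splash:go(args.url))
    assert(splash:wait(args.wait))
    return splash:html()
end
"""

class MyRedisSplashSpider(RedisCrawlSpider):
    name = 'redis_splash_spider'                  # placeholder
    redis_key = 'redis_splash_spider:start_urls'  # URLs are pushed to this Redis list
    rules = (
        Rule(LinkExtractor(), callback='parse_m', follow=True),
    )
    # start_requests, _requests_to_follow and _build_request as in the comment above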
@MontaLabidi Your solution worked for me.
This is how my code looks:
from scrapy.http import HtmlResponse
from scrapy.linkextractors import LxmlLinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy_splash import SplashJsonResponse, SplashTextResponse

class MySuperCrawler(CrawlSpider):
name = 'mysupercrawler'
allowed_domains = ['example.com']
start_urls = ['https://www.example.com']
rules = (
Rule(LxmlLinkExtractor(
restrict_xpaths='//div/a'),
follow=True
),
Rule(LxmlLinkExtractor(
restrict_xpaths='//div[@class="pages"]/li/a'),
process_request="use_splash",
follow=True
),
Rule(LxmlLinkExtractor(
restrict_xpaths='//a[@class="product"]'),
callback='parse_item',
process_request="use_splash"
)
)
def _requests_to_follow(self, response):
if not isinstance(
response,
(HtmlResponse, SplashJsonResponse, SplashTextResponse)):
return
seen = set()
for n, rule in enumerate(self._rules):
links = [lnk for lnk in rule.link_extractor.extract_links(response)
if lnk not in seen]
if links and rule.process_links:
links = rule.process_links(links)
for link in links:
seen.add(link)
r = self._build_request(n, link)
yield rule.process_request(r)
def use_splash(self, request):
request.meta.update(splash={
'args': {
'wait': 1,
},
'endpoint': 'render.html',
})
return request
def parse_item(self, response):
pass
This works perfectly for me.
@sp-philippe-oger could you please show the whole file? In my case the crawl spider won't call the redefined _requests_to_follow and as a consequence still stops after the first page...
@digitaldust pretty much the whole code is there. Not sure what is missing for you to make it work.
@sp-philippe-oger don't worry, I actually realized my problem is with the LinkExtractor, not the scrapy/splash combo... thanks!
Anyone get this to work while running a Lua script for each pagination?
@nciefeiniu Hi... would you please give more information about integrating scrapy-redis with Splash? I mean, how do you send your URLs from Redis to Splash?
@sp-philippe-oger I use Python 3, but there's an error: _identity_process_request() missing 1 required positional argument. Is there something wrong?
Since Scrapy 1.7.0, the process_request callback also receives a response parameter, so you need to change def use_splash(self, request): to def use_splash(self, request, response):
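Concretely, that means something along these lines; and if you have overridden _requests_to_follow yourself, pass the response through as well, as a later comment points out:
def use_splash(self, request, response):
    # Scrapy >= 1.7 calls process_request(request, response)
    request.meta['splash'] = {
        'endpoint': 'render.html',
        'args': {'wait': 1},
    }
    return request

# ...and in an overridden _requests_to_follow, yield with both arguments:
#     yield rule.process_request(r, response)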
If someone runs into the same problem of needing to use Splash in a CrawlSpider (with Rule and LinkExtractor) BOTH for parse_item and the initial start_requests, e.g. to bypass Cloudflare, here is my solution:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy_splash import SplashRequest, SplashJsonResponse, SplashTextResponse
from scrapy.http import HtmlResponse
class Abc(scrapy.Item):
name = scrapy.Field()
class AbcSpider(CrawlSpider):
name = "abc"
allowed_domains = ['abc.com']
start_urls = ['https://www.abc.com/xyz']
rules = (Rule(LinkExtractor(restrict_xpaths='//h2[@class="abc"]'), callback='parse_item', process_request="use_splash"),)  # trailing comma so rules is a tuple
def start_requests(self):
for url in self.start_urls:
yield SplashRequest(url, args={'wait': 15}, meta={'real_url': url})
def use_splash(self, request):
request.meta['splash'] = {
'endpoint':'render.html',
'args':{
'wait': 15,
}
}
return request
def _requests_to_follow(self, response):
if not isinstance(
response,
(HtmlResponse, SplashJsonResponse, SplashTextResponse)):
return
seen = set()
for n, rule in enumerate(self._rules):
links = [lnk for lnk in rule.link_extractor.extract_links(response)
if lnk not in seen]
if links and rule.process_links:
links = rule.process_links(links)
for link in links:
seen.add(link)
r = self._build_request(n, link)
yield rule.process_request(r)
def parse_item(self, response):
item = Abc()
item['name'] = response.xpath('//div[@class="abc-name"]/h1/text()').get()
return item
It does not work; it throws an error: use_splash() is missing 1 required positional argument: 'response'.
@vishKurama Which Scrapy version are you using? Can you share a minimal, reproducible example?
I had this problem too. Just use yield rule.process_request(r, response) in the last line of the overridden _requests_to_follow method.
I am facing a similar problem and the solutions listed here aren't working for me, unless I've missed something!
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.http import HtmlResponse
from scrapy_splash import SplashRequest, SplashTextResponse, SplashJsonResponse
import logging
class MainSpider(CrawlSpider):
name = 'main'
allowed_domains = ['www.somesite.com']
script = '''
function main(splash, args)
splash.private_mode_enabled = false
my_user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'
headers = {
['User-Agent'] = my_user_agent,
['Accept-Language'] = 'en-GB,en-US;q=0.9,en;q=0.8',
['Referer'] = 'https://www.google.com'
}
splash:set_custom_headers(headers)
url = args.url
assert(splash:go(url))
assert(splash:wait(2))
-- username input
username_input = assert(splash:select('#username'))
username_input:focus()
username_input:send_text('myusername')
assert(splash:wait(0.3))
-- password input
password_input = assert(splash:select('#password'))
password_input:focus()
password_input:send_text('mysecurepass')
assert(splash:wait(0.3))
-- the login button
login_btn = assert(splash:select('#login_btn'))
login_btn:mouse_click()
assert(splash:wait(4))
return splash:html()
end
'''
rules = (
Rule(LinkExtractor(restrict_xpaths="(//div[@id='sidebar']/ul/li)[7]/a"), callback='parse_item', follow=True, process_request='use_splash'),
)
def start_requests(self):
yield SplashRequest(url = 'https://www.somesite.com/login', callback = self.post_login, endpoint = 'execute', args = {
'lua_source': self.script
})
def use_splash(self, request):
request.meta.update(splash={
'args': {
'wait': 1,
},
'endpoint': 'render.html',
})
return request
def _requests_to_follow(self, response):
if not isinstance(response, (HtmlResponse, SplashJsonResponse, SplashTextResponse)):
return
seen = set()
for n, rule in enumerate(self._rules):
links = [lnk for lnk in rule.link_extractor.extract_links(response) if lnk not in seen]
if links and rule.process_links:
links = rule.process_links(links)
for link in links:
seen.add(link)
r = self._build_request(n, link)
yield rule.process_request(r)
def post_login(self, response):
logging.info('hey from post login!')
with open('post_login_response.txt', 'w') as f:
f.write(response.text)
f.close()
def parse_item(self, response):
logging.info('hey from parse_item!')
with open('post_search_response.txt', 'w') as f:
f.write(response.text)
f.close()
The parse_item function is never hit: in the logs I never see hey from parse_item!, but I do see hey from post login!. I'm not sure what I'm missing.
Following is a working crawler for scraping https://books.toscrape.com, tested with Scrapy version 2.9.0. For installing and configuring Splash, follow the README.
import scrapy
from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy_splash import SplashRequest, SplashTextResponse, SplashJsonResponse
class FictionBookScrapper(CrawlSpider):
_WAIT = 0.1
name = "fiction_book_scrapper"
allowed_domains = ['books.toscrape.com']
start_urls = ["https://books.toscrape.com/catalogue/category/books_1/index.html"]
le_book_details = LinkExtractor(restrict_css=("h3 > a",))
rule_book_details = Rule(le_book_details, callback='parse_request', follow=False, process_request='use_splash')
le_next_page = LinkExtractor(restrict_css='.next > a')
rule_next_page = Rule(le_next_page, follow=True, process_request='use_splash')
rules = (
rule_book_details,
rule_next_page,
)
def start_requests(self):
for url in self.start_urls:
yield SplashRequest(url, args={'wait': self._WAIT}, meta={'real_url': url})
def use_splash(self, request, response):
request.meta['splash'] = {
'endpoint': 'render.html',
'args': {
'wait': self._WAIT
}
}
return request
def _requests_to_follow(self, response):
if not isinstance(response, (HtmlResponse, SplashTextResponse, SplashJsonResponse)):
return
seen = set()
for rule_index, rule in enumerate(self._rules):
links = [
lnk
for lnk in rule.link_extractor.extract_links(response)
if lnk not in seen
]
for link in rule.process_links(links):
seen.add(link)
request = self._build_request(rule_index, link)
yield rule.process_request(request, response)
def parse_request(self, response: scrapy.http.Response):
self.logger.info(f'Page status code = {response.status}, url= {response.url}')
yield {
'Title': response.css('h1 ::text').get(),
'Link': response.url,
'Description': response.xpath('//*[@id="content_inner"]/article/p/text()').get()
}
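For completeness, this spider relies on the scrapy-splash settings from the project README (adjust SPLASH_URL to wherever your Splash instance runs), roughly:
# settings.py (per the scrapy-splash README)
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'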
Hi!
I have integrated scrapy-splash into my CrawlSpider's process_request in the rules like this:
The problem is that the crawl renders only the URLs at the first depth. I also wonder how I can get the response even with a bad HTTP code or a redirected response.
Thanks in advance,