Closed · kmike closed this issue 4 years ago
Another option is `Request.from_response()`; relative urls are already supported by `FormRequest.from_response()`.
I can see why a spider middleware looks sweet, but it actually requires allowing requests to be created with relative urls and later replacing each request with its absolute-url counterpart. I prefer `Request.from_response()` backed by `response.urljoin()`.
Let me advocate adding direct support for relative urls :)
Most spiders (all?) work in the following fashion:
1) the spider makes some initial requests based on `start_requests` (all urls are absolute there);
2) the spider makes additional requests from `parse...` methods using urls extracted from the page being parsed.
Because the urls in (2) are most often extracted from the page being parsed, they most often should be handled as relative. Almost every spider needs this.
Sometimes relative urls are not handled because the developer assumes that the website provides absolute urls, so there is no motivation to handle relative ones. If that assumption is wrong (e.g. urls are absolute on some pages, but relative on others) this leads to an error.
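To illustrate why unconditionally joining is the safe habit: the stdlib join returns an already-absolute href unchanged and resolves a relative one against the base, so the "absolute on some pages, relative on others" case costs nothing extra. A minimal demonstration (the URLs are made up):

```python
# urljoin handles both cases, so joining every extracted href is safe.
from urllib.parse import urljoin  # the ``urlparse`` module on Python 2

base = 'http://example.com/shop/page2.html'

print(urljoin(base, 'item/1.html'))         # relative: resolved against base
print(urljoin(base, '/item/1.html'))        # root-relative: keeps only the host
print(urljoin(base, 'http://other.com/x'))  # absolute: returned unchanged
```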
a) If we fix #494, the existing way to write spiders would be the following:

```python
import scrapy
import urlparse

class MySpider(scrapy.Spider):
    # ...
    def parse(self, response):
        sel = scrapy.Selector(response)
        for href in sel.xpath('//a/@href').extract():
            url = urlparse.urljoin(response.url, href)
            yield scrapy.Request(url)
```
b) with your proposal:

```python
import scrapy

class MySpider(scrapy.Spider):
    # ...
    def parse(self, response):
        sel = scrapy.Selector(response)
        for href in sel.xpath('//a/@href').extract():
            yield scrapy.Request.from_response(response, href)
```
c) with direct support for relative urls:

```python
import scrapy

class MySpider(scrapy.Spider):
    # ...
    def parse(self, response):
        sel = scrapy.Selector(response)
        for href in sel.xpath('//a/@href').extract():
            yield scrapy.Request(href)
```
(b) and (c) are both improvements, but I like (c) more because this code needs to be written in almost all `parse...` methods, and we would be able to remove this boilerplate. Returning `Request` reads better and is a simpler "contract" for spider callbacks.
You're right that replacing the Request may introduce bugs - some code can break if it checks for request identity. Alternatively, we'd have to bring some mutability back (maybe internally). Also, we'd have to move url validation to a later stage. But to me it all looks like a reasonable price for a better API.
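A tiny sketch of the identity concern. The `Request` class here is a hand-rolled stand-in, not `scrapy.Request`: code keyed on object identity stops matching once a middleware swaps the request for an absolute-url copy.

```python
from urllib.parse import urljoin  # ``urlparse`` module on Python 2

class Request(object):
    """Stand-in for scrapy.Request, just enough for the demo."""
    def __init__(self, url):
        self.url = url
    def replace(self, url):
        # like scrapy.Request.replace(), returns a *new* object
        return Request(url)

seen = set()
r = Request('page2.html')
seen.add(id(r))

# a middleware swaps the request for its absolute-url counterpart
r2 = r.replace(url=urljoin('http://example.com/shop/', r.url))

print(r2.url)          # the absolutized url
print(id(r2) in seen)  # identity-based bookkeeping no longer matches
```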
There is another option: use the response as a request factory, like:

```python
import scrapy

class MySpider(scrapy.Spider):
    # ...
    def parse(self, response):
        sel = scrapy.Selector(response)
        for href in sel.xpath('//a/@href').extract():
            yield response.newrequest(href)
```
I dislike the idea of allowing relative urls in `request.url` and relying on a spider middleware to fix them at the correct position among all the middlewares enabled in a project; it's prone to mistakes and gotchas. Also, think about base spider classes used as generics that allow extending a few methods: callers would have to take into account that a returned request may not be absolute - should they absolutize it, or make do with the relative url? It would break current scrapy projects, as users will expect spider methods to handle relative urls everywhere. This is already a problem when a callback's return value is a single request vs a requests generator; it's common to find calls to `scrapy.utils.spider.iterate_spider_output` wrapping methods called from the spider itself.
Aside from that, requests can be injected skipping spider middlewares (`engine.crawl()` or `engine.download()`), or the AbsolutizeMiddleware can be disabled; in these cases the error is logged by the downloader instead of at instantiation time, which is far less useful from a debugging perspective IMHO.
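For concreteness, here is a sketch of what the `process_spider_output` hook of such an AbsolutizeMiddleware might look like. This is hypothetical: no such middleware ships with Scrapy, and `scrapy.Request` currently rejects relative urls at construction time, so minimal stand-in `Request` and response classes are used instead of the real ones.

```python
from urllib.parse import urljoin  # ``urlparse`` module on Python 2

class Request(object):
    """Stand-in for scrapy.Request (the real one rejects relative urls)."""
    def __init__(self, url):
        self.url = url
    def replace(self, url):
        return Request(url)

class AbsolutizeMiddleware(object):
    """Hypothetical spider middleware replacing relative-url requests."""
    def process_spider_output(self, response, result, spider):
        for r in result:
            if isinstance(r, Request) and '://' not in r.url:
                # swap the relative request for an absolute-url copy
                yield r.replace(url=urljoin(response.url, r.url))
            else:
                yield r

class FakeResponse(object):
    def __init__(self, url):
        self.url = url

mw = AbsolutizeMiddleware()
out = list(mw.process_spider_output(
    FakeResponse('http://example.com/a/b.html'),
    [Request('c.html'), Request('http://other.com/')],
    spider=None))
print([r.url for r in out])
```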
In #554 you propose to add methods to the response for selecting html/xml documents; if we follow that path it is also logical to integrate `FormRequest` somehow (i.e. `response.submitform()`), so what is wrong with building requests derived from a response from the response itself?
How about doing the opposite and not working with relative URLs in the spider? Where do we need relative URLs in the spider? Instead of handing off relative links and hoping some middleware will fix them, I'd prefer getting an absolute URL back from the selector.
So maybe the HtmlResponse could support rewriting links based on `<base />` and other logic, as a browser might when saving a 'webpage' locally.
Or the Selector might support something along the lines of `.extract_absolute_url()`, which would rewrite this URL or any URL inside the selected content, if that's feasible. Then I'd keep some control and could do it selectively.
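The browser-style resolution mentioned above can be sketched with two stdlib joins, assuming we only honour a `<base href>` declaration (the `absolutize` helper name is made up for illustration):

```python
from urllib.parse import urljoin  # ``urlparse`` module on Python 2

def absolutize(response_url, base_href, href):
    """Resolve href the way a browser would: against <base href> if
    the page declares one, otherwise against the response url."""
    # a <base href> may itself be relative to the response url
    base = urljoin(response_url, base_href) if base_href else response_url
    return urljoin(base, href)

print(absolutize('http://example.com/dir/page.html', None, 'x.html'))
print(absolutize('http://example.com/dir/page.html', '/root/', 'x.html'))
```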
But talking semantics, I'd take `Request.from_response()` over `response.newrequest()`, wherever I might get that request from in the first place.
Or make that `Request(url_or_response)`, resp. `FormRequest(url_or_response)`?
Maybe `self.request(response)` could also work instead of `scrapy.Request()`, because a request factory on the response looks somehow wrong to me. `response.request` is already the request that generated this response, so `response.newrequest` is confusing and "flowing" in the wrong direction, if that makes sense. I don't feel like responses should generate anything. Sorry dangra.
In general, I'm not so sure about this path (+#554). It's starting to look more like jQuery's way of chaining to hell and back. Then again, looking at jQuery's following, maybe that's a good thing.
> Instead of handing off relative links and hoping some middleware will fix it, I'd prefer getting an absolute URL back from the selector.

Although I don't think selectors deserve this task - mostly because link extractors (#559) are designed to deal with it instead - I will take a step toward this idea in the spirit of #494 and #554 and combine link extractors into the selectors API like this:
```python
import scrapy

class Spider(scrapy.Spider):
    def parse(self, response):
        for lnk in response.sel.xpath('//span[@class="breadcrumb"]').links():
            yield scrapy.Request(lnk.url)
```

Here `.links()` takes the usual link extractor arguments except for restrict_xpaths.
> But talking semantics, I'd take Request.from_response() over response.newrequest(), wherever I might get that request from in the first place.

I tend to prefer `Request.from_response()` too; it is an already-used idiom and it doesn't interfere with your other idea about using selectors as a scope filter in link extraction.
> Or make that Request(url_or_response) resp. FormRequest(url_or_response)?

hmm... I don't get it.
> Maybe self.request(response) could also work instead of scrapy.Request(), because a request factory on response looks somehow wrong to me.

I don't see how `self.request()` could help with the absolute url problem.
> response.request is already the request that generated this response. So response.newrequest is confusing and is "flowing" in the wrong direction, if that makes sense. I don't feel like responses should generate anything. Sorry dangra.

No hard feelings, this is about getting into discussions to reach a good end :)
> Although, I don't think selectors deserves this task, mostly because linkextractors (#559) are designed to deal with it instead, I will take a step toward this idea in the spirit of #494 and #554 and combine linkextractors into selectors API

@dangra now this idea became good? #331
As I said, it doesn't convince me, but it's worth revisiting now that it gets more eyeballs, and it can be part of the broad improvements to the API.
I like @nyov's, @dangra's and @nramirezuy's ideas to fix this issue by providing an easier way to extract absolute urls. This is an indirect solution, but it seems that a `.links()` method could work well in practice.
No hard feelings from me either, just starting the discussion :)
@kmike @dangra How about this and just create a filter/processor? https://github.com/scrapy/scrapy/issues/578
@dangra

> Or make that Request(url_or_response) resp. FormRequest(url_or_response)?
> hmm... I don't get it.

I meant that `Request(url=)` might be simplified to take a response object instead of the more verbose `Request.from_response()`. But it's slightly unrelated.
> I don't see how self.request() could help with the absolute url problem.

No, but it would remove the need to import scrapy for `scrapy.Request()`. So in the interest of https://github.com/scrapy/scrapy/issues/554#issuecomment-33066705, I was thinking the spider class might have the request factory in that case, instead of the response.
I really like the way `lxml` deals with links. I propose this syntax:

```python
sel = Selector(response)
sel.make_links_absolute(response.url)
```
I find @juanriaza's proposal interesting. But maybe we can do it another way - add an optional `make_links_absolute` argument to the `Selector` constructor, defaulting to `True` because it's the most common use case, while keeping it possible to disable the feature and get unmodified links.
The Selector instance already references the response, no need to pass it as a parameter. Although I like having shortcuts to lxml's methods, I wouldn't use this one: I don't want to change the tree in place, so I'd prefer it as an argument to the constructor. P.S. Like @chekunkov suggests, but I'd prefer `False` as the default.
1) Currently all Selectors for the same Response share the same copy of the parsed tree; inplace modification becomes trickier because of that. We'd have to deepcopy the tree and change the copy if we don't want the changes to affect unrelated code.
2) `make_links_absolute` as a Selector constructor argument doesn't allow customizing how the `response.xpath` and `response.css` methods work, and `response.css('.mycls a::attr(href)')` is an important use case for absolute links.
3) `make_links_absolute` as a Selector method (not as a Selector constructor argument) changes the selector state; imho it is better to avoid adding state-mutating methods because they are a source of bugs.
4) Agreed with @Digenis - `make_links_absolute` shouldn't be the default. For one, it is not possible to restore the original url values after they are made absolute; we shouldn't mess with the HTML by default.
I'm not sure I like that in @dangra's example the `links()` method returns Link objects and not urls directly. But I like the approach with a link-extraction method more than `make_links_absolute`, because it does not require modifying shared state - its effect is local. Users request links, they get them, no gotchas.
1) I haven't ever seen several different `Selector`s created for a single `Response`. Does that really happen? Any reason for doing that? Even if a spider downloads the same response a second time, it's often processed by the same `Selector`. So if a developer decides that he needs a `Selector` to return absolute links from a response, it's highly unlikely that he will later change that decision for the same response. And even if that happens - he can use `Selector`s with absolute links disabled from the start.
2) Oh, those new shortcuts :( That's an inconvenience. The first solution that comes to mind: for customizing response selectors, the developer could pass some value to `Request` meta (or maybe a new `Request` argument could be introduced to control that behavior).
4) From my experience we want `Selector` to return absolute urls almost all the time. I expect the default value to be the most commonly used case, not the one that doesn't change the original values. A default of `make_links_absolute=False` introduces extra typing for developers - which is what we are trying to avoid, as I understood from the ticket description.

> For one, it is not possible to restore original url values after they are made absolute

One can pass `make_links_absolute=False`, get the original values and do whatever he wants with them.
5) @dangra's `.links()` proposal... I kinda dislike the idea of adding extra methods to the `Selector` API. But maybe you have a point.
1) This happens in practice a lot: stored references to "filtered" selectors like `divs = sel.xpath('//div')` (there are Selectors in a SelectorList), utility functions/methods that take a response as a parameter and create Selectors to extract data, etc.
2) Request meta is the wrong place to control the details of response HTML parsing; this should be done in a callback.
4) Agreed, I also think that extracting absolute links is the more common case. But changing the default could be surprising, it has an overhead, it could be non-obvious how to get the original values, it is backwards-incompatible, and there are (1), (2) and (3) to solve.
The biggest disadvantage of `links()` is that it is a new API, but it doesn't have the problems that `make_links_absolute` has.
I think `make_links_absolute` shouldn't be an argument for the constructor but a `Selector` method.
The way `make_links_absolute` works in `lxml` is per element, not for the whole tree. It would make sense for `make_links_absolute` to work with the `xpath`/`css` methods:
```python
>>> from lxml.html import fromstring, tostring
>>> base_url = 'http://juanriaza.com'
>>> html_1 = '<a href="first.html">First</a>'
>>> html_2 = '<a href="second.html">Second</a>'
>>> html_3 = html_1 + html_2
>>> test_1 = fromstring(html_3)
>>> test_1.make_links_absolute(base_url=base_url)
>>> tostring(test_1)
'<span><a href="http://juanriaza.com/first.html">First</a><a href="http://juanriaza.com/second.html">Second</a></span>'
>>> test_2 = fromstring(html_3)
>>> test_2.cssselect('a')[0].make_links_absolute(base_url=base_url)
>>> tostring(test_2)
'<span><a href="http://juanriaza.com/first.html">First</a><a href="second.html">Second</a></span>'
```
@juanriaza `make_links_absolute` as a method bugs me because it modifies the tree inplace. I'd prefer something like this:

```python
sel2 = sel.with_absolute_links(attrs=['href', 'src'], resolve_base=False)
# sel2 now emits absolute links, sel still emits relative links
```
But why is it better than the `links` method?
@kmike I really like the `with_absolute_links` idea. What would the `links` method return?
@juanriaza the `links()` method would return scrapy.link.Link objects. It is more specific and maybe less versatile than `with_absolute_links()`, but the code may become shorter in common cases.
```python
def parse(self, response):
    for url in response.xpath('//span[@class="breadcrumb"]//a/@href').extract():
        yield scrapy.Request(urljoin(response.url, url))
```

```python
def parse(self, response):
    sel = response.selector.with_absolute_links()
    for url in sel.xpath('//span[@class="breadcrumb"]//a/@href').extract():
        yield scrapy.Request(url)
```

```python
def parse(self, response):
    for link in response.xpath('//span[@class="breadcrumb"]').links():
        yield scrapy.Request(link.url)
```
@kmike

> and there are (1), (2) and (3) to solve.

(1), (2). I think `make_links_absolute` should not be a separate method from the beginning - because of the reasons you guys already mentioned :)
> The biggest disadvantage of links() is that it is a new API, but it doesn't have the problems that make_links_absolute has.

hm, ok, you almost convinced me; maybe `.links()` would be the better solution.
In that case I would prefer something like `iterlinks`. It's really useful to access not just the url but also the element.
I'm not 100% happy with any of the options so far. There is an issue when one wants to extract multiple attributes from the same element, like:

```python
for a in response.css('a.data-link'):
    cls = a.xpath('@class').extract()[0]
    title = a.xpath('@title').extract()[0]
    href = a.xpath('@href').extract()[0]
    # ??? href = a.links()[0]
```

In this case `with_absolute_links` would work in a saner way, but using xpaths to extract element attributes is still ugly. Usually in such a case you use an ItemLoader; `with_absolute_links` seems to play better with ItemLoaders.
@juanriaza I was also thinking about proposing to add a reference to the element to Link object. But there could be a reason why it was not added before.
From my pov, the ideal solution would be to implement both: a `with_absolute_links` method and an `iterlinks` generator that yields `(element, attribute, link, pos)` for every link, as `lxml` does.
@juanriaza instead of an `(element, attribute, link, pos)` tuple it may be better to yield Link objects, which contain similar information plus extra methods. They are already used in Scrapy.
@kmike agree.
Also imho elements should carry attributes as a `dict`, but that is a concern for another issue.
I think there are 2 different use cases.
1) Extracting links from subtrees. This seems to me a much broader concern than absolute urls. I would not use it in most situations, because I need more control over which link I want to extract. Think about `<a href="..."><img src="..."></a>`.

```python
def parse(self, response):
    for link in response.xpath('//span[@class="breadcrumb"]').links():
        yield scrapy.Request(link.url)
```

2) Translating hrefs to urls. I believe this solves the absolute url issue nicely without adding any complexity (no middleware, no extra arguments...). Any other approach would generate either more coupling between classes or more complexity in code.

```python
def parse(self, response):
    for url in response.xpath('//span[@class="breadcrumb"]//a[@href]').extract_as_url():
        yield scrapy.Request(url)
```
For Scrapy 1.0 we're going to add a `response.urljoin()` method.
We haven't agreed on `.links()` vs `.extract_as_url()` vs `.with_absolute_links()` vs other proposals yet.
The `response.urljoin` PR is merged. It is not the final solution for this problem, but we probably won't implement anything else for Scrapy 1.0, so I removed the milestone from this ticket.
@kmike, I have the feeling that the discussion for the remaining work is in https://github.com/scrapy/scrapy/issues/1940. If so, can this one be closed?
@redapple I think we should fix it from both sides: provide .links() method in parsel and create another API for scheduling requests in scrapy.
ok @kmike,

> create another API for scheduling requests in scrapy.

this is https://github.com/scrapy/scrapy/issues/1940, correct?
and

> provide .links() method in parsel

can be covered within https://github.com/scrapy/parsel/issues/26
Yep, if we agree that .links + #1940 is the way to solve this issue.
Scrapy now has `response.follow`, which is very similar to the `response.newrequest` proposed by @dangra in https://github.com/scrapy/scrapy/issues/548#issuecomment-33066664. I think it solves the original issue - it allows handling relative URLs correctly with minimal effort from the user.
@kmike how about making this feature available by default in link extraction? I think everyone has to handle this, since it's the common behavior in browsers.
Found what I needed in @kmike's comment above. Is there some reason to not consider that solution sufficient to close this issue?
Fixed in #2540.
Building a URL relative to the current URL is a very common task; currently users are required to do that themselves - `import urlparse` and then `urlparse.urljoin(response.url, href)`. What do you think about adding a shortcut for that - something like `response.urljoin(href)` or `response.absolute_url(href)`?
Another crazy idea is to support relative urls directly, by assuming they are relative to `response.url`. I think it can be implemented as a spider middleware. I actually like this idea :)
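The proposed shortcut is essentially a one-line method. A stand-alone sketch using a stand-in `Response` class (the real method later landed on Scrapy's Response):

```python
from urllib.parse import urljoin  # ``urlparse`` module on Python 2

class Response(object):
    """Stand-in for scrapy's Response, just to show the shortcut."""
    def __init__(self, url):
        self.url = url

    def urljoin(self, url):
        # join a possibly-relative url against this response's url
        return urljoin(self.url, url)

resp = Response('http://example.com/products/page3.html')
print(resp.urljoin('../about.html'))
```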