scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License

Support relative urls better #548

Closed kmike closed 4 years ago

kmike commented 10 years ago

Building a URL relative to the current URL is a very common task; currently users are required to do it themselves - import urlparse and then call urlparse.urljoin(response.url, href).

What do you think about adding a shortcut for that - something like response.urljoin(href) or response.absolute_url(href)?

Another crazy idea is to support relative urls directly, by assuming they are relative to a response.url. I think it can be implemented as a spider middleware. I actually like this idea :)
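To make the middleware idea concrete, here is a rough sketch of how it could work - purely illustrative; the AbsolutizeMiddleware name and the minimal Request stand-in are hypothetical, not Scrapy API (the real scrapy.Request is immutable and offers .replace()):

```python
from urllib.parse import urljoin, urlparse

class Request:
    """Minimal stand-in for scrapy.Request, to keep the sketch self-contained."""
    def __init__(self, url):
        self.url = url

    def replace(self, url):
        return Request(url)

class AbsolutizeMiddleware:
    """Hypothetical spider middleware: any Request yielded by a callback
    with a relative URL is replaced by its absolute counterpart, joined
    against the URL of the response being parsed."""
    def process_spider_output(self, response, result, spider=None):
        for obj in result:
            url = getattr(obj, "url", None)
            if url is not None and not urlparse(url).scheme:
                # relative url: rebuild the request with the joined url
                obj = obj.replace(url=urljoin(response.url, url))
            yield obj
```

The replace-the-request step is exactly what the discussion below flags as the risky part: code that relies on request identity would see a different object.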

dangra commented 10 years ago

Another option is Request.from_response(), where relative urls are already supported by FormRequest.from_response().

I can see why a spider middleware looks sweet but it actually requires allowing requests to be created with relative urls and later replace the request by its absolute url counterpart.

I prefer Request.from_response() backed by response.urljoin().

kmike commented 10 years ago

Let me advocate adding direct support for relative urls :)

Most spiders (all?) work in the following fashion:

1) the spider makes some initial requests based on start_requests (all urls are absolute there);
2) the spider makes additional requests from parse... methods, using urls extracted from the page being parsed.

Because the urls in (2) are most often extracted from the page being parsed, they should most often be handled as relative. Almost every spider needs this.

Sometimes relative urls are not handled because the developer assumes that the website provides absolute urls, so there is no motivation to handle relative urls. If that assumption is wrong (e.g. urls are absolute on some pages, but relative on others), this leads to an error.

a) If we fix #494, existing way to write spiders would be the following:

import scrapy
import urlparse

class MySpider(scrapy.Spider):
    # ...
    def parse(self, response):
        sel = scrapy.Selector(response)
        for href in sel.xpath('//a/@href').extract():
            url = urlparse.urljoin(response.url, href)
            yield scrapy.Request(url)

b) with your proposal:

import scrapy

class MySpider(scrapy.Spider):
    # ...
    def parse(self, response):
        sel = scrapy.Selector(response)
        for href in sel.xpath('//a/@href').extract():
            yield scrapy.Request.from_response(response, href)

c) with direct support for relative urls:

import scrapy

class MySpider(scrapy.Spider):
    # ...
    def parse(self, response):
        sel = scrapy.Selector(response)
        for href in sel.xpath('//a/@href').extract():
            yield scrapy.Request(href)

(b) and (c) are both improvements, but I like (c) more because this code needs to be written in almost all parse... methods, and we would be able to remove this boilerplate. Returning a plain Request reads better and is a simpler "contract" for spider callbacks.

You're right that replacing the Request may introduce bugs - some code could break if it checks for request identity. Alternatively, we'd have to bring some mutability back (maybe internally). We'd also have to move url validation to a later point. But to me it all looks like a reasonable price for a better API.

dangra commented 10 years ago

there is another option, use the response as a request factory like:

import scrapy

class MySpider(scrapy.Spider):
    # ...
    def parse(self, response):
        sel = scrapy.Selector(response)
        for href in sel.xpath('//a/@href').extract():
            yield response.newrequest(href)

I dislike the idea of allowing relative urls in request.url and relying on a spider middleware to fix them: getting the order of all possible enabled middlewares in a project right is prone to mistakes and gotchas.

Also, think about base spider classes used as generics that allow extending a few methods: callers would have to take into account that a returned request may not be absolute. Should they absolutize it themselves, or make do with the relative url? It would break current scrapy projects, as users will expect spider methods to handle relative urls everywhere. This is a problem right now with callbacks returning a single request vs. a generator of requests; it's common to find calls to scrapy.utils.spider.iterate_spider_output wrapping methods called from the spider itself.

Aside from that, requests can be injected skipping spider middlewares (engine.crawl() or engine.download()), or the AbsolutizeMiddleware can be disabled; in these cases the error is logged by the downloader instead of at instantiation time, which is far less useful from a debugging perspective IMHO.

In #554 you propose adding methods to the response for selecting html/xml documents; if we follow that path, it is also logical to integrate FormRequest somehow (i.e. response.submitform()). So what is wrong with building requests derived from a response on the response itself?

nyov commented 10 years ago

How about doing the opposite and not working with relative URLs in the spider at all? Where do we need relative URLs in the spider? Instead of handing off relative links and hoping some middleware will fix them, I'd prefer getting an absolute URL back from the selector.

So maybe HtmlResponse could support rewriting links based on <base /> and other logic, as a browser might when saving a 'webpage' locally. Or the Selector might support something along the lines of .extract_absolute_url(), which would rewrite this URL or any URL inside the selected content, if that's feasible. Then I'd keep some control and could do it selectively.
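For reference, honoring <base /> boils down to a two-step join. A minimal stdlib-only sketch of the idea (the absolutize helper is hypothetical, and a real implementation would use a proper HTML parser rather than a regex):

```python
import re
from urllib.parse import urljoin

def absolutize(response_url, html, href):
    """Resolve href the way a browser would: against the document's
    <base href> if present, otherwise against the response URL."""
    m = re.search(r'<base[^>]*\shref=["\']([^"\']+)["\']', html, re.I)
    base = urljoin(response_url, m.group(1)) if m else response_url
    return urljoin(base, href)
```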

But talking semantics, I'd take Request.from_response() over response.newrequest(), wherever I might get that request from in the first place. Or make that Request(url_or_response) resp. FormRequest(url_or_response)?

Maybe self.request(response) could also work instead of scrapy.Request(), because a request factory on the response looks somehow wrong to me. response.request is already the request that generated this response, so response.newrequest is confusing and "flows" in the wrong direction, if that makes sense. I don't feel like responses should generate anything. Sorry dangra.

In general, I'm not so sure about this path (+ #554). It's starting to look more like jQuery's way of chaining to hell and back. Then again, looking at jQuery's following, maybe that's a good thing.

dangra commented 10 years ago

Instead of handing off relative links and hoping some middleware will fix it , I'd prefer getting an absolute URL back from the selector.

Although I don't think selectors deserve this task - mostly because link extractors (#559) are designed to deal with it instead - I will take a step toward this idea, in the spirit of #494 and #554, and combine link extractors into the selectors API like this:

import scrapy

class Spider(scrapy.Spider):

    def parse(self, response):
        for lnk in response.sel.xpath('//span[@class="breadcrumb"]').links():
            yield scrapy.Request(lnk.url)

Here .links() takes the usual link extractor arguments, except for restrict_xpaths.
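To make the proposal concrete, here is a toy model of what such a .links() call could return - absolute URLs for every <a href> under the selected nodes - built on the stdlib only (the LinkCollector and extract_links names are made up for this sketch; they are not Scrapy API):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect the href of every <a> and absolutize it against base_url."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))

def extract_links(fragment, base_url):
    parser = LinkCollector(base_url)
    parser.feed(fragment)
    return parser.links
```

The point of the API shape is that absolutization happens inside the extraction step, so spider callbacks never see a relative url.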

dangra commented 10 years ago

But talking semantics, I'd take Request.from_response() over response.newrequest(), wherever I might get that request from in the first place.

I tend to prefer Request.from_response() too, it is an already used idiom and it doesn't interfere with your other idea about using selectors as scope filter in link extraction.

Or make that Request(url_or_response) resp. FormRequest(url_or_response)?

hmm... I don't get it.

Maybe self.request(response) could also work instead of scrapy.Request(), because a request factory on response looks somehow wrong to me.

I don't see how self.request() could help with the absolute url problem.

response.request is already the request that generated this response. So response.newrequest is confusing and is "flowing" in the wrong direction, if that makes sense. I don't feel like responses should generate anything. Sorry dangra.

No hard feelings, this is about getting into discussions to reach a good end :)

nramirezuy commented 10 years ago

Although, I don't think selectors deserves this task, mostly because linkextractors (#559) are designed to deal with it instead, I will take a step toward this idea in the spirit of #494 and #554 and combine linkextractors into selectors API

@dangra now this idea became good? https://github.com/scrapy/scrapy/pull/331

dangra commented 10 years ago

@dangra now this idea became good? #331

As I said, it doesn't convince me, but it's worth revisiting now that it's getting more eyeballs, and it can be part of the broader improvements to the API.

kmike commented 10 years ago

I like @nyov's, @dangra's and @nramirezuy's ideas to fix this issue by providing an easier way to extract absolute urls. This is an indirect solution, but it seems that .links() method could work well in practice.

No hard feelings from me too, just starting the discussion :)

nramirezuy commented 10 years ago

@kmike @dangra How about this and just create a filter/processor? https://github.com/scrapy/scrapy/issues/578

nyov commented 10 years ago

@dangra

Or make that Request(url_or_response) resp. FormRequest(url_or_response)?

hmm... I don't get it.

I meant that Request(url=) might be simplified to take a response object instead of the more verbose Request.from_response(). But it's slightly unrelated.

I don't see how self.request() could help with the absolute url problem.

No, but it would remove the need to import scrapy just for scrapy.Request(). So in the interest of https://github.com/scrapy/scrapy/issues/554#issuecomment-33066705, I was thinking the spider class might host the request factory in that case, instead of the response.

juanriaza commented 9 years ago

I really like the way lxml deals with links. I propose this syntax:

sel = Selector(response)
sel.make_links_absolute(response.url)

chekunkov commented 9 years ago

I find @juanriaza's proposal interesting. But maybe we can do it another way - add an optional make_links_absolute argument to the Selector constructor, defaulting to True because that's the most common use case, but with the possibility to disable the feature to get unmodified links.

Digenis commented 9 years ago

The Selector instance already references the response, so there's no need to pass it as a parameter. Although I like having shortcuts to lxml's methods, I wouldn't use this one: I don't want to change the tree in place, so I'd prefer it as an argument to the constructor. P.S. Like @chekunkov suggests, but I'd prefer False as the default.

kmike commented 9 years ago

1) Currently all Selectors for the same Response share the same copy of the parsed tree; in-place modification becomes trickier because of that. We'd have to deepcopy the tree and change the copy if we don't want the changes to affect unrelated code.

2) make_links_absolute as a Selector constructor argument doesn't allow customizing how the response.xpath and response.css methods work, and response.css('.mycls a::attr(href)') is an important use case for absolute links.

3) make_links_absolute as a Selector method (not as a Selector constructor argument) changes the selector state; imho it is better to avoid adding state-mutating methods because they are a source of bugs.

4) Agreed with @Digenis - make_links_absolute shouldn't be the default. For one, it is not possible to restore the original url values after they are made absolute; we shouldn't mess with the HTML by default.

I'm not sure I like that in @dangra's example the links() method returns Link objects and not urls directly. But I like the link-extraction-method approach more than make_links_absolute because it does not require modifying shared state - its effect is local. Users request links, they get them, no gotchas.

chekunkov commented 9 years ago

1) I have never seen several different Selectors created for a single Response. Does that really happen? Any reason for doing that? Even if a spider downloads the same response a second time, it's often processed by the same Selector code. So if a developer decides that he needs the Selector to return absolute links from a response, it's highly unlikely that he will later change his decision for the same response. And even if that happens, he can use Selectors with absolute links disabled from the start.

2) Oh, those new shortcuts :( That's an inconvenience. The first solution that comes to mind: to customize response selectors, the developer could pass some value in Request meta (or maybe we introduce a new Request argument to control that behavior).

4) In my experience we want the Selector to return absolute urls almost all the time. I expect the default value to be the most commonly used case, not the one that doesn't change the original values. If the default is make_links_absolute = False, that introduces extra typing for developers - which is what we are trying to avoid, as I understood from the ticket description.

For one, it is not possible to restore original url values after they are made absolute

One can pass make_links_absolute = False, get the original values and do whatever he wants with them.

5) On @dangra's .links() proposal... I kind of dislike the idea of adding extra methods to the Selector API. But maybe you have a point.

kmike commented 9 years ago

1) This happens in practice a lot: stored references to "filtered" selectors like divs = sel.xpath('//div') (the items of a SelectorList are Selectors), utility functions/methods that take a response as a parameter and create Selectors to extract data, etc.

2) Request meta is the wrong place to control the details of response HTML parsing; this should be done in the callback.

4) Agreed, I also think that extracting absolute links is the more common case. But changing the default could be surprising: it has an overhead, it could be non-obvious how to get the original values back, it is backwards-incompatible, and there are (1), (2) and (3) to solve.

The biggest disadvantage of links() is that it is a new API, but it doesn't have the problems that make_links_absolute has.

juanriaza commented 9 years ago

I think make_links_absolute shouldn't be a constructor argument but a Selector method.

The way make_links_absolute works in lxml is per element, not for the whole tree. It would make sense for make_links_absolute to work with the xpath/css methods:

>>> from lxml.html import fromstring, tostring
>>> base_url = 'http://juanriaza.com'
>>> html_1 = '<a href="first.html">First</a>'
>>> html_2 = '<a href="second.html">Second</a>'
>>> html_3 = html_1 + html_2
>>> test_1 = fromstring(html_3)
>>> test_1.make_links_absolute(base_url=base_url)
>>> tostring(test_1)
'<span><a href="http://juanriaza.com/first.html">First</a><a href="http://juanriaza.com/second.html">Second</a></span>'
>>> test_2 = fromstring(html_3)
>>> test_2.cssselect('a')[0].make_links_absolute(base_url=base_url)
>>> tostring(test_2)
'<span><a href="http://juanriaza.com/first.html">First</a><a href="second.html">Second</a></span>'

kmike commented 9 years ago

@juanriaza make_links_absolute as a method bugs me because it modifies the tree in place. I'd prefer something like this:

sel2 = sel.with_absolute_links(attrs=['href', 'src'], resolve_base=False)
# sel2 now emits absolute links, sel still emits relative links

But why is it better than links method?

juanriaza commented 9 years ago

@kmike I really like the with_absolute_links idea. What would the links method return?

kmike commented 9 years ago

@juanriaza the links() method would return scrapy.link.Link objects. It is more specific and maybe less versatile than with_absolute_links(), but the code may become shorter in common cases.

def parse(self, response):
    for href in response.xpath('//span[@class="breadcrumb"]//a/@href').extract():
        yield scrapy.Request(urljoin(response.url, href))

def parse(self, response):
    sel = response.selector.with_absolute_links()
    for url in sel.xpath('//span[@class="breadcrumb"]//a/@href').extract():
        yield scrapy.Request(url)

def parse(self, response):
    for link in response.xpath('//span[@class="breadcrumb"]').links():
        yield scrapy.Request(link.url)

chekunkov commented 9 years ago

@kmike

and there are (1), (2) and (3) to solve.

(1), (2): I think make_links_absolute should not be a separate method in the first place - for the reasons you guys have already mentioned :)

The biggest disadvantage of links() is that it is a new API, but it dosn't have the problems that make_links_absolute has.

hm, ok, you've almost convinced me; maybe .links() would be the better solution.

juanriaza commented 9 years ago

In that case I would prefer something like iterlinks. It's really useful to access not just the url but also the element.
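For context, this is how lxml's iterlinks behaves (real lxml API; the sample HTML is made up):

```python
from lxml.html import fromstring

doc = fromstring('<span><a href="a.html">A</a><img src="b.png"></span>')

# iterlinks yields (element, attribute, link, pos) for every link-like
# attribute, in document order; the link value is left unresolved
found = [(el.tag, attr, link) for el, attr, link, pos in doc.iterlinks()]
# e.g. ('a', 'href', 'a.html') and ('img', 'src', 'b.png')
```

Access to the element alongside the url is what makes this shape more versatile than a flat list of urls.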

kmike commented 9 years ago

I'm not 100% happy with any of the options so far. There is an issue when one wants to extract multiple attributes from the same element, like

for a in response.css('a.data-link'):
    cls = a.xpath('@class').extract()[0]
    title = a.xpath('@title').extract()[0]
    href = a.xpath('@href').extract()[0]
    # ??? href = a.links()[0] 

In this case with_absolute_links would work in a saner way, but using xpaths to extract element attributes is still ugly. Usually in such cases you use an ItemLoader; with_absolute_links seems to play better with ItemLoaders.

@juanriaza I was also thinking about proposing to add a reference to the element to Link object. But there could be a reason why it was not added before.

juanriaza commented 9 years ago

From my pov, the ideal solution would be to implement both: a with_absolute_links method and an iterlinks generator that yields (element, attribute, link, pos) for every link as lxml does.

kmike commented 9 years ago

@juanriaza instead of (element, attribute, link, pos) tuple it may be better to yield Link objects which contain similar information and extra methods. They are already used in Scrapy.

juanriaza commented 9 years ago

@kmike agree.

Also, imho elements should carry their attributes as a dict, but that is a concern for another issue.

lenzai commented 9 years ago

I think there are 2 different use cases.

Extracting links from subtrees. This seems to me a much broader concern than absolute urls. I would not use it in most situations, because I need more control over which links I want to extract. Think about <a href="..."><img src="..."></a>.

def parse(self, response):
    for link in response.xpath('//span[@class="breadcrumb"]').links():
        yield scrapy.Request(link.url)

Translating hrefs to urls. I believe this solves the absolute url issue nicely without adding any complexity (no middleware, no extra arguments...). Any other approach would introduce either more coupling between classes or more complexity in the code.

def parse(self, response):
    for url in response.xpath('//span[@class="breadcrumb"]//a[@href]').extract_as_url():
        yield scrapy.Request(url)

kmike commented 9 years ago

For Scrapy 1.0 we're going to add response.urljoin() method.

We haven't agreed on .links() vs .extract_as_url() vs .with_absolute_links() vs other proposals yet.

kmike commented 9 years ago

The response.urljoin PR is merged. It is not the final solution for this problem, but we probably won't implement anything else for Scrapy 1.0, so I removed the milestone from this ticket.
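In practice response.urljoin(href) is a thin wrapper over the stdlib join against response.url (HtmlResponse additionally honors <base>, omitted here); the edge cases below are what make doing it by hand error-prone:

```python
from urllib.parse import urljoin

base = "http://example.com/a/b.html"

# relative path: resolved against the response's directory
print(urljoin(base, "c.html"))               # http://example.com/a/c.html
# root-relative path: resolved against the host
print(urljoin(base, "/c.html"))              # http://example.com/c.html
# protocol-relative url: keeps the scheme, swaps the host
print(urljoin(base, "//cdn.example.com/x"))  # http://cdn.example.com/x
# absolute url: passed through unchanged
print(urljoin(base, "http://other.com/"))    # http://other.com/
```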

redapple commented 7 years ago

@kmike , I have the feeling that the discussion for the remaining work is in https://github.com/scrapy/scrapy/issues/1940 If so, can this one be closed?

kmike commented 7 years ago

@redapple I think we should fix it from both sides: provide .links() method in parsel and create another API for scheduling requests in scrapy.

redapple commented 7 years ago

ok @kmike ,

create another API for scheduling requests in scrapy.

this is https://github.com/scrapy/scrapy/issues/1940, correct?

and

provide .links() method in parsel

can be covered within https://github.com/scrapy/parsel/issues/26

kmike commented 7 years ago

Yep, if we agree that .links + #1940 is the way to solve this issue.

kmike commented 7 years ago

Scrapy now has response.follow, which is very similar to the response.newrequest proposed by @dangra in https://github.com/scrapy/scrapy/issues/548#issuecomment-33066664. I think it solves the original issue - it allows handling relative URLs correctly with minimal effort from the user.
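A typical callback using response.follow is shown in the comments below (illustrative; the selector and callback names are assumptions), and the toy FollowModel captures just the url-resolution part of the contract so the sketch runs without Scrapy:

```python
from urllib.parse import urljoin

# In a real spider (Scrapy >= 1.4), a callback can pass relative or
# absolute urls to response.follow directly:
#
#     def parse(self, response):
#         for href in response.css('a::attr(href)').extract():
#             yield response.follow(href, callback=self.parse)

class FollowModel:
    """Toy model of how response.follow resolves the url it is given."""
    def __init__(self, url):
        self.url = url

    def follow(self, url):
        # the real method returns a scrapy.Request; we return the
        # resolved url to keep this sketch dependency-free
        return urljoin(self.url, url)
```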

dannyeuu commented 6 years ago

@kmike How about making this feature available by default in link extraction? I think so because everyone has to handle this, and it's common behavior in browsers.

mc0e commented 4 years ago

Found what I needed in @kmike's comment above. Is there some reason to not consider that solution sufficient to close this issue?

Gallaecio commented 4 years ago

Fixed in #2540.