scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License

Offsite middleware ignoring port #50

Closed: eedeep closed this issue 4 years ago

eedeep commented 12 years ago

In my spider I have the following:

class MySpider(BaseSpider):

    allowed_domains = ['192.168.0.15:8080']

and in the parse method I do something like:

    yield Request('http://192.168.0.15:8080/mypage.html', self.my_callback_function)

the result when I run the code is that that scrapy reports:

DEBUG: Filtered offsite request to '192.168.0.15': <GET http://192.168.0.15:8080/mypage.html>

This is wrong: it seems to be ignoring the port. If I change allowed_domains to:

    allowed_domains = ['192.168.0.15:8080', '192.168.0.15']

Then it works as you would expect. It's no big deal and I can work around it, but I think it is a bug. The problem is located in the should_follow method of the OffsiteMiddleware class in contrib/spidermiddleware/offsite.py.
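The mismatch is easy to see with the standard library alone: the middleware builds its regex from allowed_domains (which here includes a port), but matches it against the parsed hostname, which never carries the port. A minimal illustration using Python's urllib.parse (not Scrapy code):

from urllib.parse import urlparse

parts = urlparse('http://192.168.0.15:8080/mypage.html')
print(parts.hostname)  # '192.168.0.15'      -> what should_follow matches against
print(parts.netloc)    # '192.168.0.15:8080' -> what the allowed_domains entry looks like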

immerrr commented 7 years ago

@redapple can you reopen this? The problem seems to have resurfaced recently: it happens in 1.0.3 and I didn't see any changes touching this in recent releases.

redapple commented 7 years ago

@immerrr , do you have a reproducible example?

immerrr commented 7 years ago

Something like this should do:

# -*- coding: utf-8 -*-
import scrapy

class Localhost8000Spider(scrapy.Spider):
    name = "localhost_8000"
    allowed_domains = ["localhost:8000"]
    start_urls = (
        'http://localhost:8000/',
    )

    def parse(self, response):
        yield scrapy.Request(response.url + 'foobar')

Results in:

2016-12-07 12:23:54 [scrapy] DEBUG: Crawled (200) <GET http://localhost:8000/> (referer: None)
2016-12-07 12:23:54 [scrapy] DEBUG: Filtered offsite request to 'localhost': <GET http://localhost:8000/foobar>

imp0wd3r commented 7 years ago

@redapple @immerrr I made the following change, and then it worked:

# scrapy/spidermiddlewares/offsite.py

def should_follow(self, request, spider):
    regex = self.host_regex
    # pre: host = urlparse_cached(request).hostname or ''
    host = urlparse_cached(request).netloc or ''
    return bool(regex.search(host))

ghost commented 6 years ago

Marking this as "Good First Issue", as there's a suggested patch already. For this to be "solved", a PR should include some additional test cases for this issue.
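For anyone picking this up, a test along these lines could exercise the suggested patch. This is only a rough sketch (the helpers are the usual ones from the Scrapy test suite, and the assertions describe the behaviour proposed above, not the current behaviour):

from scrapy.http import Request
from scrapy.spiders import Spider
from scrapy.spidermiddlewares.offsite import OffsiteMiddleware
from scrapy.utils.test import get_crawler

def test_allowed_domains_with_port():
    # spider whose allowed_domains entry carries an explicit port
    crawler = get_crawler(Spider)
    spider = crawler._create_spider(name='port_spider',
                                    allowed_domains=['localhost:8000'])
    mw = OffsiteMiddleware.from_crawler(crawler)
    mw.spider_opened(spider)  # builds mw.host_regex from allowed_domains

    # with the netloc-based patch, a request to the same host:port is on-site...
    assert mw.should_follow(Request('http://localhost:8000/foobar'), spider)
    # ...while a request to another port is still filtered
    assert not mw.should_follow(Request('http://localhost:9000/foobar'), spider)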

tensigh commented 6 years ago

Thank you, that fix worked for me, too.

TakaakiFuruse commented 6 years ago

@cathalgarvey I'm trying to make a pull request (my first pull request for a big project!) and found that applying the patch creates another error. I think I need to research further... Here's the test log.


============================================================ FAILURES ============================================================
____________________________________________________ EngineTest.test_crawler _____________________________________________________

result = None, g = <generator object test_crawler at 0x7f396bc25780>, deferred = <Deferred at 0x7f396b6a6a28 current result: None>

    def _inlineCallbacks(result, g, deferred):
        """
        See L{inlineCallbacks}.
        """
        # This function is complicated by the need to prevent unbounded recursion
        # arising from repeatedly yielding immediately ready deferreds.  This while
        # loop and the waiting variable solve that by manually unfolding the
        # recursion.

        waiting = [True, # waiting for result?
                   None] # result

        while 1:
            try:
                # Send the last result back as the result of the yield expression.
                isFailure = isinstance(result, failure.Failure)
                if isFailure:
                    result = result.throwExceptionIntoGenerator(g)
                else:
>                   result = g.send(result)

/scrapy/.tox/py27/local/lib/python2.7/site-packages/twisted/internet/defer.py:1386:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/scrapy/tests/test_engine.py:167: in test_crawler
    self._assert_visited_urls()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <tests.test_engine.EngineTest testMethod=test_crawler>

    def _assert_visited_urls(self):
        must_be_visited = ["/", "/redirect", "/redirected",
                           "/item1.html", "/item2.html", "/item999.html"]
        urls_visited = set([rp[0].url for rp in self.run.respplug])
        urls_expected = set([self.run.geturl(p) for p in must_be_visited])
>       assert urls_expected <= urls_visited, "URLs not visited: %s" % list(urls_expected - urls_visited)
E       AssertionError: URLs not visited: ['http://localhost:42519/item2.html', 'http://localhost:42519/item999.html', 'http://localhost:42519/item1.html']

/scrapy/tests/test_engine.py:183: AssertionError
=============================================== 1 failed, 3 passed in 1.37 seconds ===============================================
ERROR: InvocationError: '/scrapy/.tox/py27/bin/py.test --cov=scrapy --cov-report= tests/test_engine.py'
____________________________________________________________ summary _____________________________________________________________

TakaakiFuruse commented 6 years ago

"/scrapy/tests/test_engine.py" is failing. It looks the the test creates spiders by using

allowed_domains = ["scrapytest.org", "localhost"] 

If I apply the suggested patch,

# scrapy/spidermiddlewares/offsite.py

def should_follow(self, request, spider):
    regex = self.host_regex
    # pre: host = urlparse_cached(request).hostname or ''
    host = urlparse_cached(request).netloc or ''
    return bool(regex.search(host))

it fails to visit some URLs with a port number, like 'http://localhost:37089/item1.html'. Since the test server is launched each time we run the test, it gets a different port number for localhost each time. This means we cannot add a fixed port number to "localhost" in the allowed_domains list.
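That failure is consistent with the regex that get_host_regex() builds (quoted further down in this thread): for allowed_domains = ['localhost'] it only matches a bare hostname, so a netloc that carries the random test port can never match. A quick standalone illustration:

import re

# same shape of pattern that get_host_regex() builds for allowed_domains = ['localhost']
host_regex = re.compile(r'^(.*\.)?(localhost)$')

print(bool(host_regex.search('localhost')))        # True  -> the hostname-based check passes
print(bool(host_regex.search('localhost:37089')))  # False -> the netloc-based check filters the request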

My idea is to patch it like this:

# scrapy/spidermiddlewares/offsite.py

def should_follow(self, request, spider):
    regex = self.host_regex
    # hostname can be None for wrong urls (like javascript links)
    hostname = urlparse_cached(request).hostname or ''
    netloc = urlparse_cached(request).netloc or ''
    return bool(regex.search(hostname)) or bool(regex.search(netloc))

Any ideas?

Lukas0907 commented 4 years ago

The problem is that a spider with allowed_domains = ["localhost:8080"] is not allowed to request http://localhost:8080. However, if the user wants to crawl a URL with an explicit port and doesn't want to shield the spider from other ports on the same domain, they can use allowed_domains = ["localhost"], which does allow requests to http://localhost:8080 (i.e. there is an easy workaround). Still, this is probably surprising/unintuitive to users and looks like a bug.

If we want to allow port numbers in allowed_domains, we have to decide on how it changes the semantics of allowed_domains (or the notion of a domain in the context of Scrapy in general).

A domain is used in two contexts: First, it is used to check if a URL is allowed to be followed (OffsiteMiddleware), second, it is used for extracting links from a site (LinkExtractor).

(Note: The difference was introduced in b1f011d7408343aabdf89c2c25fcde9b3c2b38a2 in an attempt to fix this issue #50.)

The following table goes into more detail:

URL                       Domain             url_is_from_any_domain() [1]   allowed_domains [2]
http://example.com        example.com        True                           True
https://example.com       example.com        True                           True
https://example.com       example.com:443    False                          False
http://example.com:80     example.com        False                          True
http://example.com        example.com:80     False                          False
http://example.com:80     example.com:80     True                           False
http://example.com:8080   example.com        False                          True
http://example.com        example.com:8080   False                          False
http://example.com:8080   example.com:8080   True                           False

[1] True iff hostnames and ports are equal.
[2] True iff hostnames are equal and the domain has no port (the port of the URL is ignored).
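The url_is_from_any_domain() column can be checked directly against the public helper in scrapy.utils.url (a quick sketch):

from scrapy.utils.url import url_is_from_any_domain

# the port has to match exactly as soon as either side specifies one
print(url_is_from_any_domain('http://example.com:80', ['example.com']))         # False
print(url_is_from_any_domain('http://example.com:8080', ['example.com:8080']))  # True
# with no port on either side, plain hostname matching applies
print(url_is_from_any_domain('http://example.com', ['example.com']))            # True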

We could change should_follow() to use the same semantics as url_is_from_any_domain(); however, this would be a potentially breaking change in the following case: a user scrapes example.com:8080 and sets allowed_domains = ['example.com']. The new semantics would require the port to be included in allowed_domains, i.e. allowed_domains = ['example.com:8080'], so the user would have to modify their spider.

Another solution would be to only consider the port in the URL if it is also given in the domain (this is the solution proposed by @TakaakiFuruse). I.e. allowed_domains = ['example.com'] would allow following http://example.com and http://example.com:8080 (which is the existing behavior), while allowed_domains = ['example.com:8080'] would only allow following http://example.com:8080, but not http://example.com or http://example.com:80. This would be backwards-compatible.

I think that the best solution, however, would be not to allow ports in allowed_domains at all. This makes sense because the property is called allowed_domains, i.e. the user is expected to give a list of domains. The port number is not part of the domain; it's part of the URL. Since the user might not think about the difference between a domain and a URL, there is already a check in the offsite middleware that warns if the user passes a URL:


    def get_host_regex(self, spider):
        # (...)
        url_pattern = re.compile("^https?://.*$")
        for domain in allowed_domains:
            if url_pattern.match(domain):
                message = ("allowed_domains accepts only domains, not URLs. "
                           "Ignoring URL entry %s in allowed_domains." % domain)
                warnings.warn(message, URLWarning)
        domains = [re.escape(d) for d in allowed_domains if d is not None]
        regex = r'^(.*\.)?(%s)$' % '|'.join(domains)
        return re.compile(regex)

We could now also issue a warning if a domain with a port number is added and ignore that entry, just like it is done for URLs. I have prepared PR #4413 to fix this issue.
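For illustration, such a check could look roughly like the sketch below; the helper name is made up here, and the actual change lives in PR #4413:

import re
import warnings

def drop_port_entries(allowed_domains):
    # Sketch only: warn about and ignore allowed_domains entries that carry
    # a port, mirroring the existing URL check above.
    port_pattern = re.compile(r':\d+$')
    kept = []
    for domain in allowed_domains:
        if domain and port_pattern.search(domain):
            warnings.warn("allowed_domains accepts only domains without ports. "
                          "Ignoring entry %s in allowed_domains." % domain)
        else:
            kept.append(domain)
    return kept

# drop_port_entries(['example.com', 'localhost:8000']) warns about the second
# entry and returns ['example.com'].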

nyov commented 4 years ago

I agree with @Lukas0907 on the principle here: allowed_domains handles domains. I don't currently see a reason to include any ports; allowed_domains is not the start_urls setting.

Behaviour: allow any port; don't check against the netloc, but against the hostname property.

What sensible reason is there to restrict OffsiteMiddleware by port? Allow links to port 443 but not to port 80? Is this kind of granularity useful?