scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License

dupefilter skips a request when a page is redirected to itself #1225

Open immerrr opened 9 years ago

immerrr commented 9 years ago

Hit this when trying to run a spider against scrapinghub.com: sometimes it responds with a 302 redirect back to scrapinghub.com. Scrapy then tries to schedule another request for scrapinghub.com, but that request is dropped because the dupefilter already considers the URL visited.

Maybe the dupefilter should only record a request once the response turns out not to be a redirect? And when it is a redirect, the scheduler should probably remember the original address too, so that the whole redirection chain can be marked as visited.

immerrr commented 9 years ago

Another point to consider: what if there's a redirection loop? Limit the redirection chain? Also, how to report this to the user?

kmike commented 9 years ago

for redirection loops there is http://doc.scrapy.org/en/latest/topics/settings.html#redirect-max-times
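
For reference, that setting caps how many redirects a single request may follow; a minimal settings.py sketch (20 is the documented default):

# settings.py -- maximum number of redirects to follow for a single request
# (20 is the default; lower it to fail faster on redirect loops)
REDIRECT_MAX_TIMES = 20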

kmike commented 9 years ago

//cc @sibiryakov - this is related to the 'canonical URLs' feature you're working on for frontera; how are you solving it there?

kmike commented 8 years ago

Encountered this myself. To clarify:

In [1]: from scrapy.utils.request import request_fingerprint
In [2]: import scrapy
In [3]: req1 = scrapy.Request('http://scrapy.org/foo')
In [4]: req2 = scrapy.Request('http://scrapy.org/foo/')
In [5]: request_fingerprint(req2) == request_fingerprint(req1)
Out[5]: False

In [6]: req1 = scrapy.Request('http://scrapy.org/')
In [7]: req2 = scrapy.Request('http://scrapy.org')
In [8]: request_fingerprint(req2) == request_fingerprint(req1)
Out[8]: True

So the issue shows itself when the main page of a website redirects to itself to add or remove a trailing slash.

It is not that visible because such requests are most often sent from start_requests or start_urls, where the dont_filter=True flag is used. It also works in scrapy shell for the same reason: scrapy shell uses the dont_filter=True flag in the fetch command.

+1 to solve it somehow; I think it is an important issue. We are silently[1] dropping some requests during broad crawls, where the links to crawl are extracted from web pages.

[1] dupefilter only logs the first duplicate request
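
For what it's worth, the DUPEFILTER_DEBUG setting makes the default dupefilter log every filtered request rather than only the first one:

# settings.py -- log all filtered duplicate requests instead of only the first
DUPEFILTER_DEBUG = True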

lborowczak commented 7 years ago

Hello, I am working with @jlbren and @creider, and we can't seem to reproduce this issue. The dupefilter does not appear to filter out requests when they are redirects. We set up a few simple servers that redirected to each other, or to another site which redirected to itself once, and Scrapy kept requesting them until it reached its redirect limit; no requests seemed to have been ignored. Is this issue fixed?

The scrapy output is available here (for a server redirecting to itself): http://pastebin.com/BvYPjJie

immerrr commented 7 years ago

Here's a simple example to reproduce it. It is a bit contrived, as it plays with query parameter ordering, but it does the job: a browser can open the page, and the spider cannot.

spider

# -*- coding: utf-8 -*-
import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"

    def start_requests(self):
        yield scrapy.Request(
            'http://localhost:8000/?foo=bar&qux=quux',
            callback=self.parse)

    def parse(self, response):
        pass

redirecting server

import BaseHTTPServer
from SimpleHTTPServer import SimpleHTTPRequestHandler

class RequestHandler(SimpleHTTPRequestHandler):
    protocol_version = "HTTP/1.1"

    def do_GET(self):
        if self.path == '/?foo=bar&qux=quux':
            self.send_response(302)
            self.send_header('Location', '/?qux=quux&foo=bar')
            self.send_header('Content-Length', 0)
            self.end_headers()
            return b''

        msg = b'Hello world'
        self.send_response(200)
        self.send_header('Content-Length', len(msg))
        self.end_headers()
        self.wfile.write(msg)
        self.finish()
        return msg

if __name__ == '__main__':
    httpd = BaseHTTPServer.HTTPServer(('127.0.0.1', 8000), RequestHandler)
    sa = httpd.socket.getsockname()
    print("Serving HTTP on", sa[0], "port", sa[1], "...")
    httpd.serve_forever()
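
The same server ported to Python 3, as a rough sketch using the standard library's http.server (the redirect logic is unchanged):

from http.server import HTTPServer, SimpleHTTPRequestHandler

class RequestHandler(SimpleHTTPRequestHandler):
    protocol_version = "HTTP/1.1"

    def do_GET(self):
        if self.path == '/?foo=bar&qux=quux':
            # Redirect to the same resource with the query parameters swapped.
            self.send_response(302)
            self.send_header('Location', '/?qux=quux&foo=bar')
            self.send_header('Content-Length', '0')
            self.end_headers()
            return

        msg = b'Hello world'
        self.send_response(200)
        self.send_header('Content-Length', str(len(msg)))
        self.end_headers()
        self.wfile.write(msg)

if __name__ == '__main__':
    httpd = HTTPServer(('127.0.0.1', 8000), RequestHandler)
    sa = httpd.socket.getsockname()
    print("Serving HTTP on", sa[0], "port", sa[1], "...")
    httpd.serve_forever()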

lborowczak commented 7 years ago

OK, I was able to reproduce this, but I am not sure it is actually a bug: the comment block in the request_fingerprint() method on line 19 of scrapy/utils/request.py specifically mentions this exact situation, and it says that the order should not matter because the URLs point to the same resource on the server. I think the only way to fix the case described above (query parameters in a different order) would be to remove the call to canonicalize_url on line 53, but I'm not sure that change should be made, as it would contradict the comments that are already there.

kmike commented 7 years ago

@lborowczak this is not necessarily an issue with canonicalize_url; that part looks correct. The problem is that relying on canonicalize_url alone makes Scrapy drop requests which shouldn't be dropped in case of redirects, and that is what should be fixed.

So ideally, we should keep dropping requests which are seen according to the existing canonicalize_url, with an exception for redirects. But this is tricky: passing dont_filter on redirect requests is not a solution, because we still want to filter out duplicate redirect targets, just not when their only duplicate is the URL the redirect happens from.

immerrr commented 7 years ago

@lborowczak the issue is not about canonicalize_url: it could just as well be a site that redirects you to a different page to set a session cookie and then redirects you back. The query-parameter trick is not a problem in itself; it was just a simple way to reproduce the bug.

I think the problem lies in the implementation: the dupefilter filters out requests under the assumption that the URL has already been visited and crawled, but we mark "requests" as seen, not "responses". See the difference? Redirects merely reveal the issue.

Consider the following situation. You issue a "redirect-loop" request which has not been seen before; the request enters the workflow, is marked as seen, gets a redirect response, and the follow-up request gets dropped by the dupefilter. What's the outcome? The request is marked as seen, but the response is never seen by the spider, and neither callback nor errback is called, because the dupefilter drops things silently.

The thing is, if you mark responses as "seen" instead, you can get multiple "parallel" requests scheduled for the same URL, which is also suboptimal. So ideally the dupefilter should mark unseen in-progress requests with some ID and put that ID into the requests_seen dict: there would be no concurrent requests for the same URL, but the request chain carrying that ID could still try to reach the URL (as long as REDIRECT_MAX_TIMES is not exceeded) and eventually convert the request_seen record into a permanent one (response_seen?).

immerrr commented 7 years ago

@kmike you beat me to it. again :)

ghost commented 6 years ago

I just stumbled over this issue myself and can only confirm the remarks of @immerrr.

In my case I was crawling a website which contains some pages with redirection loops. Normally I would expect the RedirectMiddleware to drop and report the request once the maximum number of redirections is reached, but with the dupefilter enabled there was no such behavior: the dupefilter simply dropped the request and reported it as filtered. This masked the real problem, and I was left wondering why pages were missing without any reported warnings. Without the dupefilter everything was reported as expected, although the pages with redirection loops were of course still not reachable.

kingname commented 6 years ago

I have met this problem too. In my case, I visit a URL, http://xxx.com/aaa.html, and on the first request the website redirects me to the login page at http://login.xxx.com/redirect?url=http://xxx.com/aaa.html. On the login page the website writes some cookies to Scrapy and then automatically redirects back to the original page. But this time the target URL gets filtered, because it was already visited the first time.

kmike commented 6 years ago

@kingname can you work around it by using dont_filter=True?
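
For example, a minimal sketch (the spider name is hypothetical and the URL is the placeholder from the comment above):

import scrapy

class LoginRedirectSpider(scrapy.Spider):
    name = "login_redirect_example"

    def start_requests(self):
        # dont_filter=True is preserved when the redirect middleware replaces
        # the request, so the A -> B -> A chain is not dropped by the dupefilter.
        yield scrapy.Request('http://xxx.com/aaa.html',
                             callback=self.parse,
                             dont_filter=True)

    def parse(self, response):
        self.logger.info("Got %s", response.url)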

kingname commented 6 years ago

Sure, I can, and I already have. But I think it is a bug and it should be solved. Why hasn't it been solved for almost 3 years?

I know that in some situations an anti-spider system will always redirect the spider to one page to protect its data, so this duplicate filter is reasonable. But is there some smarter method to solve my problem?

immerrr commented 6 years ago

@kmike what if the scheduler consults request.meta['redirect_urls'] and doesn't filter requests that have request.url in request.meta['redirect_urls'][:-1]?

It would introduce an implicit requirement on redirect middlewares, which would now need to populate redirect_urls correctly for this to work, but there should be no backward compatibility issues, right?
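
A minimal sketch of that check, written here as a dupefilter subclass rather than in the scheduler (the class name is hypothetical, and it assumes the stock RedirectMiddleware keeps populating redirect_urls as it does today):

from scrapy.dupefilters import RFPDupeFilter

class RedirectAwareDupeFilter(RFPDupeFilter):
    def request_seen(self, request):
        # A request whose URL appears earlier in its own redirect chain is let
        # through; REDIRECT_MAX_TIMES still bounds the loop.
        redirect_urls = request.meta.get('redirect_urls', [])
        if request.url in redirect_urls[:-1]:
            return False
        return super().request_seen(request)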

kmike commented 6 years ago

@kingname Yeah, that's a bug, and it should be solved. There are 300+ open issues in the bug tracker, and, as with any other large open source project, we have to prioritize what we work on ourselves and help the community contribute when their priorities are different from ours.

Your bug doesn't look the same as the one described in this issue, though. As I understand it, it is not caused by a redirect to the same page which happens automatically, but by a redirect which happens after submitting a form, right? The fix would be different in that case, probably something along the lines of https://github.com/scrapy/scrapy/issues/900, which would allow adding specific cookies to the request fingerprint.

The existing dont_filter solution doesn't look too bad to me either; it is common to use dont_filter for such requests.

@immerrr sounds good to me 👍 I think relying on a specific meta key is reasonable; if it is not populated we can consider it empty, and it'd work as before. Infinite redirect loops to the same page would be handled by the REDIRECT_MAX_TIMES option instead of the dupefilter (we may need to mention that in the release notes).

kingname commented 6 years ago

@kmike In fact, it is not after submitting a form. The login page just writes cookies to the spider and then redirects back again automatically. I do not submit anything; it is just like A -> B -> A.

I want to crawl page A, but the website redirects me to another page first and then redirects back to page A again a moment later.

Yes, dont_filter is one solution.

incognitozen commented 3 years ago

Came here from this SO thread. If your start_url has a 301 redirect, it pretty much closes the spider after crawling a single URL.

victor-torres commented 3 years ago

I've faced this problem today while writing a spider.

Workaround:

Since this spider was very simple, I've just disabled the dupefilter with:

class WebsiteSpider(scrapy.Spider):

    ...

    custom_settings = {
        "DUPEFILTER_CLASS": "scrapy.dupefilters.BaseDupeFilter",
    }

    ....

But it would be very interesting to have a way to disable this behavior, like what's being proposed in #4314, or something like dont_filter_redirects=True.

Gallaecio commented 1 year ago

Current solution based on https://github.com/scrapy/scrapy/pull/4314, and assuming the use of the default redirect downloader middleware and duplicate filter:

# settings.py
from logging import getLogger

from scrapy.downloadermiddlewares.redirect import RedirectMiddleware as _RedirectMiddleware
from scrapy.dupefilters import RFPDupeFilter
from scrapy.exceptions import IgnoreRequest
from scrapy.utils.request import request_fingerprint

logger = getLogger(__name__)

class DupeFilter(RFPDupeFilter):
    """Fork of the Scrapy duplicate filter patched with
    https://github.com/scrapy/scrapy/pull/4314"""

    def request_seen(self, request):
        fp = self.request_fingerprint(request)
        if fp in self.fingerprints:
            redirect_fps = request.meta.get('redirect_fingerprints', set())
            return fp not in redirect_fps
        self.fingerprints.add(fp)
        if self.file:
            self.file.write(fp + '\n')

class RedirectMiddleware(_RedirectMiddleware):
    """Fork of the Scrapy redirect middleware patched with
    https://github.com/scrapy/scrapy/pull/4314"""

    def _redirect(self, redirected, request, spider, reason):
        ttl = request.meta.setdefault('redirect_ttl', self.max_redirect_times)
        redirects = request.meta.get('redirect_times', 0) + 1
        if ttl and redirects <= self.max_redirect_times:
            redirected.meta['redirect_times'] = redirects
            redirected.meta['redirect_ttl'] = ttl - 1
            redirected.meta['redirect_urls'] = request.meta.get('redirect_urls', []) + [request.url]
            redirected.meta['redirect_reasons'] = request.meta.get('redirect_reasons', []) + [reason]
            fingerprints = request.meta.get('redirect_fingerprints', set())
            fingerprint = request_fingerprint(request)
            redirected.meta['redirect_fingerprints'] = fingerprints | {fingerprint}
            redirected.dont_filter = request.dont_filter
            redirected.priority = request.priority + self.priority_adjust
            logger.debug("Redirecting (%(reason)s) to %(redirected)s from %(request)s",
                         {'reason': reason, 'redirected': redirected, 'request': request},
                         extra={'spider': spider})
            return redirected
        else:
            logger.debug("Discarding %(request)s: max redirections reached",
                         {'request': request}, extra={'spider': spider})
            raise IgnoreRequest("max redirections reached")

DOWNLOADER_MIDDLEWARES = {
    "scrapy.downloadermiddlewares.redirect.RedirectMiddleware": None,
    RedirectMiddleware: 600,
}
DUPEFILTER_CLASS = DupeFilter
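
With these two components enabled, a redirected request whose fingerprint was already seen is only let through when that fingerprint comes from an earlier hop of its own redirect chain; unrelated duplicates are still filtered as before.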