immerrr opened 9 years ago
Another point to consider: what if there's a redirection loop? Limit the redirection chain? Also, how to report this to the user?
for redirection loops there is http://doc.scrapy.org/en/latest/topics/settings.html#redirect-max-times
//cc @sibiryakov - this is related to 'canonical URLs' feature you're working on for frontera; how are you solving it there?
Encountered this myself. To clarify:
```python
In [1]: from scrapy.utils.request import request_fingerprint

In [2]: import scrapy

In [3]: req1 = scrapy.Request('http://scrapy.org/foo')

In [4]: req2 = scrapy.Request('http://scrapy.org/foo/')

In [5]: request_fingerprint(req2) == request_fingerprint(req1)
Out[5]: False

In [6]: req1 = scrapy.Request('http://scrapy.org/')

In [7]: req2 = scrapy.Request('http://scrapy.org')

In [8]: request_fingerprint(req2) == request_fingerprint(req1)
Out[8]: True
```
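The asymmetry comes from URL canonicalization: the fingerprint is computed over the canonicalized URL, which normalizes an empty path to `/` but leaves trailing slashes on non-empty paths alone. A rough stdlib-only sketch of that behavior (a toy approximation for illustration, not Scrapy's actual `w3lib`-based implementation):

```python
import hashlib
from urllib.parse import urlsplit, urlencode, parse_qsl

def toy_canonicalize(url):
    # Toy approximation of canonicalization: sort query parameters and
    # normalize an empty path to "/". It does NOT add or strip trailing
    # slashes on non-empty paths, hence the asymmetry above.
    parts = urlsplit(url)
    path = parts.path or '/'
    query = urlencode(sorted(parse_qsl(parts.query)))
    return f"{parts.scheme}://{parts.netloc}{path}?{query}"

def toy_fingerprint(url):
    return hashlib.sha1(toy_canonicalize(url).encode()).hexdigest()

# Root URL with and without a slash collapse to the same fingerprint...
print(toy_fingerprint('http://scrapy.org') == toy_fingerprint('http://scrapy.org/'))   # True
# ...but a trailing slash on a non-empty path does not.
print(toy_fingerprint('http://scrapy.org/foo') == toy_fingerprint('http://scrapy.org/foo/'))  # False
```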
So the issue shows itself when the main page of a website redirects to add or remove a trailing slash. It is not that visible because such requests are most often sent from `start_requests` or `start_urls`, where the `dont_filter=True` flag is used. For the same reason it also works in `scrapy shell`: the shell's `fetch` command uses `dont_filter=True` as well.
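To illustrate the masking effect (a toy model of the scheduler for illustration, not Scrapy code): requests flagged with `dont_filter=True` never consult the dupefilter at all.

```python
# Toy model of the dupefilter bypass: start requests and shell fetches
# carry dont_filter=True, so the duplicate check is skipped entirely.
seen = set()

def enqueue(url, dont_filter=False):
    """Return True if the request is scheduled, False if filtered."""
    if not dont_filter and url in seen:
        return False
    seen.add(url)
    return True

print(enqueue('http://scrapy.org/foo'))                    # True, first time
print(enqueue('http://scrapy.org/foo'))                    # False, filtered
print(enqueue('http://scrapy.org/foo', dont_filter=True))  # True, bypasses the filter
```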
+1 to solve it somehow, I think it is an important issue. We are silently[1] dropping some requests during broad crawls when links to crawl are extracted from the web pages.
[1] dupefilter only logs the first duplicate request
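For debugging, Scrapy's `DUPEFILTER_DEBUG` setting makes the filter log every filtered request rather than only the first duplicate:

```python
# settings.py
DUPEFILTER_DEBUG = True  # log all filtered duplicates, not just the first
```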
Hello, I am working with @jlbren and @creider, and we can't seem to reproduce this issue. The dupefilter does not seem to filter out requests if they are redirects. We set up a few simple servers that redirected to each other, or to another website which redirected to itself once, and Scrapy kept requesting the websites until it reached its redirect count limit, but no requests seemed to be ignored. Is this issue fixed?
The scrapy output is available here (for a server redirecting to itself): http://pastebin.com/BvYPjJie
Here's a simple example to reproduce. It's a bit contrived, as it plays with query parameter ordering, but it does the job: a browser can open the page, and the spider cannot.
```python
# -*- coding: utf-8 -*-
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"

    def start_requests(self):
        yield scrapy.Request(
            'http://localhost:8000/?foo=bar&qux=quux',
            callback=self.parse)

    def parse(self, response):
        pass
```
```python
import BaseHTTPServer
from SimpleHTTPServer import SimpleHTTPRequestHandler


class RequestHandler(SimpleHTTPRequestHandler):
    protocol_version = "HTTP/1.1"

    def do_GET(self):
        if self.path == '/?foo=bar&qux=quux':
            self.send_response(302)
            self.send_header('Location', '/?qux=quux&foo=bar')
            self.send_header('Content-Length', 0)
            self.end_headers()
            return b''
        msg = b'Hello world'
        self.send_response(200)
        self.send_header('Content-Length', len(msg))
        self.end_headers()
        self.wfile.write(msg)
        self.finish()
        return msg


if __name__ == '__main__':
    httpd = BaseHTTPServer.HTTPServer(('127.0.0.1', 8000), RequestHandler)
    sa = httpd.socket.getsockname()
    print("Serving HTTP on", sa[0], "port", sa[1], "...")
    httpd.serve_forever()
```
OK, I was able to reproduce the issue, but I am not sure that this is actually a bug: the comment block in the `request_fingerprint()` method on line 19 in `scrapy/utils/request.py` specifically mentions this exact situation, and says that the order should not matter because the URLs should point to the same resource on the server. I think the only way to fix the behavior described above (query parameters in a different order) would be to remove the call to `canonicalize_url` from line 53, but I'm not sure this change should be made, as it would contradict the comments that are already there.
@lborowczak this is not necessarily an issue with `canonicalize_url`; it looks correct. The problem is that using `canonicalize_url` alone makes Scrapy drop requests which shouldn't be dropped in case of redirects, and this is what should be fixed.
So ideally, we should drop requests which have been seen according to the existing `canonicalize_url`, with an exception for redirects. But this is tricky; passing `dont_filter` on redirect requests is not a solution, because we still want to filter out duplicate redirect targets, just not when their only duplicate is the URL the redirect happens from.
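To make the tricky part concrete (a toy model for illustration, not Scrapy code): if redirected requests simply carried `dont_filter=True`, two different pages redirecting to a common target would both refetch it.

```python
# Toy model: blanket dont_filter on redirect follow-ups defeats
# deduplication of shared redirect *targets*.
seen = set()
fetched = []

def follow_redirect(target, dont_filter):
    if not dont_filter and target in seen:
        return  # filtered
    seen.add(target)
    fetched.append(target)

# /a and /b are different pages that both redirect to /common
follow_redirect('/common', dont_filter=True)  # redirect from /a
follow_redirect('/common', dont_filter=True)  # redirect from /b
print(fetched)  # ['/common', '/common'] -- the duplicate we wanted to filter
```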
@lborowczak the issue is not about `canonicalize_url`. It could just as well be a site that redirects you to a different page to set a session cookie and then redirects back; that in itself is not much of a problem, query parameter reordering was just a simple way to reproduce it.
I think the problem lies in the implementation: the dupefilter filters out requests under the assumption that the URL has already been visited and crawled, but we mark "requests" as seen, not "responses". See the difference? Redirects just reveal the issue.
Consider the following situation. You issue a "redirect-loop" request which has not been seen before; the request enters the workflow, is marked as seen, gets a redirect response, and the follow-up request gets dropped by the dupefilter. What's the outcome? The request is marked as seen, but the response is never seen by the spider; neither callback nor errback is called, because the dupefilter drops things silently.
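The sequence can be simulated in a few lines (a toy model of the scheduler, not the real implementation):

```python
# Toy simulation of the silent drop: a request is marked seen when it
# is *scheduled*, so a redirect that loops back to the same URL is
# filtered and no callback or errback ever fires.
seen = set()
callbacks_fired = []

def schedule(url):
    if url in seen:
        return False  # dupefilter drops it silently
    seen.add(url)     # marked seen before any response arrives
    return True

scheduled = schedule('http://example.com/loop')    # initial request goes out
# server answers 302 Location: http://example.com/loop
rescheduled = schedule('http://example.com/loop')  # follow-up is dropped
print(scheduled, rescheduled, callbacks_fired)     # True False []
```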
Thing is, if you mark responses as "seen", you can get multiple "parallel" requests scheduled for the same URL, which is also suboptimal. So ideally the dupefilter should mark unseen in-progress requests with some ID and put that ID into the `requests_seen` dict, so that there are no concurrent requests for the same URL, but the request chain marked with that ID can still try to reach that URL (as long as `REDIRECT_MAX_TIMES` is not reached) and convert the `request_seen` record into a permanent one (`response_seen`?).
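A minimal sketch of that idea (hypothetical names and data layout, not an existing Scrapy API): unseen requests are marked in progress with their chain's ID, and only a response makes the record permanent.

```python
# Hypothetical sketch: track in-flight request chains by ID so the same
# redirect chain may retry a URL, while concurrent chains and
# already-answered URLs are filtered.
DONE = object()
seen = {}  # fingerprint -> chain id while in flight, DONE once answered

def request_allowed(fp, chain_id):
    state = seen.get(fp)
    if state is None:
        seen[fp] = chain_id        # first sighting: mark in progress
        return True
    if state is DONE:
        return False               # a response for this URL was already seen
    return state == chain_id       # only the same redirect chain may retry

def response_seen(fp):
    seen[fp] = DONE                # convert into a permanent record

print(request_allowed('fp1', chain_id=1))  # True: fresh URL
print(request_allowed('fp1', chain_id=2))  # False: concurrent chain blocked
print(request_allowed('fp1', chain_id=1))  # True: same chain may retry
response_seen('fp1')
print(request_allowed('fp1', chain_id=1))  # False: permanently seen
```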
@kmike you beat me to it. again :)
I just stumbled over this issue myself and can only confirm the remarks of @immerrr.
In my case I was crawling a website which contains some pages with redirection loops. Normally I would assume that the RedirectMiddleware removes and reports the request after the maximum number of redirections is reached, but with the dupefilter enabled there was no such behavior: the dupefilter simply dropped the request and reported it as filtered. This masked the real problem, and I was left wondering why pages were missing without any reported warnings. Without the dupefilter everything was reported as expected, although the pages with redirection loops were of course still not reachable.
I have run into this problem too. In my case, I visit a URL, http://xxx.com/aaa.html; on the first visit, the website redirects me to the login page at http://login.xxx.com/redirect?url=http://xxx.com/aaa.html. The login page sets some cookies for Scrapy and then automatically redirects back to the original page. But this time the target URL is filtered out, because it was already visited the first time.
@kingname can you work around it by using dont_filter=True?
Surely I can, and I have done so. But I think it is a bug and it should be solved. Why hasn't it been solved in almost 3 years?
I know that in some situations an anti-spider system will always redirect the spider to one page to protect its data, so this duplicate filtering is reasonable. But is there a smarter way to solve my problem?
@kmike what if the scheduler consults `request.meta['redirect_urls']` and doesn't filter requests that have `request.url` in `request.meta['redirect_urls'][:-1]`?

It would kind of introduce an implicit requirement on the functionality of redirect middlewares, which would now be required to populate `redirect_urls` correctly for this to work, but there should be no backward compatibility issues, right?
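A sketch of how that check could look inside a `request_seen`-style method (the meta layout mirrors the proposal above; this is illustrative, not current Scrapy code):

```python
# Sketch of the redirect_urls-based exemption: a request whose own URL
# appears earlier in its redirect chain is a loop back to a page we
# were redirected away from, so it must not be filtered.
seen_urls = set()

def request_seen(url, meta):
    if url in meta.get('redirect_urls', [])[:-1]:
        return False          # redirect loop back to an earlier URL: allow it
    if url in seen_urls:
        return True           # ordinary duplicate: filter it
    seen_urls.add(url)
    return False

# A -> B -> A: the follow-up request for A carries the chain in its meta
request_seen('http://example.com/a', {})  # first visit, recorded
meta = {'redirect_urls': ['http://example.com/a', 'http://example.com/b']}
print(request_seen('http://example.com/a', meta))  # False: not filtered
print(request_seen('http://example.com/a', {}))    # True: plain duplicate still filtered
```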
@kingname Yeah, that's a bug, and it should be solved. There are 300+ open issues in the bug tracker, and, as with any other large open source project, we have to prioritize what to work on ourselves, and help the community contribute if their priorities are different from ours.
Your bug doesn't look the same as the one described in this issue, though. As I understand it, it is not caused by a redirect to the same page which happens automatically, but by a redirect which happens after submitting a form, right? The fix would be different in this case, probably something along the lines of https://github.com/scrapy/scrapy/issues/900, which would allow adding specific cookies to the request fingerprint.
The existing `dont_filter` solution doesn't look too bad to me either; it is common to use `dont_filter` for such requests.
@immerrr sounds good to me 👍 I think relying on a specific meta key is reasonable; if it is not populated we can consider it empty, and it'd work as before. Infinite redirect loops to the same page would be handled by REDIRECT_MAX_TIMES option instead of a dupefilter (we may need to mention that in the release notes).
@kmike In fact, it is not after submitting a form. The login page just writes cookies to the spider and then redirects back automatically. I do not submit anything; it is just A->B->A.
I want to crawl page A, but the website first redirects me to another page and then redirects back to page A a second later.
Yes, `dont_filter` is one solution.
Came here from this SO thread. If your start_url has a 301 redirect, the spider pretty much closes after crawling a single URL.
I've faced this problem today while writing a spider like this: a `FormRequest` to website.com/login for each credential, with `dont_filter=True`.
Workaround:
Since this Spider was very simple, I've just disabled my dupefilter with:
```python
class WebsiteSpider(scrapy.Spider):
    ...
    custom_settings = {
        "DUPEFILTER_CLASS": "scrapy.dupefilters.BaseDupeFilter",
    }
    ...
```
But it would be very interesting to have a way to disable this behavior, like what's being proposed in #4314, or something like a `dont_filter_redirects=True` flag.
Current solution based on https://github.com/scrapy/scrapy/pull/4314, and assuming the use of the default redirect downloader middleware and duplicate filter:
```python
# settings.py
from logging import getLogger

from scrapy.downloadermiddlewares.redirect import RedirectMiddleware as _RedirectMiddleware
from scrapy.dupefilters import RFPDupeFilter
from scrapy.exceptions import IgnoreRequest
from scrapy.utils.request import request_fingerprint

logger = getLogger(__name__)


class DupeFilter(RFPDupeFilter):
    """Fork of the Scrapy duplicate filter patched with
    https://github.com/scrapy/scrapy/pull/4314"""

    def request_seen(self, request):
        fp = self.request_fingerprint(request)
        if fp in self.fingerprints:
            redirect_fps = request.meta.get('redirect_fingerprints', set())
            return fp not in redirect_fps
        self.fingerprints.add(fp)
        if self.file:
            self.file.write(fp + '\n')


class RedirectMiddleware(_RedirectMiddleware):
    """Fork of the Scrapy redirect middleware patched with
    https://github.com/scrapy/scrapy/pull/4314"""

    def _redirect(self, redirected, request, spider, reason):
        ttl = request.meta.setdefault('redirect_ttl', self.max_redirect_times)
        redirects = request.meta.get('redirect_times', 0) + 1
        if ttl and redirects <= self.max_redirect_times:
            redirected.meta['redirect_times'] = redirects
            redirected.meta['redirect_ttl'] = ttl - 1
            redirected.meta['redirect_urls'] = request.meta.get('redirect_urls', []) + [request.url]
            redirected.meta['redirect_reasons'] = request.meta.get('redirect_reasons', []) + [reason]
            fingerprints = request.meta.get('redirect_fingerprints', set())
            fingerprint = request_fingerprint(request)
            redirected.meta['redirect_fingerprints'] = fingerprints | {fingerprint}
            redirected.dont_filter = request.dont_filter
            redirected.priority = request.priority + self.priority_adjust
            logger.debug("Redirecting (%(reason)s) to %(redirected)s from %(request)s",
                         {'reason': reason, 'redirected': redirected, 'request': request},
                         extra={'spider': spider})
            return redirected
        else:
            logger.debug("Discarding %(request)s: max redirections reached",
                         {'request': request}, extra={'spider': spider})
            raise IgnoreRequest("max redirections reached")


DOWNLOADER_MIDDLEWARES = {
    "scrapy.downloadermiddlewares.redirect.RedirectMiddleware": None,
    RedirectMiddleware: 600,
}
DUPEFILTER_CLASS = DupeFilter
```
Hit this when trying to run a spider against scrapinghub.com: sometimes it responds with a `302` redirect to scrapinghub.com itself. The scheduler agrees and tries to schedule another request for scrapinghub.com, but fails because the dupefilter already considers it visited.

Maybe the dupefilter should only record URLs when the response is not a redirect? And when it is, the scheduler should probably remember the original address too, so that the whole redirection chain can be marked as visited.
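A rough sketch of that suggestion (a toy model for illustration, not Scrapy code): record URLs as visited only once a final, non-redirect response arrives, and then mark the whole chain at once.

```python
# Toy model: dedupe on *responses*, not requests. In-flight URLs block
# concurrent duplicates, but only a final (non-redirect) response turns
# the whole redirect chain into permanent "visited" records.
visited = set()
in_flight = set()

def should_request(url):
    return url not in visited and url not in in_flight

def on_scheduled(url):
    in_flight.add(url)

def on_final_response(url, redirect_chain):
    chain = redirect_chain + [url]
    visited.update(chain)              # mark the whole chain as visited
    in_flight.difference_update(chain)

on_scheduled('http://site.example/')           # request the root...
print(should_request('http://site.example/'))  # False: already in flight
# ...which answers 301 Location: http://site.example/home
on_scheduled('http://site.example/home')
on_final_response('http://site.example/home',
                  redirect_chain=['http://site.example/'])
print(should_request('http://site.example/'))      # False: chain is visited
print(should_request('http://site.example/other')) # True: never seen
```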