scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License

Security enhancement when following a "redirect" #457

Closed mvsantos closed 4 days ago

mvsantos commented 10 years ago

I believe this is not a bug, but could fit in as a security enhancement.

TL;DR: When following HTTP redirects, Scrapy should only follow http/https URLs. Quite possibly similar to CVE-2009-0037: http://curl.haxx.se/docs/adv_20090303.html

Scenario:
a. A spider that simply stores raw or minimally sanitised data.
b. By default, Scrapy follows HTTP redirects.
c. Both DOWNLOAD_HANDLERS and DOWNLOAD_HANDLERS_BASE are set to their defaults:

DOWNLOAD_HANDLERS = {}
DOWNLOAD_HANDLERS_BASE = {
    'file': 'scrapy.core.downloader.handlers.file.FileDownloadHandler',
    'http': 'scrapy.core.downloader.handlers.http.HTTPDownloadHandler',
    'https': 'scrapy.core.downloader.handlers.http.HTTPDownloadHandler',
    's3': 'scrapy.core.downloader.handlers.s3.S3DownloadHandler',
    'ftp': 'scrapy.core.downloader.handlers.ftp.FTPDownloadHandler',
}

And let's say a malicious server issues a redirect such as:

# ==== .htaccess ====
RewriteEngine On
Redirect 301 /trap.html  file:///etc/passwd
# ==== ENDof .htaccess ====

And so, my spider ends up "leaking" my own "/etc/passwd" file, because Scrapy followed the 301 redirect from "http:// ... /trap.html" to "file:///etc/passwd".

Shouldn't Scrapy enforce some sort of protocol restriction so that it only follows http/https URLs when handling redirects?
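For illustration, here is a minimal sketch of how a project could enforce that itself with a custom downloader middleware that strips the Location header from cross-scheme redirects before the built-in RedirectMiddleware sees it. The class name and the priority number are my own assumptions, not anything that exists in Scrapy:

from urllib.parse import urljoin, urlparse

class HttpRedirectSchemeGuard:
    # Hypothetical downloader middleware: drop the Location header when a
    # redirect points at a non-http(s) URL, so the built-in RedirectMiddleware
    # never follows it. Register it with an order number above RedirectMiddleware's
    # default of 600 so its process_response runs first, e.g.:
    # DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.HttpRedirectSchemeGuard': 650}

    ALLOWED_SCHEMES = {'http', 'https'}

    def process_response(self, request, response, spider):
        if response.status in (301, 302, 303, 307, 308) and b'Location' in response.headers:
            target = urljoin(request.url, response.headers[b'Location'].decode('latin-1'))
            if urlparse(target).scheme not in self.ALLOWED_SCHEMES:
                del response.headers[b'Location']  # the redirect is simply not followed
        return response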

pablohoffman commented 10 years ago

The explanation makes sense, but what about pages containing links to file:///etc/passwd directly in their HTML?

mvsantos commented 10 years ago

I think that by default file:// should be neither scraped nor indexed. But since access to local files can be useful when writing tests and contracts, a new setting could be introduced to Scrapy's core that defines whether scraping file:// is allowed or not. That would leave it up to the developer writing spiders to override the default and, at their own risk, allow scraping and indexing of file:// URLs.

And regarding the initial point, I believe that redirects should never follow file:// URLs.
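(As a side note: in current Scrapy versions, a project that never needs local files can already opt out by disabling the handler in its own settings. This is shown here purely to illustrate the opt-in/opt-out idea discussed above.)

# settings.py -- opt out of file:// downloads for this project
DOWNLOAD_HANDLERS = {
    'file': None,  # assigning None to a scheme disables its download handler
}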

barraponto commented 10 years ago

Is there any situation where a remote file:/// link is useful for scraping?

mvsantos commented 10 years ago

@barraponto I don't believe file:// is ever of any use, except when testing local files with no HTTP server running. Those use cases are so rare, and have such easy alternatives, that having file:// enabled by default is a security risk.

barraponto commented 10 years ago

That's what I think as well. But I'd like to hear what @pablohoffman or @dangra have to say about it.

dangra commented 10 years ago

@mvsantos: I understand the issue, but I must say that Scrapy defaults are set to ease development and are tuned for production on a per-case basis.

As a general rule, it's recommended to run spiders in constrained environments, limited in memory, CPU and filesystem access, and such environments make access to local files a derisory issue.

Disabling file:// by default will become a common gotcha for local development. What about adding a "Recommendation for running spiders in production" page to the docs?

mvsantos commented 10 years ago

@dangra I understand and appreciate your point about Scrapy's approach to ease development. Your suggestion of a "must read" page sounds good.

I understand that Scrapy's maintainers want to make it as easy as possible for newcomers, and we (Scrapy users) are expected to do our homework when it comes to our environment's security. But features that pose possible security risks and are not required by the system's core should be enabled when needed, not by default. So, in my opinion, leaving file:// enabled by default is far from the derisory issue you claim it to be. Please note that I don't want to sound like a scaremonger, but right now Scrapy is leaving inexperienced users at risk of possible privilege escalation attacks.

dangra commented 10 years ago

I just re-read the whole issue. I think the initial proposal of disallowing HTTP redirections to non-HTTP URLs makes sense.

Even if file:// URLs are allowed by the default Scrapy settings, links extracted from HTML pages won't work because of OffsiteMiddleware (also enabled by default).

file:// requests are created on purpose when using scrapy shell, or sometimes from start_urls, and that will continue to work for developers.
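(For context, this is the kind of intentional, developer-written file:// usage being referred to; the paths and the spider below are made up for illustration:)

import scrapy

# e.g. scrapy shell file:///tmp/saved_page.html, or a spider run against a local fixture:
class LocalFixtureSpider(scrapy.Spider):
    name = 'local_fixture'
    start_urls = ['file:///tmp/saved_page.html']  # hypothetical local file

    def parse(self, response):
        yield {'title': response.css('title::text').get()}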

barraponto commented 10 years ago

@dangra Do you mean they shouldn't work from HTML pages because of OffsiteMiddleware? Or is that the current behavior?

dangra commented 10 years ago

That is the current behavior when the allowed_domains spider attribute is set (the default spider templates do set it).

I admit this is not the strong security-by-default you are expecting, but, again, it minimizes this issue for newbies.

@kmike, @redapple, @nramirezuy what do you think?

kmike commented 10 years ago

Following file:// links on http[s]:// pages looks wrong. Following them on file:// pages or having them in start_urls looks fine.

Also, I don't quite get the scope of this issue: how will the attacker get the contents of /etc/passwd even if the spider downloaded it? Is the assumption that the spider has code that submits the downloaded response back to the attacker's server (maybe via FormRequest or something like that), or are there other scenarios where downloading file:// files is bad?

kmike commented 10 years ago

Ah, I've read the CVE and it is clearer now, please disregard my question.

What about adding an allowed_protocols argument to the Request constructor? It would exclude file by default, but the default start_requests implementation would create Requests with file included in the allowed protocols. This way start_urls would still work.

kmike commented 10 years ago

Protocol checking should be done outside the Request constructor, because an unsupported link shouldn't cause an exception in the response handler.
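A rough sketch of how such a check could look as a downloader middleware, approximating the proposed allowed_protocols via request.meta; everything here is hypothetical and never became part of Scrapy:

from urllib.parse import urlparse

from scrapy.exceptions import IgnoreRequest

class AllowedProtocolsMiddleware:
    # Hypothetical downloader middleware: drop any request whose scheme is not
    # whitelisted. A spider could widen the whitelist per request, e.g.
    # Request(url, meta={'allowed_protocols': ('http', 'https', 'file')}).

    DEFAULT_ALLOWED = ('http', 'https')

    def process_request(self, request, spider):
        allowed = request.meta.get('allowed_protocols', self.DEFAULT_ALLOWED)
        if urlparse(request.url).scheme not in allowed:
            raise IgnoreRequest(f'disallowed scheme in {request.url}')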

mvsantos commented 10 years ago

@kmike Any service that scrapes content and outputs near-raw data could be targeted by an attacker. Examples include, but are not limited to:

Any of these services could be powered by Scrapy, and an attacker could exploit them by planting traps such as <a href="file:///etc/passwd">foo</a>, so that once the service iterates over a sitemap or downloads whatever page the trap is deployed on, the target spits out the scraped data.

Just to clarify the initial points I raised:
A. HTTP redirects and HTML meta refresh should only work with http[s]. So far, it looks like we all agree on this one.
B. Some kind of master switch could be introduced to the settings file, making it clear that allowing requests to file:// may lead to unwanted results if OffsiteMiddleware is not defined, is disabled, or is open to any domain.

redapple commented 10 years ago

I wouldn't mind having a DEBUG or TEST_MODE setting in settings.py similar to Django's DEBUG:

Additionally, could file:// be authorized by default only in specific commands like scrapy shell or scrapy parse?
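Just to make the shape of that idea concrete (TEST_MODE is hypothetical, not a real Scrapy setting):

# settings.py -- hypothetical switch gating local-file access
TEST_MODE = False  # flip to True for local development / contract tests

DOWNLOAD_HANDLERS = {
    # expose the file:// handler only in test mode; None disables it
    'file': 'scrapy.core.downloader.handlers.file.FileDownloadHandler' if TEST_MODE else None,
}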

mvsantos commented 10 years ago

@redapple IMO this problem asks for a setting of its own, since there are security implications that go beyond the scope of DEBUG.

redapple commented 10 years ago

@mvsantos my comment above suggested a name, but it could be more explicit indeed. How do you feel about the 3 things this setting would do/authorize, whatever its name?

mvsantos commented 10 years ago

@redapple I personally don't like the idea of having security settings open by default, as you suggest in the second topic. But given that Scrapy's approach is clearly to make it as easy as possible to get the first "HelloWorld" spider working, your 3 topics should work fine.

Also, @dangra's idea of adding a "Recommendation for running spiders in production" page to the docs is great for newcomers like myself, to make sure we deploy Scrapy in a sane and secure fashion.

nramirezuy commented 10 years ago

@dangra you are not protected by Offsite on redirects. But as @kmike said, you need to send the file back to the server for it to be an issue.

dangra commented 10 years ago

@nramirezuy: OffsiteMW is not a great "protection", but to put my words to work, it won't let file:// urls pass:

diff --git a/scrapy/tests/test_spidermiddleware_offsite.py b/scrapy/tests/test_spidermiddleware_offsite.py
index f7523f7..2e58827 100644
--- a/scrapy/tests/test_spidermiddleware_offsite.py
+++ b/scrapy/tests/test_spidermiddleware_offsite.py
@@ -23,7 +23,8 @@ class TestOffsiteMiddleware(TestCase):
                        Request('http://sub.scrapy.org/1'),
                        Request('http://offsite.tld/letmepass', dont_filter=True)]
         offsite_reqs = [Request('http://scrapy2.org'),
-                       Request('http://offsite.tld/')]
+                        Request('http://offsite.tld/'),
+                        Request('file:///etc/passwd')]
         reqs = onsite_reqs + offsite_reqs

         out = list(self.mw.process_spider_output(res, reqs, self.spider))

test output:

$ trial scrapy/tests/test_spidermiddleware_offsite.py
scrapy.tests.test_spidermiddleware_offsite
  TestOffsiteMiddleware
    test_process_spider_output ...                                         [OK]
  TestOffsiteMiddleware2
    test_process_spider_output ...                                         [OK]
  TestOffsiteMiddleware3
    test_process_spider_output ...                                         [OK]

-------------------------------------------------------------------------------
Ran 3 tests in 0.057s

PASSED (successes=3)

dangra commented 10 years ago

@nramirezuy: OffsiteMW doesn't apply to redirects, right, but I was mentioning it in the context of links extracted from HTML pages. The idea with redirects is to disallow anything other than https?://.

mvsantos commented 10 years ago

Hi folks, any updates regarding this issue?

EDIT: here's what might happen when following redirects from http:// to file:// is the default behavior: http://blog.detectify.com/post/82370846588/how-we-got-read-access-on-googles-production-servers

kmike commented 10 years ago

I think that file:// links are fine only when it is the developer who wrote them. The most common case (if not the only one) is start_requests (and start_urls). It is convenient to have file:// support in start_requests, but IMHO it should be disabled everywhere else.

I don't like solutions with some global switch (be it a setting or a middleware) for several reasons:

1) The developer should not forget to flip this switch; that is a security issue in itself.
2) There are cases where OffsiteMiddleware should be off, and we shouldn't have a security issue in that case.
3) Explicitly written file:// URLs could be useful even in production (e.g. a sitemap bundled with a spider).

The solution in https://github.com/scrapy/scrapy/issues/457#issuecomment-28919683 doesn't have these problems. There is a gotcha with redirects, where allowed_protocols should be reset. I wonder whether it's even worth making Request.replace take care of it, resetting allowed_protocols when the URL changes (and copying the existing replace to Request._replace_unsafe).
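To illustrate the replace() gotcha, a hypothetical subclass (none of this exists in Scrapy):

from scrapy import Request

class ProtocolAwareRequest(Request):
    # Hypothetical: replace() falls back to the safe default whenever the URL
    # changes (e.g. on a redirect), while _replace_unsafe() keeps the previous
    # allowed_protocols for callers that explicitly want that behaviour.

    def __init__(self, *args, allowed_protocols=('http', 'https'), **kwargs):
        self.allowed_protocols = allowed_protocols
        super().__init__(*args, **kwargs)

    def _replace_unsafe(self, *args, **kwargs):
        kwargs.setdefault('allowed_protocols', self.allowed_protocols)
        return super().replace(*args, **kwargs)

    def replace(self, *args, **kwargs):
        if kwargs.get('url', self.url) != self.url:
            kwargs['allowed_protocols'] = ('http', 'https')  # reset on URL change
        return self._replace_unsafe(*args, **kwargs)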

redapple commented 9 years ago

I'm ok with @kmike's allowed_protocols proposal.

RedirectMiddleware needs other changes too, for example removing the proxy key from Request.meta when the scheme changes, and also removing some HTTP headers (proxy-specific headers, for example) from redirected requests. This is a good opportunity to clean it up and tighten it.

What's the core maintainers' take on this? CC @dangra, @nramirezuy, @pablohoffman, @kmike

nramirezuy commented 9 years ago

@redapple I don't know if removing the proxy is a good idea; also, the cookies may not change. Did you check how browsers handle it?

redapple commented 9 years ago

@nramirezuy, if the redirected URL of a proxied request changes scheme and the proxy doesn't change (e.g. an HTTPS proxy keeps being used for an http:// URL), that feels weird and wrong, and python-requests doesn't do that. My comment was to remove the "proxy" key, so it can be added back later in the downloader chain if necessary. That's not strictly part of this issue.
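For concreteness, a small helper showing what "removing the proxy key" could mean in practice; this is illustrative only, not how Scrapy's RedirectMiddleware behaved at the time:

from urllib.parse import urlparse

def scrub_cross_scheme_redirect(original_request, redirected_request):
    # Illustrative helper: when a redirect changes the URL scheme, drop the
    # proxy selection and proxy-specific headers so they can be re-resolved
    # later in the downloader middleware chain (e.g. by HttpProxyMiddleware).
    if urlparse(redirected_request.url).scheme != urlparse(original_request.url).scheme:
        redirected_request.meta.pop('proxy', None)
        redirected_request.headers.pop(b'Proxy-Authorization', None)
    return redirected_request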

Gallaecio commented 4 days ago

https://github.com/scrapy/scrapy/security/advisories/GHSA-23j4-mw76-5v7h