Closed: mvsantos closed this issue 4 days ago
The explanation makes sense, but what about pages containing links to `file:///etc/passwd` directly in their HTML?
I think that by default `file://` URLs should not be scraped or indexed. But since access to local files can be useful when writing tests and contracts, a new setting key could be introduced in Scrapy's core to define whether scraping `file://` is allowed or not, leaving it up to the developer writing spiders to override that default and allow, at her or his own risk, the scraping and indexing of `file://` URLs.
And regarding the initial point, I believe that redirects should never follow `file://` URLs.
Is there any situation where a remote `file:///` link is useful for scraping?
@barraponto I don't believe `file://` is ever of any use, except when testing local files with no HTTP server running. I also believe that those use cases are so rare, and have easy natural alternatives, that having `file://` enabled by default is a security risk.
That's what I think as well. But I'd like to hear what @pablohoffman or @dangra have to say about it.
@mvsantos: I understand the issue, but I must say Scrapy defaults are set to ease development, and are tuned in production on a per-case basis.
As a general rule, it's recommended to run spiders in constrained environments, limited in memory, CPU and filesystem access, and these environments make access to local files a derisory issue.
Disabling `file://` by default would become a common gotcha for local development.
What about adding a "Recommendation for running spiders in production" page to docs?
@dangra I understand and appreciate your point about Scrapy's approach to easing development. Your suggestion of a "must read" page sounds good.
I understand that Scrapy's maintainers want to make it as easy as possible for newcomers, and that we (Scrapy users) are expected to do our homework when it comes to our environment's security.
But features that pose possible security risks and are not required by the system's core should be enabled when required, not by default.
So, in my opinion, leaving `file://` enabled by default is far from the derisory issue you claim it to be. Please note that I don't want to sound like a scaremonger, but right now Scrapy is leaving inexperienced users at risk of possible privilege escalation attacks.
I just re-read the whole issue. I think the initial proposal of disallowing HTTP redirections to non-http URLs makes sense.
Even if `file://` URLs are allowed by default Scrapy settings, links extracted from HTML pages won't work because of `OffsiteMiddleware` (enabled by default too).
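The filtering behavior @dangra describes can be sketched in standalone form. This is an illustrative reimplementation of the idea, not Scrapy's actual middleware code; `is_onsite` is a hypothetical helper:

```python
from urllib.parse import urlparse

def is_onsite(url, allowed_domains):
    """OffsiteMiddleware-style check (sketch): a request is onsite only
    when its hostname matches, or is a subdomain of, an allowed domain.
    file:// URLs have no hostname, so they can never match and are
    filtered out."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)

allowed = ["scrapy.org"]
print(is_onsite("http://sub.scrapy.org/1", allowed))  # True
print(is_onsite("file:///etc/passwd", allowed))       # False
```

Note this protection only applies when `allowed_domains` is set; spiders without it get no such filtering.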
`file://` requests are created on purpose when using `scrapy shell`, or sometimes from `start_urls`, and that will continue to work for developers.
@dangra Do you mean they shouldn't work from HTML pages because of `OffsiteMiddleware`? Or is that the current behavior?
That is the current behavior when the `allowed_domains` spider attribute is set (the default spider templates do set it).
I admit this is not the strong security-by-default you are expecting, but it minimizes this issue for newbies again.
@kmike, @redapple, @nramirezuy what do you think?
Following `file://` links on `http[s]://` pages looks wrong. Following them on `file://` pages, or having them in `start_urls`, looks fine.
Also, I don't quite get the scope of this issue: how would an attacker get the contents of /etc/passwd even if the spider downloaded it? Is it assumed that the spider has code that submits the downloaded response back to the attacker's server (maybe via FormRequest or something like that), or are there other scenarios where downloading `file://` files is bad?
Ah, I've read the CVE and it is clearer now, please disregard my question.
What about adding an `allowed_protocols` argument to the Request constructor? It would not include `file` by default, but the default `start_requests` implementation would create Requests with `file` included in the allowed protocols. This way `start_urls` would still work.
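As a sketch of this proposal (note `allowed_protocols` is hypothetical here, not an existing Scrapy Request argument), the check could look roughly like this, decoupled from the Request constructor:

```python
from urllib.parse import urlparse

# Hypothetical default: file is excluded unless explicitly opted into.
DEFAULT_ALLOWED = ("http", "https")

def is_request_allowed(url, allowed_protocols=DEFAULT_ALLOWED):
    """Sketch of the proposed allowed_protocols check: reject any URL
    whose scheme is not in the allowed set."""
    return urlparse(url).scheme in allowed_protocols

# Links extracted from remote pages: file:// rejected by default.
print(is_request_allowed("file:///etc/passwd"))  # False

# start_requests could opt in explicitly for developer-written URLs.
print(is_request_allowed("file:///tmp/fixture.html",
                         allowed_protocols=("http", "https", "file")))  # True
```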
Protocol checking should be done outside the Request constructor, because an unsupported link shouldn't cause an exception in the response handler.
@kmike Any service that scrapes content and outputs near-raw data could be targeted by an attacker. Examples, but not limited to:
Any of these services could be powered by Scrapy, and an attacker could exploit them by planting traps such as `<a href="file:///etc/passwd">foo</a>`, so that once the service iterates a sitemap or downloads whatever page the trap is deployed on, the target spits out the scraped data.
Just to clarify, the initial point I raised:
A. HTTP redirects and HTML meta refresh should only work with http[s]. So far, it looks like we all agree on this one.
B. Some kind of master switch could be introduced in the settings file, making it clear that allowing requests to `file://` may have unwanted results if `OffsiteMiddleware` is not defined, is disabled, or is open to any domain.
I wouldn't mind having a `DEBUG` or `TEST_MODE` setting in `settings.py`, similar to Django's `DEBUG`:

- `file://` would be allowed if it is set to `True`, with a default value of `False`;
- it would be set to `True` when `settings.py` is created with `scrapy startproject`.

Additionally, could `file://` be authorized by default only in specific commands like `scrapy shell` or `scrapy parse`?
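As a rough illustration of what such a toggle could look like in a project's `settings.py` (the setting name `FILE_SCHEME_ENABLED` is hypothetical; nothing like it exists in Scrapy):

```python
# settings.py (sketch; FILE_SCHEME_ENABLED is a hypothetical setting,
# not part of Scrapy)
BOT_NAME = "myproject"

# Off by default; `scrapy startproject` could template it to True for
# local development, mirroring Django's DEBUG convention.
FILE_SCHEME_ENABLED = False
```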
@redapple IMO this problem asks for a setting of its own, since there are security implications, and it is therefore beyond the scope of `DEBUG`.
@mvsantos my comment above suggested a name, but it could indeed be more explicit. Whatever its name, how do you feel about the three things this setting would do/authorize?
@redapple I personally don't like the idea of having security-sensitive settings open by default, as you suggest in the second point. But since Scrapy's approach is clearly to make it as easy as possible to get the first "HelloWorld" spider working, your three points should work fine.
Also, @dangra's idea of adding a "Recommendations for running spiders in production" page to the docs is great for newcomers like myself, to make sure we deploy Scrapy in a sane and secure fashion.
@dangra you are not protected by OffsiteMiddleware on redirects. But, as @kmike said, you need to send the file back to the server for it to be an issue.
@nramirezuy: OffsiteMW is not a great "protection", but to put my words to work, it won't let `file://` URLs pass:
```diff
diff --git a/scrapy/tests/test_spidermiddleware_offsite.py b/scrapy/tests/test_spidermiddleware_offsite.py
index f7523f7..2e58827 100644
--- a/scrapy/tests/test_spidermiddleware_offsite.py
+++ b/scrapy/tests/test_spidermiddleware_offsite.py
@@ -23,7 +23,8 @@ class TestOffsiteMiddleware(TestCase):
                        Request('http://sub.scrapy.org/1'),
                        Request('http://offsite.tld/letmepass', dont_filter=True)]
         offsite_reqs = [Request('http://scrapy2.org'),
-                        Request('http://offsite.tld/')]
+                        Request('http://offsite.tld/'),
+                        Request('file:///etc/passwd')]
         reqs = onsite_reqs + offsite_reqs
         out = list(self.mw.process_spider_output(res, reqs, self.spider))
```
test output:
```
$ trial scrapy/tests/test_spidermiddleware_offsite.py
scrapy.tests.test_spidermiddleware_offsite
  TestOffsiteMiddleware
    test_process_spider_output ...                                         [OK]
  TestOffsiteMiddleware2
    test_process_spider_output ...                                         [OK]
  TestOffsiteMiddleware3
    test_process_spider_output ...                                         [OK]
-------------------------------------------------------------------------------
Ran 3 tests in 0.057s
PASSED (successes=3)
```
@nramirezuy: OffsiteMW doesn't apply to redirects, right, but I was mentioning it in the context of links extracted from HTML pages. The idea with redirects is to disallow anything other than `https?://`.
Hi folks, any updates on this issue?
EDIT: here's what might happen when following redirects (from http:// to file://) is the default behavior: http://blog.detectify.com/post/82370846588/how-we-got-read-access-on-googles-production-servers
I think that `file://` links are fine only when it is the developer who wrote them. The most common case (if not the only one) for that is `start_requests` (and `start_urls`). It is convenient to have `file://` support in `start_requests`, but IMHO it should be disabled everywhere else.
I don't like solutions with some global switch (be it a setting or a middleware), for several reasons:
1) the developer could forget to flip this switch, which is a security issue in itself;
2) there are cases when OffsiteMiddleware should be off, and we shouldn't have a security issue in that case;
3) an explicitly written `file://` URL could be useful even in production (e.g. a sitemap bundled with a spider).
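The "sitemap bundled with a spider" case can be sketched in standalone form: the developer deliberately builds a `file://` URL for a local fixture, e.g. to feed `start_urls`. `Path.as_uri()` produces an absolute `file://` URL:

```python
import tempfile
from pathlib import Path

# Write a tiny sitemap fixture to a temporary file, standing in for a
# sitemap shipped alongside the spider's code.
with tempfile.NamedTemporaryFile("w", suffix=".xml", delete=False) as f:
    f.write("<urlset></urlset>")
    sitemap = Path(f.name)

# Developer-written file:// URL, e.g. for start_urls.
url = sitemap.as_uri()
print(url.startswith("file://"))  # True
```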
The solution in https://github.com/scrapy/scrapy/issues/457#issuecomment-28919683 doesn't have these problems. There is a gotcha with redirects, where `allowed_protocols` should be reset. I wonder if it is even worth making `Request.replace` take care of it, resetting `allowed_protocols` when the url is changed (and copying the existing replace to `Request._replace_unsafe`).
I'm OK with @kmike's `allowed_protocols` proposal.
`RedirectMiddleware` needs other changes too, for example removing the `proxy` key from `Request.meta` when the scheme changes, and also removing some HTTP headers (getting rid of proxy-specific headers, for example) in redirected requests.
This is a good opportunity to clean/tighten it up.
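The proxy-state cleanup could look roughly like this. This is a sketch, not Scrapy's actual `RedirectMiddleware` code; `sanitize_redirect` is a hypothetical helper, and a plain dict stands in for a Request object:

```python
from urllib.parse import urlparse

def sanitize_redirect(request, redirected_url):
    """Sketch: when a redirect changes the URL scheme, drop proxy state
    (the meta 'proxy' key and proxy-specific headers) so it can be
    re-resolved later in the downloader chain if still applicable."""
    old_scheme = urlparse(request["url"]).scheme
    new_scheme = urlparse(redirected_url).scheme
    meta = dict(request["meta"])
    headers = dict(request["headers"])
    if old_scheme != new_scheme:
        meta.pop("proxy", None)
        headers.pop("Proxy-Authorization", None)
    return {"url": redirected_url, "meta": meta, "headers": headers}

req = {"url": "http://example.com/a",
       "meta": {"proxy": "http://proxy:8080"},
       "headers": {"Proxy-Authorization": "Basic xyz"}}
out = sanitize_redirect(req, "https://example.com/b")
print("proxy" in out["meta"])  # False
```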
What's the core maintainers' take on this? CC @dangra , @nramirezuy , @pablohoffman , @kmike
@redapple I don't know if removing the proxy is a good idea; also, the cookies may not change. Did you check how the browser handles it?
@nramirezuy, if the redirected URL of a proxied request changes scheme but the proxy doesn't change (e.g. using an HTTPS proxy for an http:// URL), that feels weird and wrong, and python-requests doesn't do that. My comment was about removing the "proxy" key, so it gets added back later in the downloader chain if necessary. That's not strictly part of this issue.
I believe this is not a bug, but it could fit in as a security enhancement.
TL;DR: When following HTTP redirects, Scrapy should only follow http/https requests. Quite possibly similar to CVE-2009-0037 (http://curl.haxx.se/docs/adv_20090303.html).
Scenario:
a. A spider that simply stores raw or minimally sanitised data.
b. By default Scrapy follows HTTP redirects.
c. The values of both DOWNLOAD_HANDLERS and DOWNLOAD_HANDLERS_BASE are set to their defaults.
And let's say a malicious client might issue a redirect such as
And so, my spider will be "leaking" my own "/etc/passwd" file because Scrapy followed the 301 redirect from "http:// ... /trap.html" to "file:///etc/passwd".
Shouldn't scrapy enforce some sort of protocol restriction so that it only follows the http/https protocols when following redirects?
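The restriction the issue asks for can be sketched as follows (`safe_redirect_target` is a hypothetical helper, not Scrapy code): resolve the Location header against the current URL and refuse any scheme other than http/https, much like curl's fix for CVE-2009-0037.

```python
from urllib.parse import urljoin, urlparse

def safe_redirect_target(base_url, location):
    """Sketch: resolve a redirect Location against the current URL and
    reject any target whose scheme is not http or https."""
    target = urljoin(base_url, location)
    if urlparse(target).scheme not in ("http", "https"):
        raise ValueError("unsafe redirect target: %s" % target)
    return target

# Ordinary same-scheme redirect: allowed.
print(safe_redirect_target("http://example.com/trap.html", "/next"))

# A 301 pointing at file:///etc/passwd would be refused.
try:
    safe_redirect_target("http://example.com/trap.html", "file:///etc/passwd")
except ValueError as e:
    print("blocked:", e)
```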