python / cpython

The Python programming language
https://www.python.org

[Security] CVE-2007-4559: tarfile: Add absolute_path option to tarfile, disabled by default #73974

Closed vstinner closed 1 year ago

vstinner commented 7 years ago
BPO 29788
Nosy @gustaebel, @vstinner, @jwilk, @berkerpeksag, @vadmium

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

```python
assignee = None
closed_at = None
created_at =
labels = ['type-security', 'library', '3.11']
title = '[Security] tarfile: Add absolute_path option to tarfile, disabled by default'
updated_at =
user = 'https://github.com/vstinner'
```

bugs.python.org fields:

```python
activity =
actor = 'ned.deily'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation =
creator = 'vstinner'
dependencies = []
files = []
hgrepos = []
issue_num = 29788
keywords = []
message_count = 2.0
messages = ['289388', '289437']
nosy_count = 5.0
nosy_names = ['lars.gustaebel', 'vstinner', 'jwilk', 'berker.peksag', 'martin.panter']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'security'
url = 'https://bugs.python.org/issue29788'
versions = ['Python 3.11']
```

vstinner commented 7 years ago

I noticed that "python3 -m tarfile -x archive.tar" uses absolute paths by default, whereas the UNIX tar command doesn't: it requires explicitly adding the --absolute-paths (-P) option.

I suggest adding a boolean absolute_path option to tarfile, disabled by default.

Example of creating such an archive. Note that tar also removes the leading "/" by default and requires passing -P explicitly:

$ cd $HOME
# /home/haypo
$ echo TEST > test
$ tar -cf test.tar /home/haypo/test
tar: Removing leading `/' from member names

$ rm -f test.tar
$ tar -P -cf test.tar /home/haypo/test
$ rm -f test

Extracting such archive using tar is safe *by default*:

$ mkdir z
$ cd z
$ tar -xf ~/test.tar
tar: Removing leading `/' from member names
$ find
.
./home
./home/haypo
./home/haypo/test

Extracting such archive using Python is unsafe:

$ python3 -m tarfile -e ~/test.tar
$ cat ~/test
TEST
$ pwd
/home/haypo/z

Python creates files outside the current directory, which is unsafe, whereas tar doesn't.
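
The proposed absolute_path=False behaviour amounts to stripping the leading slash the way tar does before extraction. A minimal sketch of that idea (strip_absolute is a hypothetical helper, not part of the tarfile API):

```python
import os


def strip_absolute(name):
    """Mimic tar's default: drop a leading '/' (and any drive prefix on
    Windows) from an archive member name so extraction stays relative.
    Hypothetical helper, not part of the tarfile API."""
    drive, tail = os.path.splitdrive(name)  # no-op on POSIX
    return tail.lstrip("/")
```
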

vadmium commented 7 years ago

The CLI was added in bpo-13477. I didn’t see any discussion of traversal attacks there, so maybe it was overlooked. Perhaps there should also be a warning, like with the Tarfile.extract and “extractall” methods.

However I did see one of the goals was to keep the CLI simple, which I agree with. I would suggest that the CLI get this proposed behaviour by default (matching the default behaviour of modern “tar” commands), with no option to restore the current less-robust behaviour.

To implement it, I suggest to fix the module internals first: bpo-21109 and/or bpo-17102.

FWIW BSD calls the option “--absolute-paths” (plural paths, https://www.freebsd.org/cgi/man.cgi?tar%281%29#OPTIONS), while GNU calls it “--absolute-names” (https://www.gnu.org/software/tar/manual/html_chapter/tar_6.html#SEC121). Both of these options disable other checks as well, such as for parent directories (..) and external symbolic link targets, so I think the term “absolute” is too specific. But please at least replace the underscore with a hyphen: “--absolute-path”, not “--absolute_path”.

jowagner commented 2 years ago

Questions:

  1. Can we use https://bugs.python.org/file8339/insecure_pathnames.diff (found in issue #45385) as a basis for a PR? How do we give credit if @gustaebel is not the contributor but just another user with the same surname?
  2. Can we use the abspath + commonprefix approach from https://gist.github.com/Kasimir123/9bdc38f2eac9a19f83ef6cc52c7ce82e?
  3. Changing behaviour can break existing software, e.g. restoring a backup that uses absolute paths. How much notice should be given in this case? On the one hand, the issue should be fixed soon as it impacts security. On the other hand, what are a few years more after 15 years? What is the process for ensuring the defaults are changed for a specific target version or year?
  4. absolute-path is not a good name if we also test for .. path components.
  5. In safe mode, should we (a) bail out as soon as an absolute path is encountered, (b) bail out only if the absolute path takes us outside the hierarchy of the target directory (the current working directory by default), (c) only if the effective target directory is not '/', or (d) only if '/' is not explicitly specified as the target directory (the path parameter of the extract functions)? Consider '/etc/passwd' in a tar file. Should this file be allowed to be written in safe mode when running inside '/etc'? Should it be OK when running from the top-level '/' root folder? Should it only work if path='/' is set when calling an extract function of the tarfile module? The latter is something we could ask developers of restore utilities to do so they can continue restoring backups produced with absolute paths (and this would also be compatible with future backups stripping the leading /).
  6. The patch in issue #45385 suggests a parameter check_path. While this naming does not make explicit what is being checked, it can cover all future security issues arising from tarfile member names. Or would it be better to have a separate parameter for each check?
  7. Should the idea of letting the user provide a check_path function that returns True/False for whether to skip the entry (if an exception is desired, the function can raise it; no need to define a return value for that), or that even allows changing the path (by returning a string), be implemented in the same PR, or should this be a separate PR?
  8. Could a specially crafted tar file cause the tarfile module to create a symlink to a directory outside the target tree, e.g. example --> / and then contain a member example/etc/passwd that overwrites /etc/passwd? To protect against this, one could call realpath(), but this would also evaluate pre-existing symlinks. Users may expect the old behaviour, e.g. when receiving updates via tar files for multiple folders and one of the folders has been moved to a different filesystem on the receiving system. To satisfy the expectation of these users and stay safe, we need a version of realpath() that does not consult the actual filesystem but only the folders and symlinks found in the tar file so far. [Edit: Yes, see https://github.com/python/cpython/issues/65308#issuecomment-1093650963]
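
On question 2: os.path.commonprefix compares strings character by character, so it would treat /tmp/foo as containing /tmp/foobar; os.path.commonpath avoids that pitfall. A minimal sketch of such a containment check (a hypothetical helper, and no defence against the symlink scenario in question 8):

```python
import os


def is_within_directory(directory, target):
    """Return True if `target`, interpreted relative to `directory`,
    lies inside `directory` after lexical resolution.

    Uses os.path.commonpath rather than the string-based commonprefix
    (which would wrongly accept /tmp/foobar as inside /tmp/foo).
    Does NOT defend against symlink members created by the archive."""
    abs_dir = os.path.abspath(directory)
    abs_target = os.path.abspath(os.path.join(directory, target))
    return os.path.commonpath([abs_dir, abs_target]) == abs_dir
```
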
zhuofeng6 commented 2 years ago

Which versions are affected by this CVE?

nanonyme commented 2 years ago

To be honest, whenever path is non-empty, I would expect it to be regarded as "safe" to the extent that any final path must be contained inside the path the user gives. So essentially passing relative paths would be fine as long as they do not escape the directory given by the user of the API. So e.g. if I give path "/", then tarfile is allowed to unpack anywhere; if I give path "", there is no checking whatsoever (for backwards-compatibility reasons); and if I give path "something-else", then trying to unpack outside "something-else" raises an exception.
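
These semantics could be sketched like this (check_member is a hypothetical helper, not a proposed tarfile API):

```python
import os


def check_member(path_arg, member_name):
    """Sketch of the semantics proposed above: an empty `path` keeps the
    old unchecked behaviour, '/' explicitly permits anything, and any
    other path must lexically contain the extracted member."""
    if path_arg == "":
        return True                     # backwards compatible: no checking
    if path_arg == "/":
        return True                     # user explicitly allowed anywhere
    root = os.path.abspath(path_arg)
    final = os.path.abspath(os.path.join(root, member_name))
    return final == root or final.startswith(root + os.sep)
```
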

nanonyme commented 2 years ago

Another possibility would be raising an audit event for the path names, so you could write an audit hook that prevents unpacking unsafe paths. This implies no API changes.
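
A sketch of that idea using sys.addaudithook. Note that tarfile does not currently raise a per-member audit event, so the event name "tarfile.extract_member" below is an assumed addition, not an existing event:

```python
import sys


def block_unsafe_members(event, args):
    """Audit hook rejecting unsafe member names. The event name
    "tarfile.extract_member" is hypothetical; tarfile would need to be
    changed to raise it for each extracted member."""
    if event == "tarfile.extract_member":
        name = args[0]
        if name.startswith("/") or ".." in name.split("/"):
            raise RuntimeError("unsafe archive member: %r" % name)


# Hooks cannot be removed once installed; they apply process-wide.
sys.addaudithook(block_unsafe_members)
```
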

nanonyme commented 2 years ago

@alittlesir everything everywhere. The documentation says you must check that paths are safe before unpacking.

rooterkyberian commented 2 years ago

As a commercial python developer I would greatly appreciate this to become part of CPython.

So while the "technically correct" Python implementation was deemed a "non-security issue" (#45385), the fact that it resulted in thousands of vulnerable projects seems to indicate that, in retrospect, it is an issue, despite the documentation warning. Great, we have documentation that says to watch out, but here we are: people are not reading it. And all that for "being technically correct" and in support of a feature that hardly anyone ever wants or needs.

An alternative, non-breaking option (and I really wonder which projects would be broken here) would be a subclass that does the protecting by default, in similar fashion to the tarsafe project (https://github.com/beatsbears/tarsafe, but with a different implementation, as the one there seems quite inefficient).

nanonyme commented 2 years ago

I would prefer secure by default. tarfile is implicitly used by various APIs like shutil.unpack_archive, which does not check that TarInfo items are safe paths. Even the standard library is not following the tarfile module documentation.

gustaebel commented 2 years ago

I wrote a comment back in 2014 on the various security aspects of tarfile: https://github.com/python/cpython/issues/65308#issuecomment-1093650963 I think it is still worth a read.

nanonyme commented 2 years ago

@gustaebel yes. I was mostly talking about other, higher-level tools in the standard library that build on top of tarfile and are not remotely secure either. I suggested adding an audit event for each unpacked TarInfo, which would easily allow gluing security on top without dropping all the nice things the standard library has gathered. As said, shutil.unpack_archive will automatically call into insecure code in tarfile behind your back, after sniffing from the file extension that it needs tarfile. This is not great design.

jorhett commented 2 years ago

I'd like to point out that PyPI itself is at risk, given that PEP-458 for signed package metadata has still not been implemented a decade after it was proposed. In short, every PyPI package should be considered "untrusted". While pip does check extraction paths for filenames, one could simply add an installation script that runs tarfile directly to achieve the exploit.

nanonyme commented 2 years ago

@jorhett the funny thing there is that this issue is about tarfile, when in fact zipfile has the exact same vulnerability class covered by the CVE, and PyPI has now converged on wheels, which are essentially ZIP packages. Back when the CVE was filed, tarfile was still used more.

jorhett commented 2 years ago

The not-funny thing is that PyPI packaging is not cryptographically verifiable, there are no published instructions for validating PyPI packages, and yet Python maintainers consider it the user's responsibility to inspect the code before using it.

I do this for a living, and Python is the hardest to evaluate: it has the fewest verifications/validations of origin possible. So we have to build a VM and diff before/after to be certain of what it does. Yeah, all the people using Python plugins in their tools are responsible for taking this kind of effort 🤦

vstinner commented 2 years ago

See also issue #94531 "Document zlib, gzip, bz2 and tarfile known vulnerabilities".

zhuofeng6 commented 2 years ago

It seems to have a big impact. When are you going to fix this CVE? @vstinner

vstinner commented 2 years ago

It seems to have a big impact. When are you going to fix this CVE? @vstinner

I don't plan to implement my proposed idea soon, but if someone does, I will review the change.

Anyone is free to propose a fix!

jowagner commented 2 years ago

There are multiple proposals and various issues identified in the comments here and in linked issues. Before implementing a solution, let's first read through all the comments and agree on a proposal. Where there is a trade-off between safety and breaking existing applications, the right balance needs to be found, including the possibility of scheduling changes for a later release and printing warnings about future changes in the meantime.

zhuofeng6 commented 2 years ago

Delete the first slash (“/”) character in the absolute path to make it a relative path. Are there any risks?

nanonyme commented 2 years ago

That's not a reliable approach. A better one is to normalise paths inside the extraction directory and make sure they don't lead outside it (which you can see by '..' not remaining in the normalised path). There was a concern about breaking old API assumptions, though.
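
A sketch of that normalisation check (a hypothetical helper, not part of tarfile):

```python
import os.path


def escapes_destination(member_name):
    """Sketch of the normalisation check described above: normalise the
    member name and reject anything that is absolute or that still
    climbs out via '..' after normalisation."""
    norm = os.path.normpath(member_name)
    return (
        os.path.isabs(norm)
        or norm == ".."
        or norm.startswith(".." + os.sep)
    )
```
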

jowagner commented 2 years ago

Delete the first slash (“/”) character in the absolute path to make it a relative path. Are there any risks?

Yes, if the risks include breaking existing apps and/or overwriting files that should not have been overwritten. A backup/restore app that assumes all paths are absolute may not have moved to an appropriate working directory, nor specified a target path, when extracting its own archives.

I think the behaviour for absolute paths should be changed in two stages:

  1. Print a warning when encountering the first absolute path in an archive, saying that the behaviour is scheduled to change and where to find more information. Also add an option to the command-line tool and a parameter to the module's extract functions to control the behaviour (no warning is printed if this new parameter is used), allowing the user to request one of: (a) abort / raise an exception when encountering an absolute path, (b) make it a relative path by deleting the first slash (@alittlesir's suggestion), or (c) keep the absolute path, attempting to write where the archive says to write.
  2. Change the default behaviour to raising an exception if an absolute path is encountered and neither of the other behaviours has been requested.

Edit: Behaviour (d): skip the entry.

jowagner commented 2 years ago

[...] A better one is to normalise paths inside the extraction directory and make sure they don't lead outside it

As far as I understand the detailed comments linked above, this is also not enough, as the archive can first create a symlink foo to a directory /target and then write to foo/bar to overwrite /target/bar. However, if the user has a symlink foo to a location outside the tree before extraction starts and the archive wants to write to foo/bar, that should be accepted, as the user clearly wants the data in foo to be stored elsewhere.

I think a flexible approach can be implemented by passing a function or object as an extra parameter to the extract functions of the module. This function (or a function of the callable object) can raise an exception, return a modified path to use or return None to skip the item silently. As paths are checked on the fly, this should also work when reading an archive from stdin.

For the convenience of the user of the module, we can provide functions that implement likely wanted behaviours. For detecting attempts to write using a symlink previously created by the archive, we need to keep state between calls, hence the suggestion to support passing a validator object. The validator currently considered the safest could be made available as tarfile.secure_path_check and become the default value in a future release.

Edit: To make the latter thread-safe, a new object would have to be created at each call of an extraction function so that two calls do not share the same validator object.

Edit 2: A "make the path safe" behaviour for ../ still being in the path after normalisation could be to replace each occurrence with __/.

Edit 3: Mention "callable object" as this may be the simplest way to support both functions and validator objects.
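
A rough sketch of what such a stateful validator object could look like; everything here (class name, method names, skip/rewrite/raise conventions) is hypothetical, following the contract described above:

```python
class PathValidator:
    """Hypothetical validator object, not part of the tarfile API.

    check() may return the path unchanged, return a rewritten path,
    return None to skip the member, or raise. State is kept between
    calls so that symlinks created earlier in the archive can be
    rejected as parents of later members."""

    def __init__(self):
        self._symlinks = set()

    def note_symlink(self, name):
        """Record a symlink member the archive has created."""
        self._symlinks.add(name.rstrip("/"))

    def check(self, name):
        if name.startswith("/"):
            return None                      # skip absolute paths silently
        parts = name.split("/")
        if ".." in parts:
            raise ValueError("path climbs out of the tree: %r" % name)
        # Refuse to write through a symlink the archive itself created.
        for i in range(1, len(parts)):
            if "/".join(parts[:i]) in self._symlinks:
                raise ValueError(
                    "member path goes through an archived symlink: %r" % name)
        return name
```
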

Denelvo commented 2 years ago

<speaking without intimate knowledge of the module and its history, please excuse if I'm talking rubbish>

I fail to see why the discussion about lost functionality needs to drag on for so long. The Linux/UNIX/POSIX tar command recognised the danger and changed its default behaviour. IIRC, untarring now gives an error and halts when '..' is encountered in a path. That's the default; there may be a special flag to emulate the old behaviour.

Isn't the solution as simple as: emulate the tar command? Just follow upstream changes?

See https://github.com/python/cpython/issues/97663 for an example of what tar now does.

nanonyme commented 2 years ago

@jowagner I see. Yes, I would expect symlinks to be resolved during normalisation, but maybe this is not reasonable in general. There could be multiple levels of symlinks.

jowagner commented 2 years ago

[...] There could be multiple levels of symlinks.

Yes, the validator would have a bit of work to do to keep track of all the symlinks. There can be legitimate uses of archives containing a large number of symlinks, so bailing out after an unusually high number of symlinks (millions?) would be problematic. At the same time, memory usage of the validator could grow unacceptably for such archives.

If the target path is an empty folder at the start, we can just use os.path.realpath() instead of replicating symlink behaviour inside the validator. Therefore, a compromise could be to provide protection from symlinks only if the target folder is empty, and to advise users to extract untrusted archives only into an empty folder. The validator would then simply call realpath() and not keep its own memory of symlinks. (We could also use realpath() as long as there is no symlink pointing outside the tree anywhere in the tree, but in general it is not a good idea to check the whole tree, because the tree could be big, much bigger than the archive.)
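
A sketch of that realpath()-based check (a hypothetical helper). As noted above, it is only safe to rely on when the target folder starts out empty, since realpath() consults the real filesystem and would also follow pre-existing symlinks:

```python
import os


def realpath_is_inside(target_dir, member_name):
    """Return True if `member_name`, resolved through the real
    filesystem, stays inside `target_dir`. Only trustworthy when
    `target_dir` is empty before extraction begins."""
    root = os.path.realpath(target_dir)
    candidate = os.path.realpath(os.path.join(root, member_name))
    return candidate == root or candidate.startswith(root + os.sep)
```
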

jowagner commented 2 years ago

Isn't the solution as simple as: emulate the tar command? Just follow upstream changes?

This approach was rejected in 2007, see https://github.com/python/cpython/issues/45385#issuecomment-1093394689

Hence the attempt to move this forward as a feature that changes the default behaviour. I further suggest adding value beyond safety, namely allowing the module user to filter and modify the paths of archive members on the fly.

zhuofeng6 commented 2 years ago

The existing interface (extractall) remains unchanged. An interface is added to provide extraction with relative paths.

zhuofeng6 commented 2 years ago

Has this CVE really not had that much impact? "Unpatched 15-year-old Python bug allows code execution in 350k projects" is just a scare. It seems that the community didn't think it was a CVE in 2007, am I right? So why do you think it's a security issue now? Is it because of the public's attention? @vstinner

Denelvo commented 2 years ago

@alittlesir @jowagner Maybe I can answer that.

2007 was a long time ago, and things change. What was perfectly acceptable in 2007 is now a CVE with a high severity score.

"It's not exploitable" was true in 2007. Now there is a video of (what is deemed) an exploit: https://youtu.be/jqs8S51_ENg

I'm not going to debate whether it's really an exploitable vulnerability, or just a trick to make a user do something without elevating any privileges at all. I suggest everyone watch the video and draw their own conclusions.

I work for a bank, BTW. In my context, we'd quickly patch anything that even just smells like a vulnerability, and worry about extra/new functionality later.

But all is well with the world. The debate can happen here, and the filtering features can be added in due time or not at all. In the meantime, the CVE is registered and scored as a 5.6. Companies with a high level of security paranoia will have a code scanner in place that alerts on the use of the module, and they can act accordingly.

nanonyme commented 2 years ago

@Denelvo the tarfile API clearly states you must sanitise filenames yourself, and so does zipfile. What is more problematic is that shutil.unpack_archive does not expose an API to do this, so as a high-level API it clearly should sanitise files for you. It probably did not exist back in 2007.

gustaebel commented 2 years ago

@Denelvo I rather suggest everyone take a look at the actual code in Spyder IDE that is the basis for the exploit (from this blog article from Trellix).

  1. A tar archive is extracted in its entirety to the filesystem, just like that. No checks whatsoever.
  2. Actually, only one file from the archive is needed. Not only is using TarFile.extractall() completely unnecessary, but there would simply be no exploit if TarFile.extractfile() had been used.
  3. The file in question turns out to be a pickle dump which is opened and loaded via pickle.loads().

To quote the pickle documentation:

The pickle module is not secure. Only unpickle data you trust.

This is a security vulnerability in Spyder IDE.

nanonyme commented 2 years ago

Kind of, yes. But if you have code that doesn't sanitise unpacking paths, then unrelated code cannot trust any path on the entire hard disk for safe unpickling anymore. An even simpler attack on Linux would be to write a .bashrc file into the user's home directory for arbitrary code execution afterwards.

jowagner commented 2 years ago

Making tarfile safe by default was never expected to solve all software problems such as unsafe use of the pickle module.

Can we move on to discussing what the PR should look like?

How about the following default behaviour:

At minimum, the tarfile module should provide a check_path function that restores the current unsafe behaviour. For convenience, this could be selected with check_path=None.

nsoranzo commented 2 years ago
2. Can we use the abspath + commonprefix approach from https://gist.github.com/Kasimir123/9bdc38f2eac9a19f83ef6cc52c7ce82e?

I don't think that's enough: https://gist.github.com/Kasimir123/9bdc38f2eac9a19f83ef6cc52c7ce82e?permalink_comment_id=4358347#gistcomment-4358347

jowagner commented 2 years ago

Digging deeper into GNU tar's behaviour with pre-existing symlinks, it seems that tar follows a pre-existing symlink when the file foo/bar is not preceded by a tarfile member foo of type directory. If there is such a member, the symlink is replaced by a new folder foo and bar is placed inside it. A symlink foo --> /tmp is extracted without complaint. Only when it is followed by a member using this symlink is an error Cannot open: Not a directory printed. (This error message is strange, as tar happily writes foo/bar through a pre-existing symlink if there is no member foo earlier in the archive. The error message is also printed when the archive contains a symlink matching the pre-existing symlink.)

How does GNU tar do this? I think it must be storing all folders and symlinks in memory, as the tarfile members are clearly not evaluated in isolation and the state of the filesystem is not sufficient to explain the behaviour differences. Do we want this level of protection against symlink attacks where the target folder is non-empty (for empty target folders we can simply check paths with realpath()), at the expense of possibly high memory usage during extraction?

mandolaerik commented 2 years ago

Note that the symlink trick can be applied also to file symlinks; the tar file created by:

import tarfile
from pathlib import Path
with tarfile.open('foo.tar', 'w') as tf:
    ti = tarfile.TarInfo('foo')
    ti.type = tarfile.SYMTYPE
    ti.linkname = '../../../etc/passwd'
    tf.addfile(ti)
    Path('foo').write_text('garbage')
    tf.add('foo')

will try to overwrite passwd when extracted by extractall. It seems that tar -x doesn't have this problem. The following naive fix plugs the hole (only tested manually):

index 169c88d63f..aa4513e510 100755
--- a/Lib/tarfile.py
+++ b/Lib/tarfile.py
@@ -2211,6 +2211,8 @@ def makefile(self, tarinfo, targetpath):
         source = self.fileobj
         source.seek(tarinfo.offset_data)
         bufsize = self.copybufsize
+        if os.path.lexists(targetpath) and not os.path.isdir(targetpath):
+            os.unlink(targetpath)
         with bltn_open(targetpath, "wb") as target:
             if tarinfo.sparse is not None:
                 for offset, size in tarinfo.sparse:

(needs some more work, like a test, plus I suppose makefifo/makedev also need a corresponding fix)
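
For reference, the overwrite can be demonstrated safely against a sandboxed victim file rather than /etc/passwd; a regression test for the naive fix above would assert the opposite outcome (victim left intact). On interpreters that have PEP 706 filters, filter='fully_trusted' is passed explicitly to reproduce the historical behaviour:

```python
import io
import os
import tarfile
import tempfile

base = tempfile.mkdtemp()
victim = os.path.join(base, "victim.txt")
with open(victim, "w") as f:
    f.write("original")

# Build the malicious archive in memory: a symlink member 'foo' pointing
# at the victim, followed by a regular member 'foo' carrying the payload.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tf:
    link = tarfile.TarInfo("foo")
    link.type = tarfile.SYMTYPE
    link.linkname = "../victim.txt"
    tf.addfile(link)
    payload = b"garbage"
    member = tarfile.TarInfo("foo")
    member.size = len(payload)
    tf.addfile(member, io.BytesIO(payload))
buf.seek(0)

dest = os.path.join(base, "extract")
os.mkdir(dest)
# Reproduce the historical behaviour explicitly where filters exist.
kwargs = {"filter": "fully_trusted"} if hasattr(tarfile, "fully_trusted_filter") else {}
with tarfile.open(fileobj=buf) as tf:
    tf.extractall(dest, **kwargs)

# The second member was written through the symlink, clobbering the victim.
clobbered = open(victim).read()
```
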

pombredanne commented 1 year ago

FWIW, there is a spam-bot PR campaign by @TrellixVulnTeam and @Kasimir123 (they should be banned) that has been going on since September, spamming a large number of Python projects with junk PRs containing incorrect patches.

Examples include:

JLLeitschuh commented 1 year ago

Hi All,

My name is Jonathan Leitschuh, I'm a Senior Software Security Researcher for Project Alpha Omega at the Open Source Security Foundation under the Linux Foundation.

I recently attended a talk by @Kasimir123 at ShmooCon about this vulnerability and this work. They recently opened 61,895 pull requests across the OSS ecosystem to fix this vulnerability as the Python maintainers have taken no action to fix this severe vulnerability after 6 years.

To be clear, vulnerabilities like TAR-slip (aka Zip-slip) can lead to remote code execution by allowing an attacker to write code into executable file locations. In the right contexts, this vulnerability has a CVSSv3.1 score of 10/10 (AV:N/AC:L/PR:N/UI:N/S:C/C:H/I:H/A:H).


While I believe that fixing the symlink problem is important, I believe this should be broken down into two issues:

  1. Destination escape by path traversal
  2. Destination escape by symlink

Given that the first is far more likely to be exploited in the real world, I propose fixing it first, then focusing on the second problem.

This will ensure that end users are protected against the most likely risk sooner, while discussion of the less likely risk continues.


To provide an incentive: I would also like to note that any developer who resolves this security issue in the Python standard library would most likely be eligible for a multi-thousand-dollar financial reward under the Secure Open Source Rewards project (https://sos.dev), which is also operated out of the Open Source Security Foundation.

encukou commented 1 year ago

FWIW, I'm reading up on this issue, and I plan to post a proposal on Discourse this week to unblock things.

jowagner commented 1 year ago

@encukou We have multiple proposals in the comments here and in the linked issues. The problem is that no proposal has received positive feedback from the core CPython team to motivate one of us to go to the next step of implementing it and making a PR. Since a previous PR fixing the issue was rejected, rewards from a new PR (financial or otherwise) are uncertain. If you make yet another proposal, please link to it from here.

encukou commented 1 year ago

Yeah, it would be unfair if I ended up getting money for this, outside my normal salary. Any solution will necessarily be based on lots of prior work and discussion, by people who seem to have burned out trying to fix the issue. I won't apply for any bounty, and if I get one anyway I'll donate it to the PSF.

encukou commented 1 year ago

My proposal: https://discuss.python.org/t/23149

encukou commented 1 year ago

PEP-706, which adds a filter parameter rather than absolute_path, has been implemented in #102950. See the added docs.

Python 3.12, and security updates to some earlier releases, will allow users to avoid CVE-2007-4559 by changing their code/settings. Python 3.12 will emit a warning urging people to do that. Python 3.14 will fix CVE-2007-4559 by making safer behaviour the default.
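
A sketch of opting in on current releases, guarded since the filter feature only exists on Python 3.12+ and in the security backports to some earlier lines:

```python
import io
import tarfile
import tempfile

# Build an archive with a path-traversal member.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tf:
    info = tarfile.TarInfo("../escape.txt")
    info.size = 0
    tf.addfile(info, io.BytesIO(b""))
buf.seek(0)

blocked = None  # None: this interpreter predates PEP 706 filters
if hasattr(tarfile, "data_filter"):
    blocked = False
    with tarfile.open(fileobj=buf) as tf:
        try:
            tf.extractall(tempfile.mkdtemp(), filter="data")
        except tarfile.FilterError:
            blocked = True  # the 'data' filter rejects the traversal
```
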

nanonyme commented 1 year ago

@encukou have I missed something, or do these changes now break the shutil.unpack_archive abstraction layer over zipfile vs. tarfile?

encukou commented 1 year ago

Yes. Sadly, in Python 3.12 and 3.13 you'll get a warning about changing behaviour if you use shutil.unpack_archive with a tarball. You can ignore the warning, or inspect the archive and pass filter='data' if it's a tarball. Or if you can change settings of the whole Python interpreter, set tarfile.TarFile.extraction_filter = tarfile.data_filter.
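
The interpreter-wide setting from the last sentence, sketched with a guard for older interpreters; the staticmethod wrapper matches the idiom shown in the tarfile docs (a plain function assigned as a class attribute would be treated as a method):

```python
import tarfile

# Opt the whole interpreter into safe extraction, covering indirect
# users such as shutil.unpack_archive. Requires a Python with PEP 706
# filters: 3.12+, or the backports shipped in older security releases.
if hasattr(tarfile, "data_filter"):
    tarfile.TarFile.extraction_filter = staticmethod(tarfile.data_filter)
```
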