I noticed that "python3 -m tarfile -x archive.tar" uses absolute paths by default, whereas the UNIX tar command doesn't: it requires the --absolute-paths (-P) option to be passed explicitly.
I suggest adding a boolean absolute_path option to tarfile, disabled by default.
Example of creating such an archive. Note that tar also removes the leading "/" by default and requires -P to be passed explicitly:
$ cd $HOME
# /home/haypo
$ echo TEST > test
$ tar -cf test.tar /home/haypo/test
tar: Removing leading `/' from member names
$ rm -f test.tar
$ tar -P -cf test.tar /home/haypo/test
$ rm -f test
Extracting such archive using tar is safe *by default*:
$ mkdir z
$ cd z
$ tar -xf ~/test.tar
tar: Removing leading `/' from member names
$ find
.
./home
./home/haypo
./home/haypo/test
Extracting such archive using Python is unsafe:
$ python3 -m tarfile -e ~/test.tar
$ cat ~/test
TEST
$ pwd
/home/haypo/z
Python creates files outside the current directory, which is unsafe, whereas tar doesn't.
The CLI was added in bpo-13477. I didn't see any discussion of traversal attacks there, so maybe it was overlooked. Perhaps there should also be a warning, like with the TarFile.extract and extractall methods.
However I did see one of the goals was to keep the CLI simple, which I agree with. I would suggest that the CLI get this proposed behaviour by default (matching the default behaviour of modern “tar” commands), with no option to restore the current less-robust behaviour.
To implement it, I suggest fixing the module internals first: bpo-21109 and/or bpo-17102.
FWIW BSD calls the option "--absolute-paths" (plural paths), https://www.freebsd.org/cgi/man.cgi?tar%281%29#OPTIONS, while Gnu calls it "--absolute-names", https://www.gnu.org/software/tar/manual/html_chapter/tar_6.html#SEC121. Both these options disable other checks, such as for parent directories (..) and external symbolic link targets, so I think the term "absolute" is too specific. But please at least replace the underscore with a dash or hyphen: "--absolute-path", not "--absolute_path".
Questions:

1. absolute-path is not a good name if we also test for .. path components.
2. What about the path parameter (of the extract functions)? Consider '/etc/passwd' in a tar file. Should this file be allowed to be written in safe mode when inside '/etc'? Should it be ok if running from the top-level '/' root folder? Should it only work if path='/' is set when calling an extract function of the tarfile module? The latter is something we could ask developers of restore utilities to do to continue being able to restore backups produced with absolute paths (and this would also be compatible with future backups stripping the leading /).
3. What about naming the parameter check_path? While this naming does not make explicit what is being checked, it can cover all future security issues arising from tarfile member names. Or would it be better to have a separate parameter for each check?
4. Should a check_path that returns True/False for whether to skip the entry (if an exception is desired the function can raise it - no need to define a return value for that), or that even allows changing the path (returning a string), be implemented in the same PR or should this be a separate PR?
5. Could an archive first create a symlink example --> / and then contain a member example/etc/passwd that overwrites /etc/passwd? To protect against this, one could call realpath(), but this would also evaluate pre-existing symlinks. Users may expect the old behaviour, e.g. when receiving updates via tar files for multiple folders where one of the folders has been moved to a different filesystem on the receiving system. To satisfy the expectation of these users while staying safe, we need a version of realpath() that does not consult the actual filesystem but only the folders and symlinks found in the tar file so far. [Edit: Yes, see https://github.com/python/cpython/issues/65308#issuecomment-1093650963]

Which versions are affected by this CVE?
To be honest, whenever path is non-empty, I would expect it to be regarded as "safe" to the extent that any final path must be contained inside the path the user gives. So essentially passing relative paths would be fine as long as they do not escape the directory given by the user of the API. E.g. if I give path "/", then tarfile is allowed to unpack anywhere. If I give path "", then there is no checking whatsoever (for backwards-compat reasons). When I give path "something-else", then trying to unpack outside "something-else" raises an exception.
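A minimal sketch of that containment rule (is_within_directory is an illustrative helper, not an existing tarfile API): normalise the joined path purely lexically and require that it stays inside the target directory.

```python
import os

def is_within_directory(directory, member_name):
    # Normalise the joined path without touching the filesystem, then
    # require that it is still inside the target directory.
    directory = os.path.abspath(directory)
    target = os.path.abspath(os.path.join(directory, member_name))
    return os.path.commonpath([directory, target]) == directory

assert is_within_directory('/tmp/dest', 'a/b')              # stays inside
assert not is_within_directory('/tmp/dest', '../x')         # escapes via ..
assert not is_within_directory('/tmp/dest', '/etc/passwd')  # absolute name
```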
Another possibility would be raising an audit event for the path names so you can write an audit hook that prevents unpacking unsafe paths. This implies no API changes.
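Sketching that idea (the per-member event name "tarfile.extract" and its argument layout are assumptions here, not an existing audit event):

```python
import os
import sys

def block_unsafe_extracts(event, args):
    # Hypothetical per-member event; assumed args = (member_name, dest_dir).
    if event == "tarfile.extract":
        member_name, dest = args
        dest = os.path.abspath(dest)
        target = os.path.abspath(os.path.join(dest, member_name))
        if os.path.commonpath([dest, target]) != dest:
            raise RuntimeError("blocked unsafe extraction path: %r" % (member_name,))

sys.addaudithook(block_unsafe_extracts)
```

Raising from the hook aborts the operation, so the existing extraction APIs could keep their signatures.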
@alittlesir Every version, everywhere. The documentation says you must check that paths are safe before unpacking.
As a commercial Python developer, I would greatly appreciate this becoming part of CPython.
So while "technically correct" python implementation was deemed "non-security issue" (#45385), the fact that it resoluted in thousands of vulnerable projects seems to indicate that in retrospect, it is an issue. Despite documentation warning. Like great, we have documentation which says to watch out, but here we are. People are not reading it. And all that for "being technically correct" and in support of a feature that hardly every anyone wants or needs.
An alternative, non-breaking option (and I really wonder which projects would be broken here) would be to have a subclass that does the protecting by default, in a similar fashion to the tarsafe project (https://github.com/beatsbears/tarsafe, but with a different implementation, as the one there seems quite inefficient).
I would prefer secure by default. tarfile is implicitly used by various APIs like shutil.unpack_archive, which does not check that TarInfo items are safe paths. Even the standard library is not following the tarfile module documentation.
I wrote a comment back in 2014 on the various security aspects of tarfile: https://github.com/python/cpython/issues/65308#issuecomment-1093650963 I think it is still worth a read.
@gustaebel yes. I was mostly talking about other higher-level tools in the standard library that build on top of tarfile and are not remotely secure either. I suggested adding an audit event for each unpacked TarInfo, which would easily allow gluing security on top without dropping all the nice things the standard library has gathered. As said, shutil.unpack_archive will automatically call into insecure code in tarfile behind your back after sniffing, based on the file extension, that it needs tarfile. This is not great design.
I'd like to point out that PyPI itself is at risk, given that PEP-458 for signed package metadata has still not been implemented a decade after it was proposed. In short, every PyPI package should be considered "untrusted". While pip does check for filename extraction paths, one could simply add an installation script that runs tarfile directly to achieve the exploit.
@jorhett the funny thing there is that this issue is about tarfile when in fact zipfile has the exact same vulnerability class considered by the CVE and PyPI has now converged around wheels that are essentially ZIP packages. Back when the CVE was filed tarfile was still used more.
The not-funny thing is that PyPI packaging is not cryptographically verifiable, there are no published instructions for validating PyPI packages, and yet Python maintainers consider it the user's responsibility to inspect the code before using it.
I do this for a living, and Python is the hardest to evaluate. It has the least verifications/validations of origin possible. So we have to build a VM and diff before/after to be certain of what it does. Yeah, all the people using Python plugins in their tools are responsible for taking this kind of effort 🤦
See also issue #94531 "Document zlib, gzip, bz2 and tarfile known vulnerabilities".
It seems to have a big impact. When are you going to fix this CVE? @vstinner
I don't plan to implement my proposed idea soon, but if someone does, I will review the change.
Anyone is free to propose a fix!
There are multiple proposals and various issues identified in comments here and in linked issues. Before implementing a solution, let's first read through all the comments and agree on a proposal. Where there is a trade-off between safety and breaking existing applications, the right balance needs to be found, including the possibility of scheduling changes for a later release and printing warnings about future changes in the meantime.
Delete the first slash ("/") character in the absolute path to make it a relative path. Are there any risks?
That's not a reliable approach. A better one is to normalize the paths inside the extract directory and make sure they don't lead outside it (which you can check by confirming '..' is not in the normalized path). There was a concern about breaking old API assumptions, though.
Delete the first slash ("/") character in the absolute path to make it a relative path. Are there any risks?
Yes, if the risks include breaking existing apps and/or overwriting files that should not have been overwritten. A backup/restore app that assumes all paths are absolute may not have moved to an appropriate working directory nor specified a target path when extracting its own archives.
I think the behaviour for absolute paths should be changed in two stages:
Edit: Behaviour (d): skip the entry.
[...] A better one is to normalize the paths inside the extract directory and make sure they don't lead outside it
As far as I understand the detailed comments linked above, this is also not enough, as the archive can first create a symlink foo to a directory /target and then write to foo/bar to overwrite /target/bar. However, if the user has a symlink foo to a location outside the tree before extraction starts and the archive wants to write to foo/bar, that should be accepted, as the user clearly wants the data in foo to be stored elsewhere.
I think a flexible approach can be implemented by passing a function or object as an extra parameter to the extract functions of the module. This function (or a function of the callable object) can raise an exception, return a modified path to use, or return None to skip the item silently. As paths are checked on the fly, this should also work when reading an archive from stdin.
For the convenience of the user of the module, we can provide functions that implement likely wanted behaviours. For detecting attempts to write using a symlink previously created by the archive, we need to keep state between calls, hence the suggestion to support passing a validator object. The validator currently considered the safest could be made available as tarfile.secure_path_check and become the default value in a future release.
Edit: To make the latter thread-safe, a new object would have to be created at each call of an extraction function so that two calls do not share the same validator object.
Edit 2: A "make the path safe" behaviour for ../ still being in the path after normalisation could be to replace each occurrence with __/.
Edit 3: Mention "callable object" as this may be the simplest way to support both functions and validator objects.
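A rough skeleton of the validator-object idea from this comment (all names are hypothetical, not an existing tarfile API):

```python
import os

class SecurePathCheck:
    # Hypothetical validator: an extraction function would call it once
    # per member; it may raise, return a modified path, or return None
    # to skip the member silently.
    def __init__(self, dest):
        self.dest = os.path.abspath(dest)

    def __call__(self, member_name):
        target = os.path.abspath(os.path.join(self.dest, member_name))
        if os.path.commonpath([self.dest, target]) != self.dest:
            raise ValueError("member escapes target folder: %r" % (member_name,))
        return target  # the path the extraction code should actually use
```

State for tracking symlinks created by earlier members would live in instance attributes, which is why each extraction call needs a fresh object to stay thread safe.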
<speaking without intimate knowledge of the module and its history, please excuse if I'm talking rubbish>
I fail to see why the discussion about lost functionality needs to drag on for so long. The Linux/UNIX/POSIX tar command recognized the danger and changed its default behaviour. IIRC, untarring now gives an error when '..' is encountered in a path and halts. That's the default. There might be a special flag added to emulate the old behaviour.
Isn't the solution as simple as: emulate the tar command? Just follow upstream changes?
See https://github.com/python/cpython/issues/97663 for an example of what tar now does.
@jowagner I see. Yes, I would expect symlinks to be resolved during normalisation, but maybe this is not reasonable in general. There could be multiple levels of symlinks.
[...] There could be multiple levels of symlinks.
Yes, the validator would have a bit of work to do to keep track of all the symlinks. There can be legitimate uses of archives containing a large number of symlinks, so bailing out after an unusually high number of symlinks (millions?) would be problematic. At the same time, memory usage of the validator could grow unacceptably for such archives.
If the target path is an empty folder at the start, we can just use os.path.realpath() instead of replicating symlink behaviour inside the validator. Therefore, a compromise could be to provide protection from symlinks only if the target folder is empty and to advise users to extract untrusted archives only into an empty folder. The validator would then simply call realpath() and not keep its own memory of symlinks. (We could also use realpath() as long as there is no symlink pointing outside the tree anywhere in the tree, but in general it is not a good idea to check the tree, because the tree could be big, much bigger than the archive.)
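Sketched out, that empty-target-folder check could be as small as this (hypothetical helper, not a tarfile API):

```python
import os

def check_path_realpath(dest, member_name):
    # Safe only when extraction started into an empty folder: every
    # symlink that realpath() resolves below dest was then created
    # from the archive itself.
    dest = os.path.realpath(dest)
    target = os.path.realpath(os.path.join(dest, member_name))
    if os.path.commonpath([dest, target]) != dest:
        raise ValueError("member resolves outside target: %r" % (member_name,))
    return target
```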
Isn't the solution as simple as: emulate the tar command? Just follow upstream changes?
This approach was rejected in 2007, see https://github.com/python/cpython/issues/45385#issuecomment-1093394689
Hence the attempt to move this forward as a feature that changes the default behaviour. I further suggest adding value beyond safety, namely allowing the module user to filter and modify the paths of archive members on the fly.
The existing interface (extractall) remains unchanged. A new interface is added that provides extraction of relative paths.
Has this CVE really had that much impact? The headline "Unpatched 15-year old Python bug allows code execution in 350k projects" is just a scare.
It seems that the community didn't think it was a CVE in 2007, am I right? So why do you think it's a security issue now? Is it because of the public's attention? @vstinner
@alittlesir @jowagner Maybe I can answer that.
2007 was a long time ago, and things change. What was perfectly acceptable in 2007 is now a CVE with a high severity score.
"It's not exploitable" was true in 2007. Now there is a video of (what is deemed) an exploit: https://youtu.be/jqs8S51_ENg
I'm not going to debate whether it's really an exploitable vulnerability or just a trick to make a user do something without elevating any privileges at all. I suggest everyone watch the video and draw their own conclusions.
I work for a bank, BTW. In my context, we'd patch the thing that even just smells like a vulnerability quickly, and worry about extra/new functionality later.
But all is well with the world. The debate can happen here, and the proposed filtering features can be added in due time, or not at all. In the meantime, the CVE is registered and scored as a 5.6. LINK Companies with a high level of security paranoia will have a code scanner in place that alerts on the use of the module, and they can act accordingly.
@Denelvo The tarfile API clearly states that you must manually sanitize filenames yourself. So does zipfile. What is more problematic is that shutil.unpack_archive does not expose an API to do this, so clearly it must sanitize files for you as a high-level API. It probably did not exist back in 2007.
@Denelvo I rather suggest everyone take a look at the actual code in Spyder IDE that is the basis for the exploit (from this blog article from Trellix). It uses TarFile.extractall(), but there would simply be no exploit if TarFile.extractfile() was used. The code execution itself comes from pickle.loads(). To quote the pickle documentation:
The pickle module is not secure. Only unpickle data you trust.
This is a security vulnerability in Spyder IDE.
Kind of, yes, but if you have code that doesn't sanitize unpacking paths, then unrelated code cannot trust any path on the entire hard disk anymore for safe unpickling. An even simpler attack on Linux could write a .bashrc file into the user's home directory for arbitrary code execution afterwards.
Making tarfile safe by default was never expected to solve all software problems such as unsafe use of the pickle module.
Can we move on to discussing what the PR should look like?
How about the following default behaviour:

1. The extract*() functions call a check_path function for each member, using the return value as the final path and skipping the member if the return value is empty or None.
2. For an empty target folder: a check_path function that raises an exception if the realpath is outside the target folder. This covers ../, / and symlink attacks.
3. For a non-empty target folder: a check_path function that raises an exception if the normalised path (without symlink resolution) is or may be leading outside the folder (adding "may be" because ../ does not lead outside the target folder /). This does not protect against symlink attacks, as trying to distinguish between pre-existing symlinks (that the user expects to be followed) and new symlinks from the archive would introduce new attack vectors. One could also argue to drop all checks for non-empty target folders, as a symlink attack can achieve everything the other two attacks can do.

At minimum, the tarfile module should provide a check_path function that restores the current unsafe behaviour. For convenience, this could be selected with check_path=None.
2. Can we use the abspath + commonprefix approach from https://gist.github.com/Kasimir123/9bdc38f2eac9a19f83ef6cc52c7ce82e?
I don't think that's enough: https://gist.github.com/Kasimir123/9bdc38f2eac9a19f83ef6cc52c7ce82e?permalink_comment_id=4358347#gistcomment-4358347
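One well-known pitfall (whether or not it is the one raised in that comment) is that os.path.commonprefix compares strings character by character rather than path components, so a sibling directory sharing a name prefix slips through; os.path.commonpath compares components instead:

```python
import os

base = '/tmp/extract'
evil = '/tmp/extract_evil/x'  # sibling directory sharing a string prefix

print(os.path.commonprefix([base, evil]))  # '/tmp/extract' -- wrongly passes
print(os.path.commonpath([base, evil]))    # '/tmp' -- mismatch, caught
```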
Digging deeper into the GNU tar behaviour with pre-existing symlinks, it seems that tar follows a pre-existing symlink when the file foo/bar is not preceded by a tarfile member foo of type directory. If there is such a member, the symlink is replaced by a new folder foo and bar is placed inside it. A symlink foo --> /tmp is extracted without complaint. Only when it is followed by a member using this symlink is an error Cannot open: Not a directory printed. (This error message is strange, as tar happily writes foo/bar through a pre-existing symlink if there is no member foo earlier in the archive. The error message is also printed when there is a symlink in the archive matching the pre-existing symlink.)
How does GNU tar do this? I think it must be storing all folders and symlinks in memory, as clearly the tarfile members are not evaluated in isolation and the state of the filesystem is not sufficient to explain the behaviour differences. Do we want this level of protection against symlink attacks where the target folder is non-empty (for empty target folders, we can simply check paths with realpath()), at the expense of possibly high memory usage during extraction?
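A compact sketch of that bookkeeping (purely illustrative, not how GNU tar is actually implemented):

```python
import posixpath

class ArchiveLinkTable:
    # Remember symlinks seen in the archive so far and resolve later
    # member names against this table only, never the real filesystem.
    def __init__(self):
        self.links = {}  # normalised archive path -> link target

    def add_symlink(self, name, target):
        self.links[posixpath.normpath(name)] = target

    def resolve(self, name):
        # Walk the name one component at a time, substituting recorded
        # link targets. A real implementation would iterate to a fixed
        # point and guard against link loops; memory grows with the
        # number of directories and symlinks in the archive.
        path = ''
        for part in posixpath.normpath(name).split('/'):
            path = posixpath.normpath(posixpath.join(path, part))
            if path in self.links:
                path = posixpath.normpath(self.links[path])
        return path
```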
Note that the symlink trick can also be applied to file symlinks; the tar file created by:
import tarfile
from pathlib import Path

with tarfile.open('foo.tar', 'w') as tf:
    # A symlink member pointing outside the extraction tree...
    ti = tarfile.TarInfo('foo')
    ti.type = tarfile.SYMTYPE
    ti.linkname = '../../../etc/passwd'
    tf.addfile(ti)
    # ...followed by a regular member of the same name, whose content
    # is written through the symlink on extraction.
    Path('foo').write_text('garbage')
    tf.add('foo')
will try to write passwd when extracted by extractall. It seems that tar x doesn't have this problem. The following naive fix plugs this hole (only tested manually):
index 169c88d63f..aa4513e510 100755
--- a/Lib/tarfile.py
+++ b/Lib/tarfile.py
@@ -2211,6 +2211,8 @@ def makefile(self, tarinfo, targetpath):
source = self.fileobj
source.seek(tarinfo.offset_data)
bufsize = self.copybufsize
+ if os.path.lexists(targetpath) and not os.path.isdir(targetpath):
+ os.unlink(targetpath)
with bltn_open(targetpath, "wb") as target:
if tarinfo.sparse is not None:
for offset, size in tarinfo.sparse:
(needs some more work, like a test, plus I suppose makefifo/makedev also need a corresponding fix)
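For illustration only (the actual method bodies may differ between versions), the corresponding guard in makefifo would presumably be the same two lines before the node is created:

```python
def makefifo(self, tarinfo, targetpath):
    """Make a fifo called targetpath."""
    # Same guard as in makefile(): never create the node through a
    # pre-existing non-directory (e.g. a symlink planted earlier in
    # the archive).
    if os.path.lexists(targetpath) and not os.path.isdir(targetpath):
        os.unlink(targetpath)
    if hasattr(os, "mkfifo"):
        os.mkfifo(targetpath)
    else:
        raise ExtractError("fifo not supported by system")
```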
FWIW, there is a spam bot PR campaign by @TrellixVulnTeam and @Kasimir123 (they should be banned) that has been going on since September and spams a large number of python projects with junk PRs with incorrect patches.
Examples include:
Hi All,
My name is Jonathan Leitschuh, I'm a Senior Software Security Researcher for Project Alpha Omega at the Open Source Security Foundation under the Linux Foundation.
I recently attended a talk by @Kasimir123 at ShmooCon about this vulnerability and this work. They recently opened 61,895 pull requests across the OSS ecosystem to fix this vulnerability as the Python maintainers have taken no action to fix this severe vulnerability after 6 years.
To be clear, vulnerabilities like TAR-slip (aka Zip-slip) can lead to remote code execution by allowing an attacker to write code into executable file locations. In the right contexts, this vulnerability has a CVSSv3.1 score of 10/10 (AV:N/AC:L/PR:N/UI:N/S:C/C:H/I:H/A:H).
While I believe that fixing the symlink problem is important, I believe that this should be broken down into two issues:
Given the first is far more likely to be exploited in the real world, I propose fixing it first, then focusing on the second problem.
This will ensure that end-users are protected against the most likely risk sooner, while the less likely risk discussion continues.
To provide an incentive: I would also like to note that any developer who resolves this security issue in the python standard library would most likely be eligible for a multi-thousand dollar financial reward under the Secure Open Source Rewards (https://sos.dev) project which is also operated out of the Open Source Security Foundation.
FWIW, I'm reading up on this issue, and I plan to post a proposal on Discourse this week to unblock things.
@encukou We have multiple proposals in the comments here and the linked issues. The problem is that no proposal has received positive feedback from the core CPython team to motivate one of us to go to the next step of implementing it and making a PR. Since a previous PR fixing the issue has been rejected, rewards from a new PR (financial or not) are uncertain. If you make yet another proposal, please link to it from here.
Yeah, it would be unfair if I ended up getting money for this, outside my normal salary. Any solution will necessarily be based on lots of prior work and discussion, by people who seem to have burned out trying to fix the issue. I won't apply for any bounty, and if I get one anyway I'll donate it to the PSF.
My proposal: https://discuss.python.org/t/23149
PEP-706, which adds filter rather than absolute_path, has been implemented in #102950. See the added docs.
Python 3.12, and security updates to some earlier releases, will allow users to avoid CVE-2007-4559 by changing their code/settings. Python 3.12 will emit a warning urging people to do that. Python 3.14 will fix CVE-2007-4559 by making safer behaviour the default.
@encukou have I missed something, or do these changes now break the shutil.unpack_archive abstraction layer for zipfile vs tarfile?
Yes. Sadly, in Python 3.12 and 3.13 you'll get a warning about changing behaviour if you use shutil.unpack_archive with a tarball.
You can ignore the warning, or inspect the archive and pass filter='data' if it's a tarball. Or, if you can change settings for the whole Python interpreter, set tarfile.TarFile.extraction_filter = tarfile.data_filter.
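For example, on Python 3.12+ the per-call opt-in looks like this:

```python
import tarfile

with tarfile.open('archive.tar') as tf:
    # 'data' rejects absolute names, '..' components and unsafe links.
    tf.extractall(path='dest', filter='data')
```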
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
GitHub fields:
```python
assignee = None
closed_at = None
created_at =
labels = ['type-security', 'library', '3.11']
title = '[Security] tarfile: Add absolute_path option to tarfile, disabled by default'
updated_at =
user = 'https://github.com/vstinner'
```
bugs.python.org fields:
```python
activity =
actor = 'ned.deily'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation =
creator = 'vstinner'
dependencies = []
files = []
hgrepos = []
issue_num = 29788
keywords = []
message_count = 2.0
messages = ['289388', '289437']
nosy_count = 5.0
nosy_names = ['lars.gustaebel', 'vstinner', 'jwilk', 'berker.peksag', 'martin.panter']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'security'
url = 'https://bugs.python.org/issue29788'
versions = ['Python 3.11']
```