pypa / pip

The Python package installer
https://pip.pypa.io/
MIT License
9.47k stars 3.01k forks source link

pip overwrites existing files unconditionally during installation #4625

Open davidedelvento opened 7 years ago

davidedelvento commented 7 years ago

Description:

I'm installing packages with with pip, and it happens that some of these packages (with different names) have files with name clashes. Unfortunately, pip silently ignores the issue and the results depend on the order in which the installs happens. In an ideal world, I would expect this to be impossible, i.e. to force the package maintainer to use a unique namespace. As a (far worse, but better than current behavior) second choice, I would expect pip to warn the user before silently overwriting existing files, especially if pip itself installed those files.

What I've run:

Create a clean slate to work on:

/tmp $ virtualenv NAMECLASH
New python executable in NAMECLASH/bin/python
Installing setuptools, pip...done.
/tmp $ cd NAMECLASH
/tmp/NAMECLASH $ source bin/activate
(NAMECLASH)/tmp/NAMECLASH $ pip show -f pyjwt
(NAMECLASH)/tmp/NAMECLASH $ pip show -f jwt
(NAMECLASH)/tmp/NAMECLASH $ 

Install packages:

(NAMECLASH)/tmp/NAMECLASH $ pip install jwt
Downloading/unpacking jwt
  Downloading jwt-0.5.2.tar.gz
  Running setup.py (path:/tmp/NAMECLASH/build/jwt/setup.py) egg_info for package jwt

Downloading/unpacking cryptography==1.7.2 (from jwt)
  Downloading cryptography-1.7.2.tar.gz (420kB): 420kB downloaded
  Running setup.py (path:/tmp/NAMECLASH/build/cryptography/setup.py) egg_info for package cryptography
...
Successfully installed jwt cryptography typing idna pyasn1 six setuptools enum34 ipaddress cffi pycparser
Cleaning up...
(NAMECLASH)/tmp/NAMECLASH $ 

Lots of irrelevant stuff removed from log above (for the sake of making this issue easier to read). Analyzing what has been installed:

(NAMECLASH)/tmp/NAMECLASH $ pip show -f jwt
---
Name: jwt
Version: 0.5.2
Location: /tmp/NAMECLASH/lib/python2.7/site-packages
Requires: cryptography, typing
Files:
  ../jwt/jws.py
  ../jwt/jwk.py
  ../jwt/exceptions.py
  ../jwt/jwkset.py
  ../jwt/utils.py
  ../jwt/jwa.py
  ../jwt/jwt.py
  ../jwt/__init__.py
  ../jwt/jws.pyc
  ../jwt/jwk.pyc
  ../jwt/exceptions.pyc
  ../jwt/jwkset.pyc
  ../jwt/utils.pyc
  ../jwt/jwa.pyc
  ../jwt/jwt.pyc
  ../jwt/__init__.pyc
  ./
  PKG-INFO
  requires.txt
  SOURCES.txt
  dependency_links.txt
  top_level.txt

(NAMECLASH)/tmp/NAMECLASH $ grep Invalid /tmp/NAMECLASH/lib/python2.7/site-packages/jwt/exceptions.py
class InvalidKeyTypeError(JWTException):

Now install silently conflicting package, and analyzing what happened:

(NAMECLASH)/tmp/NAMECLASH $ pip install pyjwt
Downloading/unpacking pyjwt
  Downloading PyJWT-1.5.2-py2.py3-none-any.whl
Installing collected packages: pyjwt
Successfully installed pyjwt
Cleaning up...
(NAMECLASH)/tmp/NAMECLASH $ pip show -f pyjwt
---
Name: PyJWT
Version: 1.5.2
Location: /tmp/NAMECLASH/lib/python2.7/site-packages
Requires: 
Files:
Cannot locate installed-files.txt
(NAMECLASH)/tmp/NAMECLASH $ grep Invalid /tmp/NAMECLASH/lib/python2.7/site-packages/jwt/exceptions.py
class InvalidTokenError(Exception):
class DecodeError(InvalidTokenError):
class ExpiredSignatureError(InvalidTokenError):
class InvalidAudienceError(InvalidTokenError):
class InvalidIssuerError(InvalidTokenError):
class InvalidIssuedAtError(InvalidTokenError):
class ImmatureSignatureError(InvalidTokenError):
class InvalidKeyError(Exception):
class InvalidAlgorithmError(InvalidTokenError):
class MissingRequiredClaimError(InvalidTokenError):
InvalidAudience = InvalidAudienceError
InvalidIssuer = InvalidIssuerError

As you can see, pip silently allow the two packages to overwrite each other's jwt/exceptions.py file (and in particular the InvalidKeyTypeError is now gone, creating a whole lot of problem, depending on the order in which pip install ran)

darr1s commented 7 years ago

Is there a workaround for now?

RonnyPfannschmidt commented 7 years ago

the name clash pretty much is the fault ofthe package authors ^^ pip cant help bad packagers and pypi is not curated

davidedelvento commented 7 years ago

@RonnyPfannschmidt So pip is so dumb that it cannot even check if it's overwriting files (better if it checked the directory/egg/package exist)? You are correct that pip can't help with name clashes "in the wild", but in my opinion it must warn the user when a name clash is happening on the user machine.

Especially because pip itself, during automatic dependencies resolving, could install a package that the user did not explicitly request and clash with another one.

pradyunsg commented 7 years ago

The fact that the names clash is a packaging issue. The fact that pip installs such packages is (arguably) a pip bug.

I agree that pip should be printing a warning; I won't want pip to refuse to overwrite files and fail... Off the top of my head, I imagine this would break namespace package installation (or at least need them to special cased).


Is there a workaround for now?

I imagine the only "fix" to this issue is that the conflicting packages resolve such a conflict.


To me, this issue as a tracker for a warning being added to pip, that's printed when pip overwrites an existing file, when it doesn't expect it to happen.

RonnyPfannschmidt commented 7 years ago

Any case of overwriting a file that is not a namespace declaration is likely a packaging bug

pradyunsg commented 7 years ago

Any case of overwriting a file that is not a namespace declaration is likely a packaging bug

Agreed.

pip warning about the fact that it's overwriting is nicer than pip silently breaking the installation; even though the flaw is in the package... pip is the one taking the action that breaks the environment, even if it's because of the flawed packaging.

pradyunsg commented 7 years ago

FTR; The fix for this issue IMO would be that pip starts printing warnings when it overwrites non-namespace declaration files.

I feel that until a PR comes around, it doesn't make sense to discuss the ramifications of this change.

pfmoore commented 7 years ago

Agreed. There's a whole load of corner cases / unusual usages that would have to be considered (upgrades, ``--target``` installs, for example) and it's much easier to debate details like that with actual code.

tlandschoff-scale commented 6 years ago

We just got bitten by this - again. I finally researched for an issue for this as I expected that this is just an oversight and will be fixed anyway, if I report it or not.

Consider using two library packages using ipaddress but because we are still at Python 2 a backport is used. There is py2-ipaddress and ipaddress on PyPI. So one library package uses the first and the second the latter (different google filter bubbles, I guess).

Usually this worked out like this (without using the intermediate packages):

torsten.landschoff@horatio:~$ virtualenv --python python2 pip-nameclash
[...]
Installing setuptools, pip, wheel...done.
torsten.landschoff@horatio:~$ . pip-nameclash/bin/activate
(pip-nameclash) torsten.landschoff@horatio:~$ pip install ipaddress
[...]
Successfully installed ipaddress-1.0.18
(pip-nameclash) torsten.landschoff@horatio:~$ pip install py2-ipaddress
[...]
Successfully installed py2-ipaddress-3.4.1
(pip-nameclash) torsten.landschoff@horatio:~$ python -c "import ipaddress;print(ipaddress.ip_address('127.0.0.1'))"
127.0.0.1

However, with less luck this is the result:

torsten.landschoff@horatio:~$ . pip-nameclash/bin/activate
(pip-nameclash) torsten.landschoff@horatio:~$ pip install -q py2-ipaddress
(pip-nameclash) torsten.landschoff@horatio:~$ pip install -q ipaddress
(pip-nameclash) torsten.landschoff@horatio:~$ python -c "import ipaddress;print(ipaddress.ip_address('127.0.0.1'))"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/torsten.landschoff/pip-nameclash/lib/python2.7/site-packages/ipaddress.py", line 165, in ip_address
    ' a unicode object?' % address)
ipaddress.AddressValueError: '127.0.0.1' does not appear to be an IPv4 or IPv6 address. Did you pass in a bytes (str in Python 2) instead of a unicode object?

In reality we build using fresh virtualenvs each time and it seems like the installation order is deterministic. So our customers weren't affected. :relieved:

However we spent quite some time to find out what as going on when running the application on my machine, because somehow I got the dependencies installed in the wrong order.

I could understand that this feature is missing if pip did not track the installed files. So I looked if maybe pip does already record this information:

(pip-nameclash) torsten.landschoff@horatio:~/pip-nameclash$ find -type f -print0|xargs -0 grep ipaddress.py
Binary file ./lib/python2.7/site-packages/pip/_vendor/ipaddress.pyc matches
Binary file ./lib/python2.7/site-packages/ipaddress.pyc matches
./lib/python2.7/site-packages/pip-9.0.1.dist-info/RECORD:pip/_vendor/ipaddress.py,sha256=wimbqcE7rwwETlucn8A_4Qd_-NKXPOBcNxJHarUoXng,80176
./lib/python2.7/site-packages/pip-9.0.1.dist-info/RECORD:pip/_vendor/ipaddress.pyc,,
./lib/python2.7/site-packages/ipaddress-1.0.18.dist-info/RECORD:ipaddress.py,sha256=0U5GvYMuyiZrV5jScIRHYxFWbwLzM44HNiaY8x57qUA,80156
./lib/python2.7/site-packages/ipaddress-1.0.18.dist-info/RECORD:ipaddress.pyc,,
./lib/python2.7/site-packages/py2_ipaddress-3.4.1.dist-info/RECORD:ipaddress.py,sha256=lCp63yE4Z1-ia3T3qEiljD1CcG10UartIx3nYXXoRzE,74200
./lib/python2.7/site-packages/py2_ipaddress-3.4.1.dist-info/RECORD:ipaddress.pyc,,

Surprise, it is recorded - in a file named RECORD :-)

Question is: Would a merge request be accepted that adds a check for conflicting files? I'd be interested in implementing this as I carry a bit of fear that we could run into this in production.

For now I will be trying to come up with a script that checks the installed files vs. the RECORD information.

pfmoore commented 6 years ago

Question is: Would a merge request be accepted that adds a check for conflicting files?

Yes, a PR would be welcomed. As noted earlier in the thread, please make sure to cater for namespace packages (I don't know much about them myself, but a lot of people use them and they need to be supported).

tlandschoff-scale commented 6 years ago

As noted earlier in the thread, please make sure to cater for namespace packages

Good point. And probably non-trivial. Naive solution would be to ensure that all __init__.py have the same content (or overall allow conflicting files with identical content).

But I bet that will often not be the case.

RonnyPfannschmidt commented 6 years ago

oh, and as a extra fun bit, pip will have to keep track of double installed files if 2 namespace packages install a __init__ and one is uninstalled, the __init__ shouldnt disappear

pfmoore commented 6 years ago

... and at the moment, pip doesn't actually keep track of what's installed. That's done in the RECORD file which is technically package metadata (and so can't be modified for pip's private purposes). There's also the need to deal with packages that weren't installed by pip (e.g. raw setup.py install was used, or conda).

Sorry @tlandschoff-scale - I don't mean to dump loads of objections on you. It would still be good to have a PR for this, but as you can see, it's not as simple as it first seems. Unfortunately, that's far too common a situation in Python packaging :disappointed:

davidedelvento commented 6 years ago

@pfmoore I think packages installed in other ways (e.g. python setup.py install) are totally out of scope here. The issue here is: pip happily overwrites packages it installed, that's the bug. Tracking things installed in other ways could be a nice feature to have, but it's not as dramatically wrong as pip overwriting packages it installed without noticing!

pfmoore commented 6 years ago

Understood, but my point is that pip doesn't do any tracking of what it installed. It relies purely on standardised metadata to tell what files are installed and where (specifically the RECORD file). And that's essentially by design - we want package data (such as what is installed) to be introspectable by anyone, using well-defined standard metadata.

pfmoore commented 6 years ago

... anyway, it's not worth debating the details just yet. Let's wait for someone to submit a PR and we can discuss in the context of actual code.

tlandschoff-scale commented 6 years ago

From what I have found the RECORD file is actually rewritten during installation - but only when a wheel is installed. Which would probably account for 99% of all packages we are installing.

Actually this is done inside the (vendored) distlib package in src/pip/_vendor/distlib/wheel.py. So pip may not do any tracking but some tracking information is (usually) available.

pradyunsg commented 6 years ago

I agree with @pfmoore here. I think it's best to not discuss the details now and to discuss them in the context of some actual implementation, when a PR is made for this by someone.

mhsmith commented 6 years ago

There's a whole load of corner cases / unusual usages that would have to be considered (upgrades, ``--target``` installs, for example)

In the case of --target, things are even more confusing: it doesn't treat existing files and directories the same way as a normal install, because it only works at the granularity of top-level files and directories (#4230).

Like most of the other issues with --target, this behaviour seems to exist, not because it makes sense, but because it was the quickest solution to a problem someone had in the past. As it says in https://github.com/pypa/pip/issues/1489#issuecomment-53972933:

Multiple installs don't say "this is already there" because it isn't. Upgrade won't work because it's not there to upgrade. Editable would probably be a mess, I have no idea how it would work. "Fragile" is the most polite term I can think of for the behaviour...

anntzer commented 6 years ago

Just an observer's comment regarding

the name clash pretty much is the fault ofthe package authors ^^ pip cant help bad packagers and pypi is not curated

Arch Linux's packages are split in two parts: the "official" packages (https://www.archlinux.org/packages/) and the user-provided & uncurated packages (the Arch User Repositories, https://aur.archlinux.org/ https://wiki.archlinux.org/index.php/Arch_User_Repository). The package manager, pacman, can install packages from either list, and does refuse, by default, to let one package overwrite files already present in the system. Of course, upgrades are handled fine.

So lack of curation is not really an argument there IMO.

mhsmith commented 6 years ago

The package manager, pacman, [...] does refuse, by default, to let one package overwrite files already present in the system.

Conda has the same policy.

ncoghlan commented 6 years ago

(Arriving from https://github.com/pypa/python-packaging-user-guide/pull/513)

As noted there, I think more desirable behaviour here would be for pip to:

  1. When installing from a wheel, check whether or not there's an existing local RECORD for the package it is installing
  2. If there is, read it, and emit FutureWarning if it needs to overwrite a file not listed in RECORD (as that's a sign the file actually belongs to a different package)
  3. In a later release (after the assorted edge cases have been worked through), switch that warning to a hard failure of the installation attempt and require a --force option to override it

As a UX enhancement, pip could potentially start caching an SQLite DB somewhere with the reverse lookup from filenames to the corresponding RECORD file, and use that to report which packages are responsible for a file path conflict, but that wouldn't be essential to the base feature of conflict detection.

(Such a DB would also let pip more gracefully handle the shared __init__ case for namespace packages)

pradyunsg commented 6 years ago

@ncoghlan's suggested transition looks good to me.

dhellmann commented 4 years ago

The patch in #8119 uses a slightly different implementation than @ncoghlan suggested. I noticed that the 2 packages in the original bug report do actually have different distribution names, so they end up with different metadata directories and RECORD files. So, I added logic to look for all other distributions with files under the same top level directory as the new wheel and report conflicts, including information about the owning distribution.

pradyunsg commented 4 years ago

8119 is a proposed fix for this issue.

pradyunsg commented 1 year ago

Here's a different idea: The only legitimate usecase we know of, for this, is non-native namespace packages.

In that case, we can reasonably restrict this to only allow overwriting __init__.py files (with the relevant deprecation period) and introduce an error if the file being overwritten isn't an __init__.py file. If we really want, we could go further and restrict the file lengths and also enforce an import of pkgutil or pkg_resources and nothing else via AST parsing.

uranusjr commented 1 year ago

But this still leaves the problem that when one of the packages gets uninstalled, it would break other packages that share the same __init__.py.

PythonCHB commented 1 year ago

Coming to this very late (from: https://discuss.python.org/t/python-module-conflict-discussion/23659)

https://discuss.python.org/t/python-module-conflict-discussion/23659

I think it could be kept fairly simple, taking into account that there may be other sources than PyPi for packages, and there may be tools other than pip for installing.

AFAICT, there are two different potential use cases here:

1) Two different distributions that have nothing to do with each-other use the same package name / same filename. This is fundamentally a problem with uncurated package sources, but at least the user should know something’s up, and have to use a --force flag or some such to continue the install.

2) Cooperating distributions, such as namespace packages – then maybe pip could know that it’s OK to overwrite certain files -- in that case, it would be part of the distribution being installed saying: "I expect that I might be overwriting this file -- and that's OK". This, of course, would open a door to malevolent behavior, but I don't think that's what we're trying to prevent at this point.

NOTE: namespace packages may not be the only use-case -- one I can think of off the top of my head are optional "accelerated" versions of packages. I might want the accelerated version to overwrite the "regular" version in some way.

zooba commented 1 year ago

Any regular upgrade process is going to remove the old package first. There's literally no reason for a package install to need to overwrite any existing file, and the patch ought to be as simple as banning overwrites when extracting.[^1]

The only exception - and it would be part of error handling the failed overwrite, not an eager test - is to allow overwrites for identical contents. The only somewhat legitimate case right now is multiple cooperating packages trying to form a namespace that want to include the __init__.py, and all of those ought to have identical contents (ideally empty, but likely still using the command popularised during setuptools's per-distro install directories). These ought to be identical, and that's the only reasonable property we can test anyway.

So, in the absence of the --force option:

We don't need to fix uninstalls here - they're already busted. Most people I know don't ever uninstall stuff anymore, they just blow away the whole environment and start fresh. In any case, there are workarounds for it, and if pip starts warning on overwrites it'll be even more clear that it's because uninstalls won't work properly.

[^1]: Bearing in mind that the extract/copy process is supposedly going to be completely rearchitected, which is why I stopped my work on optimising it and am not intending to write a patch for this right now.

pfmoore commented 1 year ago

There's --ignore-installed, and while you may disagree with its existence, we have to be aware of it 🙁

And I disagree on uninstalls, as a general concept. If you're simply talking about uninstalls of namespace packages, then fair enough (I don't care much either way) but uninstalls in general are definitely an important thing to have, and "just blow away the environment" isn't a reasonable response to brokenness of uninstalls in general.

zooba commented 1 year ago

There's --ignore-installed, and while you may disagree with its existence, we have to be aware of it 🙁

I wasn't even aware 😄 But I see no reason to not treat that like --force in this context.

uninstalls in general are definitely an important thing to have, and "just blow away the environment" isn't a reasonable response to brokenness of uninstalls in general.

Maybe unreasonable, but it's what people do. That's the only point I was making. I never suggested further breaking or removing uninstall.

All I was really pointing out is that this change doesn't break uninstalls further, so there's no reason to block it on those grounds. (Also worth noting that to properly fix uninstalls, we need to solve the conflict problem first, so this one comes first anyway.)

pradyunsg commented 10 months ago

From a conversation with @jaimergp at PackagingCon today, the way Conda has a flag for how to handle PathConflict:

We can probably add an option like that.

pradyunsg commented 10 months ago

I'm gonna do a bit of a rework of the wheel installation logic, once https://github.com/pypa/installer/issues/153 is completed.

chinaexpert1 commented 7 months ago

Sorry I'm a bit green but how would one know this has occurred prior to some behavior? Is there something I can run before installation to avoid this? And is the environment ruined if I don't? Thanks

bezborodow commented 2 weeks ago

Is this a security risk in that a malicious package can purposefully overwrite a file in another package?

zooba commented 2 weeks ago

It's a security risk if you don't validate the contents of the package you're intending to install.

Whether it then installs malicious code, runs a malicious build-time script, or overwrites existing files is equivalent risk.