Deprecate and remove code execution in pth files

warsaw commented 6 years ago

BPO	33944
Nosy	@mhammond, @warsaw, @brettcannon, @terryjreedy, @jaraco, @ncoghlan, @pitrou, @ericvsmith, @tiran, @nedbat, @aroberge, @methane, @ericsnowcurrently, @takluyver, @zooba, @matrixise, @vedgar, @native-api, @yan12125, @asottile, @ethanhs, @csabella, @miss-islington, @chrisjbillington, @qix-
PRs	python/cpython#10131 python/cpython#12107 python/cpython#12110 python/cpython#15942
Dependencies	bpo-14803: Add feature to allow code execution prior to main invocation

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:
```python
assignee = None
closed_at = None
created_at =
labels = ['3.8', 'type-feature', 'library']
title = 'Deprecate and remove code execution in pth files'
updated_at =
user = 'https://github.com/warsaw'
```
bugs.python.org fields:
```python
activity =
actor = 'lkollar'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation =
creator = 'barry'
dependencies = ['14803']
files = []
hgrepos = []
issue_num = 33944
keywords = ['patch']
message_count = 120.0
messages = ['320246', '320249', '320253', '320266', '320277', '320279', '320283', '320284', '320286', '320287', '320292', '320293', '320342', '320386', '320393', '320724', '320754', '320850', '320997', '321005', '321026', '321125', '321134', '321340', '328488', '328564', '329607', '329764', '329802', '330115', '333235', '333536', '333567', '333568', '333569', '333572', '333591', '333592', '333613', '333637', '333638', '333639', '333640', '333642', '333644', '333645', '333698', '333699', '333705', '333706', '333716', '333997', '334199', '335774', '335926', '336351', '336662', '336705', '336709', '336710', '336711', '336714', '336716', '336721', '336722', '336725', '336726', '336809', '336853', '336856', '336860', '336863', '336875', '336882', '336939', '336944', '336961', '336970', '336983', '336984', '336992', '337064', '337351', '337353', '337354', '337365', '337368', '337370', '337396', '337399', '337406', '337408', '337409', '337410', '337414', '337417', '337418', '337421', '337422', '337424', '337426', '337427', '337430', '337434', '337437', '337438', '337439', '337446', '337920', '337954', '350625', '351861', '351872', '358909', '358915', '358953', '368712', '368732', '371334', '384148']
nosy_count = 31.0
nosy_names = ['mhammond', 'barry', 'brett.cannon', 'terry.reedy', 'jaraco', 'ncoghlan', 'pitrou', 'eric.smith', 'christian.heimes', 'nedbat', 'aroberge', 'ionelmc', 'methane', 'SilentGhost', '__Vano', 'eric.snow', 'takluyver', 'steve.dower', 'matrixise', 'veky', 'Ivan.Pozdeev', 'yan12125', 'Anthony Sottile', 'Michel Desmoulin', 'ethan smith', 'cheryl.sabella', 'lkollar', 'miss-islington', 'Chris Billington', 'Peter L3', 'qix-']
pr_nums = ['10131', '12107', '12110', '15942']
priority = 'normal'
resolution = None
stage = 'patch review'
status = 'open'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue33944'
versions = ['Python 3.8']
```

e9052a66-9d25-42b8-b701-20e674176c81 commented 5 years ago

> Linux distros' approach to handling this is terrible because they dump all their system packages into a single global site-packages, leading to the ever-growing sys.path problem that Barry is concerned about.
>
> However, that's entirely the fault of distro packaging policies, and can be remedied in a far superior way by switching distros to a model where they create a venv per application, and then use .pth files to link in the system packages that they actually want visible to that application.

I'm curious about this, since it doesn't make sense to me. Dumping all packages at the top level in /usr/lib/pythonX.Y/site-packages means exactly zero .pth files. Wouldn't putting each module in its own directory, with all the directories necessary for a given app added to the path of a venv for that app, mean strictly more .pth files and a sys.path as long as the list of dependencies for that app? Whilst this would certainly be more flexible for keeping multiple versions of packages around as required by different apps, I don't see that it would decrease startup time at all: more folders would need to be searched for each import, not fewer, and a recursive hierarchy of .pth files would need to be parsed at startup as each package pulled in the directories of its own dependencies. A flat structure like the one most Linux distros use would seem to be as efficient as you could get, unless you think that searching through a larger list of strings for the right one is slower than opening a tree of .pth files.

fe5a23f9-4d47-49f8-9fb5-d6fbad5d9e38 commented 5 years ago

I have a directory inside my home directory, and inside it I have files with various utilities I have written over the years. So far, whenever I have installed a new version of Python, I have simply put a util.pth into site-packages. If you remove that possibility, what am I supposed to do? Every other solution is either much more complicated, or doesn't enable me to evolve my utilities in place, or both. What am I missing? (My OS is Windows, and shortcuts don't work; I've tried.)
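For reference, the path-only use described here can be reproduced with `site.addsitedir`, which processes the `.pth` files it finds in a directory. This is only a minimal sketch: the folder names (`my_utils`, a stand-in `site-packages`) are made up for illustration and live in a temporary directory.

```python
import os
import site
import sys
import tempfile


def demo_path_only_pth():
    """Show that a .pth line naming a directory ends up on sys.path."""
    saved_path = list(sys.path)
    with tempfile.TemporaryDirectory() as root:
        utils_dir = os.path.join(root, "my_utils")      # hypothetical utilities folder
        site_dir = os.path.join(root, "site-packages")  # stand-in for the real one
        os.mkdir(utils_dir)
        os.mkdir(site_dir)

        # A path-only .pth file: one directory per line, no "import" lines.
        with open(os.path.join(site_dir, "util.pth"), "w", encoding="utf-8") as f:
            f.write(utils_dir + "\n")

        # Process the .pth files in site_dir, as happens for real site dirs.
        site.addsitedir(site_dir)
        found = os.path.abspath(utils_dir) in sys.path

    sys.path[:] = saved_path  # undo the side effect of the demo
    return found
```

Nothing in this file relies on the `import` line behaviour that the issue proposes to deprecate; it only names a directory.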

57a92b3d-4509-44f6-960d-017b524a2a96 commented 5 years ago

FYI I have 3 projects that use pth files to activate various features (an env var is usually the trigger):

https://pypi.org/project/pytest-cov - enables coverage measurement in any subprocess
https://pypi.org/project/manhole - installs a debug service
https://pypi.org/project/hunter - installs a tracer

I wouldn't like them being rendered almost or completely useless by such a hasty change.

Running stuff during startup can be problematic and tricky. For example, I have painfully found out that on Python 2.7 you can completely hose your codecs registry if you try to decode things during startup (before the registry is fully built). But I think it's a fair price for such a powerful feature.

csabella commented 5 years ago

Hello all,

There was a lot of traction on this discussion a month ago and I was wondering if any updates/expectations should be set? Specifically:

There is a PR for a doc change that Terry approved, but wanted another core dev to look at. If there is agreement on the doc change, then perhaps it can be merged for 3.8? If not, then perhaps it can be closed?
There was discussion about creating a PEP and I believe Barry, Jason, and possibly Nick said they wanted to work on it. Has more work been done on that? I'm not trying to push anyone, but I saw on other threads about the virtual whiteboard group being created to get some traction on ideas before PyCon, so I just wanted to put this back on the radar in case you wanted it to generate discussion at the language summit.
I realize that PEPs are needed for any change and even to define what that change might look like, but is there any value in adding PendingDeprecationWarnings for 3.8 if that's a possible action that will happen? As I understand it, it would be easier to remove that warning later instead of delaying any actions from it.

Thanks!

vstinner commented 5 years ago

> I realize that PEPs are needed for any change and even to define what that change might look like, but is there any value in adding PendingDeprecationWarnings for 3.8 if that's a possible action that will happen? As I understand it, it would be easier to remove that warning later instead of delaying any actions from it.

We cannot modify Python before a PEP is approved. It's too early to say whether a PEP removing support for .pth files will be approved. There are too many constraints and use cases.

zooba commented 5 years ago

I took a look at the docs PR, and honestly I don't even get what the "intended" uses of executable code are supposed to be.

The examples are "load 3rd-party import hooks, adjust PATH variable", but the only cases I can think of where you'd need to do these in a .pth file is where your module is a single file. As soon as you have a package with __init__.py, you have a file that can do exactly the same modifications before the module that needs it is imported.

I'd be inclined to limit the doc change to not provide any "valid" uses for this, and just discourage doing anything that takes a long time (most of the text in the PR is fine, IMHO).

And yeah, I'd like to see the arbitrary code execution "feature" removed too.

As for .pth files in general, I'm interested in the scenarios that caused Barry to have to do difficult debugging where "python -m site" wasn't able to help. If they all involved arbitrary code execution, then let's take out the right tumor. But if they somehow manipulated sys.path in a way that looking at sys.path doesn't reveal, then I'd like to know how.

ncoghlan commented 5 years ago

Yep, I completely understand (and agree with) the desire to eliminate the code injection exploit that was introduced decades ago by using exec() to run lines starting with "import " (i.e. "import sys; \<arbitrary code goes here>").

I just don't want to lose the "add this location to sys.path" behaviour that exists for lines in pth files that *don't* start with "import ", since that has plenty of legitimate use cases, and the only downside of overusing it is an excessively long default sys.path (which has far more consistent and obvious symptoms than the arbitrary code execution case can lead to).
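The two behaviours being distinguished can be sketched as a small classifier. This is a simplified reading of the rules described above, not the actual site.py code, and it deliberately executes nothing; the function name is made up for illustration.

```python
def classify_pth_lines(text):
    """Classify each line of a .pth file without executing anything.

    Blank lines and lines starting with "#" are skipped, lines starting
    with "import" followed by a space or tab are the ones that get
    exec()'d, and everything else is treated as a directory to add to
    sys.path.
    """
    result = []
    for raw in text.splitlines():
        if raw.startswith("#") or not raw.strip():
            continue
        if raw.startswith(("import ", "import\t")):
            result.append(("exec", raw))
        else:
            result.append(("path", raw.rstrip()))
    return result
```

A line such as `import sys; <anything>` lands in the "exec" bucket, which is the injection route under discussion; a line such as `/opt/extra` only ever extends sys.path.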

warsaw commented 5 years ago

On Feb 26, 2019, at 05:19, Nick Coghlan \<report@bugs.python.org> wrote:

> I just don't want to lose the "add this location to sys.path" behaviour that exists for lines in pth files that *don't* start with "import ", since that has plenty of legitimate use cases, and the only downside of overusing it is an excessively long default sys.path (which has far more consistent and obvious symptoms than the arbitrary code execution case can lead to).

It’s also very difficult to debug because pth loading usually happens before the user has a chance to intervene with a debugger. This means mysterious things can happen, like different versions of a package getting imported than you expect.

Extending sys.path is a useful use case, but doing so in pth files is problematic.

zooba commented 5 years ago

> Extending sys.path is a useful use case, but doing so in pth files is problematic.

There are 100 other ways to end up in this situation though. Why is *this* one so much worse?

Can you offer an issue you hit that was caused by a .pth file that *wasn't* debuggable by listing sys.path?

warsaw commented 5 years ago

On Feb 26, 2019, at 12:32, Steve Dower \<report@bugs.python.org> wrote:

> There are 100 other ways to end up in this situation though. Why is *this* one so much worse?

Because there’s no good place to stick a pdb/breakpoint to debug such issues other than site.py, and that usually requires sudo.

> Can you offer an issue you hit that was caused by a .pth file that *wasn't* debuggable by listing sys.path?

I don’t remember the details, but yes I have been caught in this trap. The thing is, by the time user code gets called, the damage is already done, so debugging is quite difficult.

This will be alleviated at least partially by deprecating the execution of random code. Maybe just allowing sys.path hacking will be enough to make it not so terrible, especially if, e.g. (and I haven’t checked to see whether this is the case today), python -v shows you exactly which .pth file is extending sys.path.

The issue is discoverability. Since pth files happen before you get an interpreter prompt, it’s too difficult to debug unexpected, wrong, or broken behavior. My opposition would lessen if there were clear ways to debug, and preferably also prevent, pth interpretation.

57a92b3d-4509-44f6-960d-017b524a2a96 commented 5 years ago

> Because there’s no good place to stick a pdb/breakpoint to debug such issues other than site.py, and that usually requires sudo.

Something bad was installed with sudo but suddenly sudo is not acceptable for debugging? This seems crazy.

How exactly are pth files hard to debug? Are those files hard to edit? They sure are, but the problem ain't the point where they are run, it's the fact that a big lump of code is stuffed on a single line. Let's fix that instead!

I've written pth files with lots of stuff in them, and my experience is quite the opposite - they help with debugging. A lot. It's an incredibly useful python feature.

> I don’t remember the details, but yes I have been caught in this trap.

Maybe if you remember the details we can discuss what are the debugging options, and what can be improved.

cbf13ede-eda8-4246-abee-98732ce73413 commented 5 years ago

On 26.02.2019 23:37, Barry A. Warsaw wrote:

> My opposition would lessen if there were clear ways to debug, and preferably also prevent, pth interpretation.

Easy. Insert a chunk into site.py that would call pdb.set_trace() if an envvar (e.g. `PYSITEDEBUG`) or a command line switch is set.

Actually, why can't whoever has this problem add such a chunk themselves? Is this really such a frequent and ubiquitous problem that this needs to be in the stock codebase? I suspect we're dealing with a vocal minority here.
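A minimal sketch of such a chunk, assuming the variable is called `PYSITEDEBUG` as suggested above. That variable is not something CPython recognizes, and the function name and the `environ`/`hook` parameters are hypothetical, added only so the logic can be exercised without dropping into a real debugger.

```python
import os
import pdb


def maybe_debug_site(environ=None, hook=pdb.set_trace):
    """Call the debugger hook if PYSITEDEBUG is set to a non-empty value.

    PYSITEDEBUG is the name proposed in this thread, not an existing
    CPython environment variable.
    """
    if environ is None:
        environ = os.environ
    if environ.get("PYSITEDEBUG"):
        hook()
        return True
    return False
```

Placed at the point in site.py where .pth processing starts, a call like this would give a breakpoint before any .pth line runs; whether that belongs in the stock codebase is exactly the open question.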

zooba commented 5 years ago

Barry is a steering council member now, so by definition he's 1/5th of the loudest possible minority ;)

I am totally okay with adding more diagnostics here. Frankly, if "-v" doesn't currently log info about .pth files (or other things that the site module does when it's active) then we should just do that.

warsaw commented 5 years ago

On Feb 26, 2019, at 12:52, Ionel Cristian Mărieș \<report@bugs.python.org> wrote:

> Something bad was installed with sudo but suddenly sudo is not acceptable for debugging? This seems crazy.

Your sudo may not be my sudo. :) Let’s say I update my Ubuntu desktop and a new version of a package with a pth breaks. Maybe I didn’t even know I was doing that, via automated updates, or management portal, etc. Now a poor user who depends on this has their code break. How do *they* debug the problem?

FWIW, sudo pip install should just be banned IMHO :).

> How exactly are pth files hard to debug? Are those files hard to edit? They sure are, but the problem ain't the point where they are run, it's the fact that a big lump of code is stuffed on a single line. Let's fix that instead!

For sure. But here’s the thing: you need to know *which* pth file is problematic. Which means you have to debug the entire startup process where pth files are loaded. That means you’re not really debugging pth files themselves (often), but site.py. Debugging site.py for an installed Python is not trivial. Hopefully you are at least not squeamish about editing a system file and breaking Python worse than the original bug. \<wink>

warsaw commented 5 years ago

On Feb 26, 2019, at 13:23, Ivan Pozdeev \<report@bugs.python.org> wrote:

> Easy. Insert a chunk into site.py that would call pdb.set_trace() if an envvar (e.g. `PYSITEDEBUG`) or a command line switch is set.
>
> Actually, why can't whoever has this problem add such a chunk themselves? Is this really such a frequent and ubiquitous problem that this needs to be in the stock codebase? I suspect we're dealing with a vocal minority here.

Basically yes, I’ve done this. But think of the poor user who doesn’t have that expertise or ability to hack on an installed Python’s site.py file. When their application breaks because some faulty pth was installed behind their back, how do they debug their application when the breakage has already occurred before Python even gets to their code? How do they answer questions like “where did that magical sys.path entry come from?” or “how did that module get in sys.modules already?”

57a92b3d-4509-44f6-960d-017b524a2a96 commented 5 years ago

On Wed, Feb 27, 2019 at 1:31 AM Barry A. Warsaw \<report@bugs.python.org> wrote:

> Your sudo may not be my sudo. :) Let’s say I update my Ubuntu desktop and a new version of a package with a pth breaks. Maybe I didn’t even know I was doing that, via automated updates, or management portal, etc. Now a poor user who depends on this has their code break. How do *they* debug the problem?

Well that's easy:

update my Ubuntu desktop -> stuff breaks -> rollback/downgrade
automated updates -> stuff breaks -> stop using them, and learn lesson ;)
management portal -> stuff breaks -> complain to sysadmin

Desktop users don't need to debug problems, devs/sysadmins do. They have sudo.

> FWIW, sudo pip install should just be banned IMHO :).

Let's also ban ctypes and threads, right? :)

> For sure. But here’s the thing: you need to know *which* pth file is problematic. Which means you have to debug the entire startup process where pth files are loaded.

How many pth files could one have? 2-3 ... 5 at most. Just locate .pth and rename the biggest one till the problem goes away.

57a92b3d-4509-44f6-960d-017b524a2a96 commented 5 years ago

On Wed, Feb 27, 2019 at 1:41 AM Barry A. Warsaw \<report@bugs.python.org> wrote:

> Basically yes, I’ve done this. But think of the poor user who doesn’t have that expertise or ability to hack on an installed Python’s site.py file. When their application breaks because some faulty pth was installed behind their back, how do they debug their application when the breakage has already occurred before Python even gets to their code? How do they answer questions like “where did that magical sys.path entry come from?” or “how did that module get in sys.modules already?”

Aren't these sorts of questions answered by using strace python -v or similar? What information is missing, exactly?

055c9f14-66bf-4e51-ac4c-bf91dab64f6d commented 5 years ago

+1 for python -v listing .pth files found and loaded.

For debugging, I just add a line like: import sys; print('Loading mypth.pth') to the start of the pth file. A plain print doesn't work(?). breakpoint() doesn't work(?). It would be nice to be able to get the filename (__file__ is site.py).
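A plain `print(...)` or `breakpoint()` line is not executed because only lines that start with `import` are run; anything else is treated as a path. Until `-v` reports .pth files, a small script can at least list them and show which ones contain executable lines. This is a rough sketch, and the function name is made up for illustration.

```python
import os


def scan_pth_files(directories):
    """Return {pth_path: [executable lines]} for every .pth file found."""
    report = {}
    for directory in directories:
        try:
            names = sorted(os.listdir(directory))
        except OSError:
            continue  # missing or unreadable directory
        for name in names:
            if not name.endswith(".pth"):
                continue
            path = os.path.join(directory, name)
            with open(path, encoding="utf-8", errors="replace") as f:
                # Keep only the lines that would be exec()'d at startup.
                report[path] = [
                    line.rstrip("\n")
                    for line in f
                    if line.startswith(("import ", "import\t"))
                ]
    return report
```

One way to use it is to pass the directories returned by `site.getsitepackages()` plus `site.getusersitepackages()`; an empty list for a file means it only extends sys.path.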

zooba commented 5 years ago

> But think of the poor user who doesn’t have that expertise or ability to hack on an installed Python’s site.py file.

This is actually part of the thinking behind the reportabug tool I started (and why when you format it as raw text you get the listing of everything in any directory on sys.path - mostly because I haven't added a Markdown rendering of that). If the answer is to enhance that and tell users "run reportabug mybrokenmodule and send me the output", well, that's why I put it on GitHub :) https://github.com/zooba/reportabug

I see no reason to hold up adding pth logging to -v, so anyone interested please feel free to do a PR.

The only reason I see to hold up PR 10131 (docs update) is because it documents the rationale for using arbitrary code execution in a pth file. Since we clearly want to get rid of it, I don't think we should in any way rationalize it in the docs.

Once these are done, I think we'll have to reevaluate whether .pth files are actually a problem in their normal behavior, and whether the benefit outweighs the cost. But since we're all agreed that they aren't easy to debug and contain features we all want to get rid of, there's not much point using the current state to do the cost/benefit analysis. Let's fix the bits we can fix first and then see where we stand.

341ce0e2-bdbb-4bd4-99e6-746f11201a3f commented 5 years ago

> contain features we all want to get rid of

I don't think even this is unanimous. Things like registering codecs, instrumenting coverage in subprocesses, etc. all seem like legitimate uses of the arbitrary code execution feature

warsaw commented 5 years ago

On Feb 28, 2019, at 09:40, Anthony Sottile \<report@bugs.python.org> wrote:

> I don't think even this is unanimous. Things like registering codecs, instrumenting coverage in subprocesses, etc. all seem like legitimate uses of the arbitrary code execution feature

Except pth files are a terrible interface for that, given all the other problems, including weird wall-of-code inducing restrictions on what actually gets executed.

I’m in agreement with Steve Dower in principle here. I would like to see a solution that deprecates and eventually removes arbitrary code execution in pth files, leaves sys.path extension alone (for now \<wink>), and improves the discoverability and debuggability of magical pth files.

What I think Anthony is looking for are ways to register “start up functions” that get executed automatically when the Python interpreter starts up. Perhaps somewhat analogous to atexit functions? But if we’re going to officially support a feature like that, I think a PEP would be the right vehicle to suss out all the gory details, like, should these things be global across all invocations of the interpreter, how a user or application would disable that, how would bugs in start up functions get discovered, reported, and debugged, what if any execution order guarantees should be made, etc.

341ce0e2-bdbb-4bd4-99e6-746f11201a3f commented 5 years ago

> What I think Anthony is looking for are ways to register “start up functions” that get executed automatically when the Python interpreter starts up

yes, this is what I want to still exist :)

my hope is that there's a clear standards-track replacement *before* deprecating .pth (which currently satisfies my use cases for startup functions)

eb193824-a002-4e53-a73b-be75738ef3f2 commented 5 years ago

On second thought, the inability to debug code that runs at startup, before user code ever gets control, is a fundamental issue (this problem arises for any software that has startup code), so such a facility in the stock codebase has merit.

zooba commented 5 years ago

The sitecustomize.py file is totally available, and the only limitation there is packages can't inject themselves into it on installation. And if you want to trigger it on a package import then you totally can (though there's *another* discussion about that being a bad idea).

.pth files really only satisfy the "run at startup because I'm a dependency of something that my user wants and don't make them opt-in to my changed behaviour", which I don't like :)

If encodings need to be available without an explicit import, sure, we can add a point for those. Import hooks can always be injected by a package __init__.py before the importer will try and resolve the module, so nothing is needed there. But having a PEP with specific use cases to argue about is the way to create new mechanisms here. I don't agree we need a solution before declaring that the old way should be avoided and will eventually be removed, provided we don't add noisy warnings until there's an alternative.

cbf13ede-eda8-4246-abee-98732ce73413 commented 5 years ago

On 01.03.2019 3:58, Steve Dower wrote:

> Import hooks can always be injected by a package __init__.py before the importer will try and resolve the module, so nothing is needed there.

I thought the flaw in this reasoning in https://bugs.python.org/issue33944#msg320277 was obvious and didn't want to bother people refuting it. Apparently not.

To do anything in __init__.py, that __init__.py itself needs to be already importable. This very well may not be the case -- in fact, import hooks were designed specifically for the scenarios where this is not the case.

Imagine e.g. loading modules from a cloud storage (why not?) -- so nothing on the system at all except the hook. Or, suggested earlier in this ticket, a union namespace where the code to import needs to be constructed on the fly.
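To make the point concrete, here is a rough sketch of such a hook built on `importlib.abc.MetaPathFinder`. The in-memory dict stands in for the cloud storage, and the class and module names are invented for illustration.

```python
import importlib.abc
import importlib.util
import sys


class InMemoryFinder(importlib.abc.MetaPathFinder, importlib.abc.Loader):
    """Serve modules from a dict of source strings (stand-in for remote storage)."""

    def __init__(self, sources):
        self.sources = sources  # {module_name: source_code}

    def find_spec(self, fullname, path=None, target=None):
        if fullname in self.sources:
            return importlib.util.spec_from_loader(fullname, self)
        return None  # let the other finders try

    def create_module(self, spec):
        return None  # fall back to the default module creation

    def exec_module(self, module):
        # Run the stored source in the new module's namespace.
        exec(self.sources[module.__name__], module.__dict__)
```

Until something appends an instance to `sys.meta_path`, a module served only by this finder cannot be imported at all, so there is no `__init__.py` of its own that could do the registering. Something already on disk has to run first.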

> .pth files really only satisfy the "run at startup because I'm a dependency of something that my user wants and don't make them opt-in to my changed behaviour"

Startup code (custom or not) is not a dependency of anything. It rather customizes the environment in which the program specified by the user would run, _before_ any user code could be allowed to get control. It is not a part of the program to be run but rather of the environment that the user wants, and it needs to be implicit so the user can use the same commands and code (compare venv). This is a required feature because the stock Python startup logic cannot possibly provide all the customizations that a user may need (compare initrd).

.pth's are equivalent to sitecustomize but allow the user to manage the set of code chunks automatically using the packaging infrastructure (compare .d directories in UNIX). The fact that this feature is mixed up with and often supplements "real packages" that a program would explicitly use is actually incidental: a package with a .pth does not need to have any functionality intended for explicit use.

> which I don't like

If you don't like something, there's always a specific reason -- though you may not understand it consciously. So the way to go is dig into it, find out what specific speck is putting you off -- only then can you be sure that you are concentrating on the right thing and won't throw the baby out with the bathwater. Try to change one trait in your mind's eye leaving all else intact -- will the feeling go away? If it will, you are on the right track; can the trait you chose be split further? You know you found it when you can't change any further part and change the feeling and you can say with confidence how exactly what it's doing misaligns with your moral compass.

We already identified a few real reasons: hard to see, hard to debug, encourages unreadable code, run in arbitrary order when the order matters (and IIRC I provided fixes for all). What else?

cbf13ede-eda8-4246-abee-98732ce73413 commented 5 years ago

On 01.03.2019 20:27, Ivan Pozdeev wrote:

> The fact that this feature is mixed up with and often supplements "real packages" that a program would explicitly use is actually incidental: a package with a .pth does not need to have any functionality intended for explicit use.

Eureka! So, there are actually two kinds of packages: "functional packages" to be used explicitly and "environment packages" to customize the execution environment. The infrastructure just doesn't distinguish between them and allows a package to combine both types of functionality for convenience.

By this logic, pywin32's .pth is effectively a private import hook to allow for its nonstandard structure. It could be in a separate "environment package" that would be a dependency but that would complicate things for no real gain.

The caveat with "environment packages" is that there are no predefined dependencies among them, or between them and "functional packages". Their required execution order rather depends on the user's needs. E.g. the order of import hooks' registration would matter if more than one can serve a specific name, and the user may prefer any of the options; whether some import hook is required to import some installed packages depends on the way they are installed.

This is the same with any other plugin functionality, too. And I'm not aware of any general solution because a solution is very situational. The best we can do here that I see is to allow the user (or, you guessed it, yet another "environment package" for manageability) to specify load order dependencies between .pth's.

341ce0e2-bdbb-4bd4-99e6-746f11201a3f commented 5 years ago

I don't have time to look through the data today, but I wrote a script to collect the usages of .pth from PyPI. I realized after I ran it that I skipped source distributions with a .zip extension, but otherwise it's pretty complete:

https://github.com/asottile/pth-file-investigation

There are ~132 packages using .pth features (not including setuptools namespace packages which I had to exclude since there were so many of them). I was planning to classify these but didn't have time to do so.

Some "highlights" from scrolling through the list: two of them are mine (future-breakpoint, future-fstrings), at least one is Guido's (pyxl3), and ruamel's namespace-packaging appears to use .pth (ruamel.* (12 packages)).

warsaw commented 5 years ago

On Mar 1, 2019, at 09:27, Ivan Pozdeev \<report@bugs.python.org> wrote:

> Startup code (custom or not) is not a dependency of anything. It rather customizes the environment in which the program specified by the user would run, _before_ any user code could be allowed to get control. It is not a part of the program to be run but rather of the environment that the user wants, and it needs to be implicit so the user can use the same commands and code (compare venv). This is a required feature because the stock Python startup logic cannot possibly provide all the customizations that a user may need (compare initrd).
>
> .pth's are equivalent to sitecustomize but allow the user to manage the set of code chunks automatically using the packaging infrastructure (compare .d directories in UNIX). The fact that this feature is mixed up with and often supplements "real packages" that a program would explicitly use is actually incidental: a package with a .pth does not need to have any functionality intended for explicit use.
>
> We already identified a few real reasons: hard to see, hard to debug, encourages unreadable code, run in arbitrary order when the order matters (and IIRC I provided fixes for all). What else?

The fact that .pth files are global and affect the entire Python installation. That’s not so bad in venvs where we have environmental isolation, but it’s really bad (IMHO) for the global Python interpreter. Right now, there’s no control over the scope of such environmental customizations; it’s all or nothing. Applications should be able to opt in or out of them, just like they can with individual packages (which must be imported in order to affect the interpreter state). The trick then is how to define opt-in for applications *before* the interpreter gets to user code.

zooba commented 5 years ago

Barry's response in https://bugs.python.org/issue33944#msg336970 is exactly what my response to that point was going to be.

Just because I want to use package spam and it wants to use package eggs doesn't mean that eggs gets to enable cloud imports (or anything else similarly magical) automatically. If I want that, it can provide it and tell me to call it in my code, or it can do it when needed. Neither of those options requires arbitrary code execution in a .pth file.

cbf13ede-eda8-4246-abee-98732ce73413 commented 5 years ago

On 02.03.2019 2:25, Barry A. Warsaw wrote:

> The fact that .pth files are global and affect the entire Python installation. \<...> Right now, there’s no control over the scope of such environmental customizations; it’s all or nothing.

That's the entire purpose of "customizing the environment in which the program specified by the user would run". A customization can very well be implemented to be application-specific but it doesn't have to. Python was never designed to isolate modules from each other (an "application" as you say it is just another module) -- on the contrary, the amount of power it gives the user over the code that they don't control is one of its key appeals. A Python installation acts as a unit where anything can affect anything else, and the order is maintained with https://en.wikipedia.org/wiki/Soft_security .

So, if you need a compartmentalized application, a regular Python installation is the wrong tool for the job. Compartmentalization comes at the price of:

rampant code duplication ('cuz if you actively distrust external code, you have to bring all the code you need with you) and all its corollaries (no automatic security fixes and modernized semantics; no memory and disk space economy from shared library reuse)
so compartmentalization is absolutely impossible within a shared environment: anything that you use can be changed by the user (e.g. to satisfy the requirements of something else, too)
lack of interoperability (how many Android apps do you know that can use each other's functionality?).

Venv does a pretty good job of providing you with a private copy of any 3rd-party modules you require but not the envvars, the interpreter and the standard library (and any OS facilities they depend on). If you require a harder barrier between your app and the rest of the system and/or wish to actively prevent users from altering your application, you'll have to use a private Python installation (e.g. in /opt), or hide it from everyone with the likes of Pyinstaller, or an OS-level container, or a VM... or just drop the pretense and go SaaS(S) (that'll teach those sneaky bastards to mess with my code!).

> Applications should be able to opt in or out of them, just like they can with individual packages (which must be imported in order to affect the interpreter state).

Right on the contrary. To decide what environment an application shall be run in is the user's prerogative. The application itself has absolutely no business meddling in this. All it can do is declare some requirements for the environment (either explicitly or implicitly by making assumptions) and refuse to work or malfunction if they are not met (and the user is still fully within their right to say: "screw you, I know what I am doing" -- and fool the app into thinking they are met and assume responsibility for any breakages).

With "individual packages", it's actually completely the same: the app can decide which ones it wants to use, but it's the user who decides which ones are available for use!

warsaw commented 5 years ago

On Mar 1, 2019, at 19:59, Ivan Pozdeev <report@bugs.python.org> wrote:

> Ivan Pozdeev <ivan_pozdeev@mail.ru> added the comment:
>
> On 02.03.2019 2:25, Barry A. Warsaw wrote:
>
> > The fact that .pth files are global and affect the entire Python installation. \<...> Right now, there’s no control over the scope of such environmental customizations; it’s all or nothing.
>
> That's the entire purpose of "customizing the environment in which the program specified by the user would run". A customization can very well be implemented to be application-specific but it doesn't have to. Python was never designed to isolate modules from each other (an "application" as you say it is just another module) -- on the contrary, the amount of power it gives the user over the code that they don't control is one of its key appeals. A Python installation acts as a unit where anything can affect anything else, and the order is maintained with https://en.wikipedia.org/wiki/Soft_security .

So I just come at it from a different angle (I think Steve and I are aligned).

Here’s a very real use case about the dangers. I use my Linux package manager to install a bunch of applications (I don’t totally agree with the “an application is just another package”). I don’t even know that they are Python applications, they’re just tools that do something I like. Now I have an idea for some cool Python thing to hack on and I just install a few development libraries with my package manager. Maybe those libraries come from a secondary repo that has a different level of scrutiny. Or maybe I think, hey what’s the harm to just sudo pip install a few things (yes, people do this all the time ;).

Subtly, under the hood, one of those transitive dependencies down the stack installs some .pth file that executes some arbitrary code and breaks some of those distro-provided applications. And let's say I don’t notice weird things happening for a week. Now I think “whoa! how did that application break? I didn’t change it at all”. Not only did I mysteriously break things I relied on, but unless I’m an expert Pythonista and I know how to debug site.py, I’ve got almost no hope of fixing the problem by myself (SO to the rescue?). If I do manage to diagnose the problem, I’ll have to go and uninstall the bad package, and I *should* report things to my distro or upstream. Of course, upstream may say that it’s critical functionality to their library so too bad for you.

I’m not even making that up. :)

Sure, maybe the very concept of a distro-wide Python application is a mistake, but it’s what we have now, and it’s not going away.

> > Applications should be able to opt in or out of them, just like they can with individual packages (which must be imported in order to affect the interpreter state).
>
> Right on the contrary. To decide what environment an application shall be run in is the user's prerogative. The application itself has absolutely no business meddling in this.

Again, I just look at this from a different perspective. The user probably doesn’t even know all the environmental factors that affect the operation of their applications, and changes in that environment can happen without the user’s knowledge. All they’re going to know is that application X which is critical to their work has just broken. Sadly, the engineer looking into the bug they filed on it will not be able to reproduce the problem. And now nobody is happy. :)

> With "individual packages", it's actually completely the same: the app can decide which ones it wants to use, but it's the user who decides which ones are available for use!

It’s actually not the same, and that’s the point. An application won’t ever import a library that actively harms it. But it has no such guards against changes to the environment — any environment, including magical Python code.

cbf13ede-eda8-4246-abee-98732ce73413 commented 5 years ago

On 02.03.2019 9:01, Barry A. Warsaw

In all the cases you've described, Python is no different from any other Linux software. E.g. I can install something into /etc/profile.d that would break the shell or set an envvar that would change the behavior of standard utilities. This is by design: Linux is designed for maximum interoperability, so there's only one of each component in the system and everything uses it whenever it needs that kind of functionality. It does support multiple versions of the same software, but it's a compromise that significantly complicates maintenance (primarily how to disambiguate them when something requests just "component X"), so they strive to avoid it whenever possible. Likewise, complete freedom for root to wreak havoc in the system is also by design: distro maintainers only test and support official packages; anything else you use is either your responsibility or an app supplier's if they provide official support (and are within their right to deny support if you tweak the environment beyond their support promise) -- same as for any other software as well.

This is not even specific to .pth files, either, so you won't really eliminate the problem by removing them. You can break any other part of Python in subtle ways just as well -- e.g. overwrite or override binary files with incompatible ones, causing segfaults in random places (https://stackoverflow.com/q/51816639/648265 ).

Now, Linux does have "lower tier environments" that don't automatically affect "higher tiers". 1) Software installed into /usr/local doesn't hijack system scripts thanks to absolute paths in their shebangs; software in /opt is not on PATH at all; 2) /etc/profile and bashrc are only executed by login shells and interactive shells, not by scripts, limiting their effect to processes created within a user session; 3) anything within a user's profile or run as a regular user (including ~/.bash) doesn't affect system-wide settings and processes run as root.

Blindly replicating 2) won't do for Python, however. Unlike Bash which has all the functionality compiled in, Python has an external standard library and arbitrary additional packages. They both are essential for its operation as a system component that other software can use without additional manipulations, AND Python gives the user freedom on how to arrange them in the system. So there must be a way to provide any "additional manipulations" that may be needed that the built-in startup logic doesn't have. From administration POV, any such startup logic is a part of the core offer to the system: core files+libraries+connecting logic = Python system component, so it must be invoked whenever Python is invoked. And we do already have ways to apply startup code only to a "lower-tier environment" if such a need arises: user-specific -- user site; interactive-specific -- PYTHONSTARTUP. There's no such thing as a "login shell" for Python but there's Python run in a user session; /etc/profile* can set envvars that would apply only there.

So it seems to me that what you are asking for is "/etc/profile.d for Python". When designing such a feature, note, however, that the concept of login sessions is completely alien to Python. I believe a way to provide an additional site-packages directory will do (I can't readily see an already available way to do so in https://docs.python.org/3/using/cmdline.html ).

341ce0e2-bdbb-4bd4-99e6-746f11201a3f commented 5 years ago

I did my best to classify those on PyPI that were using .pth files. My initial search had quite a few false positives (and now that I look at it, it completely missed .zip-based source distributions, so there are likely some false negatives as well).

Here's the summary of the categorizations:

    $ cut -d, -f2 < data.csv | sort | uniq -c
          2 backport
          4 coverage
          4 debugging
          2 demo
          9 encoding
          7 except-hook
         58 false-positive
          6 import-hook
         20 module-layout
         20 monkeypatch

I realized about halfway through that "monkeypatch" was probably too broad a category, but I continued with it through all of them. The monkeypatch category contains a few classes of things: fixing third party libraries, disabling ssl (yikes!), adding some "features" to builtins / stdlib modules -- which unfortunately I didn't really classify properly.

There was a single .pth file that I deemed "malicious" since it completely breaks the subprocess module (subprocess-run) but other than that they all seemed ~mostly not the worst.

A lot of the module-layout ones could be solved with things provided directly by setuptools, or just by rearranging their distribution's files.

The raw data is available in csv: https://github.com/asottile/pth-file-investigation/blob/master/data.csv

57a92b3d-4509-44f6-960d-017b524a2a96 commented 5 years ago

> There was a single .pth file that I deemed "malicious" since it completely breaks the subprocess module (subprocess-run)

It only seems to set an attribute. What's wrong with that? Does the early import of subprocess cause problems?

341ce0e2-bdbb-4bd4-99e6-746f11201a3f commented 5 years ago

> > There was a single .pth file that I deemed "malicious" since it completely breaks the subprocess module (subprocess-run)
>
> It only seems to set an attribute. What's wrong with that? Does the early import of subprocess cause problems?

It assigns subprocess.run, which is an api in python3.5+. In those versions, subprocess.check_* is implemented in terms of subprocess.run. The subprocess.run provided by that package has a different api than the stdlib one so any use of the subprocess module is broken just by having that package installed
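To make the breakage concrete, here is a minimal sketch. It is not that package's actual code: `incompatible_run` is a made-up stand-in with a different signature, and the demo relies on CPython's `subprocess.check_output` calling the module-level `run` (the one `check_*` helper I am sure is written that way).

```python
import subprocess
import sys
from unittest import mock


def incompatible_run(command):
    """Hypothetical replacement: accepts none of the stdlib keyword arguments."""
    return command


def check_output_breaks():
    """Return True if check_output fails once subprocess.run is swapped out."""
    # mock.patch.object restores the real subprocess.run when the block exits.
    with mock.patch.object(subprocess, "run", incompatible_run):
        try:
            subprocess.check_output([sys.executable, "-c", "print('hi')"])
        except TypeError:
            # check_output passes stdout=..., check=True, etc. to run().
            return True
    return False


if __name__ == "__main__":
    print("check_output broken by the patch:", check_output_breaks())
```

The calling code never touches `subprocess.run` itself, and it still fails.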

warsaw commented 5 years ago

On Mar 6, 2019, at 19:04, Anthony Sottile <report@bugs.python.org> wrote:

> It assigns subprocess.run, which is an api in python3.5+. In those versions, subprocess.check_* is implemented in terms of subprocess.run. The subprocess.run provided by that package has a different api than the stdlib one so any use of the subprocess module is broken just by having that package installed

Doesn’t that kind of prove my point? :)

341ce0e2-bdbb-4bd4-99e6-746f11201a3f commented 5 years ago

> Doesn’t that kind of prove my point? :)

It's not any worse than gevent ~breaking~ monkeypatching almost the entire standard library. And to be fair to the author, it was created well before (2013-06-21) python3.5's run api existed (2015-04-14)

It's also the only problematic package that I could find -- if anything it's an indication that this feature is used (almost entirely) for good and without issue.

57a92b3d-4509-44f6-960d-017b524a2a96 commented 5 years ago

> Doesn’t that kind of prove my point? :)

So basically you'd remove the whole feature just cause one package no one installs abuses it. Doesn't make sense.

zooba commented 5 years ago

There are two features here, let's be clear about what we're removing.

1. extending sys.path with static (perhaps relative) directories
2. arbitrary code execution (following "import " statements)

Only Barry wants to remove the first one, and the rest of us will push back hard enough to keep him in check ;)

Basically everyone wants to remove the second one, but we can't do that until there is replacement functionality for its legitimate use cases.
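For anyone following along, a minimal sketch of both features, assuming a CPython whose site module still executes "import" lines (true as of this discussion). The file name `demo.pth`, the `extra` directory and the `PTH_DEMO_RAN` variable are made up for the demo; `site.addsitedir()` is the stdlib call that processes the .pth files in a directory.

```python
import os
import site
import sys
import tempfile


def demo_pth_features():
    """Process a throwaway .pth file and report what it changed."""
    with tempfile.TemporaryDirectory() as sitedir:
        extra = os.path.join(sitedir, "extra")
        os.mkdir(extra)
        with open(os.path.join(sitedir, "demo.pth"), "w", encoding="utf-8") as f:
            # Feature 1: a plain line names a directory to append to sys.path.
            f.write("extra\n")
            # Feature 2: a line starting with "import " is executed as code.
            f.write("import os; os.environ['PTH_DEMO_RAN'] = '1'\n")

        saved_path = list(sys.path)
        try:
            site.addsitedir(sitedir)
            wanted = os.path.normcase(os.path.abspath(extra))
            path_extended = any(os.path.normcase(p) == wanted for p in sys.path)
            code_ran = os.environ.pop("PTH_DEMO_RAN", None) == "1"
        finally:
            # Undo the sys.path changes so the demo leaves no trace.
            sys.path[:] = saved_path
    return path_extended, code_ran


if __name__ == "__main__":
    print(demo_pth_features())
```

The second line of the file could just as well have done anything else an interpreter can do, which is the part under discussion.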

Looking at Anthony's list (and making some assumptions about what the titles mean), I'd propose that only encodings require a way to register them from an installed package. And maybe this is as simple as making "encodings" a namespace package?
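As a rough illustration of why encodings are the odd one out: a codec only becomes usable once a search function has been passed to `codecs.register()`, and that has to happen before the first lookup anywhere in the process. In this sketch `pthdemo` is a hypothetical codec name that simply aliases UTF-8; a real package would return its own `CodecInfo`.

```python
import codecs


def find_demo_codec(name):
    """Codec search function: answer only for the hypothetical name 'pthdemo'."""
    if name == "pthdemo":
        # Reuse the UTF-8 codec so the sketch stays self-contained.
        return codecs.lookup("utf-8")
    return None


# This is the step that currently tends to live in a .pth "import" line.
codecs.register(find_demo_codec)

encoded = "naïve".encode("pthdemo")
```

Without the `codecs.register()` call, the `encode()` line raises `LookupError`, no matter which package is installed.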

For the others:

backport, demo - no idea what these look like
coverage, debugging, demo, except-hook - application/user responsibility, not a package's
monkey-patching - kill it with fire
import-hook, module-layout - easy enough to work around

(For those who are confused about the last, using a package __init__.py is how to modify these *when your package is actually loaded* and not on startup.)

57a92b3d-4509-44f6-960d-017b524a2a96 commented 5 years ago

> coverage, debugging, demo, except-hook - application/user responsibility, not a package's

Elaborate please, as it sounds like you're simply dismissing my usecase.

341ce0e2-bdbb-4bd4-99e6-746f11201a3f commented 5 years ago

I think nearly all of the use cases in the packages are valid (except module-layout) -- or at least if this feature were removed without having a startup-time site-packages code execution feature there would be no possible replacement. I'll elaborate a little more on the titles I've chosen:

backport: provide features that are not available to that python version, but were ratified peps in later versions. These necessarily must happen at startup so as to affect the application being used
demo: these are fine to ignore, the two packages that were classified here were merely demoing how .pth files can be packaged with setuptools
coverage: almost all of these were "automatically instrument coverage in subprocesses under test", basically the need to enable coverage tracing in subprocesses triggered by the application under test. It is not possible to do this in any other way than an initialization hook in the interpreter (or monkeypatching the subprocess module, which I'd argue is significantly worse than what this is doing)
debugging: these provide additional introspection tools to analyze an application, these also need to be interpreter level as you cannot customize code outside of your control but may need to debug such code.
except-hook: these seem necessary as well, from the few I looked into in more detail they seemed to be setting hooks such that $foreign-application could be used within another framework -- looking very similar to ubuntu's sitecustomize.py which sends traces to apport on crash (bug reporting for python-based packages). If you had ownership of this application sure you could add an except hook, but these seem to be for cases where you do not control the application
monkeypatch: I don't think we should be so swift to banish this category, sure the name is scary but there were many legitimate cases here. Many of these were to patch limitations in packages outside of control (dead, no longer accepting patches, not willing to support other platforms, etc.). The patches necessarily happen at startup because there's no other place to influence the code of these third party tools. Don't get me wrong, monkeypatching is usually bad, but I don't think there would be an alternative to how these tools function if this feature were removed.
import-hook: I also don't see an easy way to work around these, most of these added alternate filetypes that python could import, but you need something to make importing work in the first place

> Basically everyone wants to remove the second one, but we can't do that until there is replacement functionality for its legitimate use cases.

Without a poll I don't think assuming a majority is fair ;)

zooba commented 5 years ago

> Elaborate please, as it sounds like you're simply dismissing my usecase.

I'm suggesting that to enable this functionality at startup, the user/application should have to do something like executing code or setting PYTHONSTARTUP.

What I'm dismissing is that "pip install some-package" can define a global startup task for your interpreter. I shouldn't get debugging or code coverage enabled every time I run "python" just because I installed some package - I should have to start that package somehow.

zooba commented 5 years ago

> I don't think there would be an alternative to how these tools function if this feature were removed.

Right now, maybe, which is why we haven't just removed it :)

The point of the discussion is to say "this functionality is irreplaceable so we need to design a replacement". If a package can't do monkeypatching when imported for some reason, we should explore what those reasons are and provide a supported way to achieve their goals (or document the existing ways).

zooba commented 5 years ago

Here's a trivial workaround for the import hook problem:

Assume we have "my_module.foo", and the import hook enables importing foo files.

Instead of just shipping "my_module.foo", you ship "my_module.py" and "_my_module.foo", where "my_module.py" looks like:

    import my_import_hook
    my_import_hook.install()

    from _my_module import *

This really isn't hard to do. As a bonus, you don't even need a full import hook anymore - you can use any kind of loader you want. And it should be fully backwards compatible (assuming special tricks weren't part of your public API), so your users won't even notice the upgrade.

341ce0e2-bdbb-4bd4-99e6-746f11201a3f commented 5 years ago

> What I'm dismissing is that "pip install some-package" can define a global startup task for your interpreter. I shouldn't get debugging or code coverage enabled every time I run "python" just because I installed some package

At least for the coverage tools they all play nice and require an environment variable to be set for them to take effect. For example, coverage-enable-subprocess requires COVERAGE_PROCESS_START=... in order to start: https://github.com/bukzor/coverage_enable_subprocess/blob/9a0f4df99f0d008eba305c673dfae4269c6c5642/setup.py#L14

> I should have to start that package somehow.

pip install is a pretty good opt-in already imo

> Instead of just shipping "my_module.foo", you ship "my_module.py" and "_my_module.foo", where "my_module.py" looks like:

but that's exactly my point, now you have to ship extra junk python files when it's a way better experience to have the hooks _just work_

warsaw commented 5 years ago

On Mar 7, 2019, at 07:38, Steve Dower <report@bugs.python.org> wrote:

> Steve Dower <steve.dower@python.org> added the comment:
>
> There are two features here, let's be clear about what we're removing.
>
> 1. extending sys.path with static (perhaps relative) directories
> 2. arbitrary code execution (following "import " statements)
>
> Only Barry wants to remove the first one, and the rest of us will push back hard enough to keep him in check ;)

Not true! I’m okay with keeping the path extension feature, albeit with improvements:

Loading of .pth files and path extension should be expressed in verbose (python -v) output
It should be possible to much more easily debug .pth file loading (I believe there is a PR for this but I haven’t had time to look at it yet)
It should be possible to prevent .pth file loading, likely via interpreter switch or environment variable, akin to -s/-S
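Until something like that exists, a small helper can at least show which .pth files would execute code at startup. This is only a sketch: `pth_code_lines` and `default_site_dirs` are names made up here, and the "import " prefix test is my reading of what site.py checks, not an official API.

```python
import os
import site


def pth_code_lines(directories):
    """Return (path, line number, line) for every executable line in *.pth files."""
    found = []
    for directory in directories:
        try:
            names = sorted(os.listdir(directory))
        except OSError:
            continue  # directory missing or unreadable
        for name in names:
            if not name.endswith(".pth"):
                continue
            path = os.path.join(directory, name)
            try:
                with open(path, encoding="utf-8", errors="replace") as f:
                    for lineno, line in enumerate(f, 1):
                        # Lines with this prefix are executed rather than
                        # treated as directories to add to sys.path.
                        if line.startswith(("import ", "import\t")):
                            found.append((path, lineno, line.rstrip()))
            except OSError:
                continue
    return found


def default_site_dirs():
    """Global and per-user site-packages directories known to the site module."""
    dirs = list(getattr(site, "getsitepackages", lambda: [])())
    dirs.append(site.getusersitepackages())
    return dirs


if __name__ == "__main__":
    for path, lineno, line in pth_code_lines(default_site_dirs()):
        print(f"{path}:{lineno}: {line}")
```

Running it as a script prints one line per startup hook found in the current interpreter's site directories.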

warsaw commented 5 years ago

On Mar 7, 2019, at 09:32, Anthony Sottile <report@bugs.python.org> wrote:

> > I should have to start that package somehow.
>
> pip install is a pretty good opt-in already imo

Except that it conflates responsibilities. I may not want to opt into coverage even being loaded in my application because I’m not going to use it and it has a negative impact on my application’s start up time. Yet because you’re on the same machine and you pip installed it, I have no choice but to pay those costs, which I haven’t explicitly opted in to.

341ce0e2-bdbb-4bd4-99e6-746f11201a3f commented 5 years ago

> > > I should have to start that package somehow.
>
> > pip install is a pretty good opt-in already imo
>
> Except that it conflates responsibilities. I may not want to opt into coverage even being loaded in my application because I’m not going to use it and it has a negative impact on my application’s start up time. Yet because you’re on the same machine and you pip installed it, I have no choice but to pay those costs, which I haven’t explicitly opted in to.

At least for the coverage plugins there is a required opt-in via an environment variable (as shown above). Though the startup cost is a good point. Perhaps I'm in the minority but I use virtualenvs for everything so I haven't even been considering the system python.

brettcannon commented 5 years ago

RE: " So basically you'd remove the whole feature just cause one package no one installs abuses it. Doesn't make sense."

No, the point being made is *at least* one package that was found on PyPI is abusing it, so it exists and we need to consider the possibility others are also abusing the feature.

On Thu, Mar 7, 2019 at 10:22 AM Anthony Sottile <report@bugs.python.org> wrote:

> Anthony Sottile <asottile@umich.edu> added the comment:
>
> > > > I should have to start that package somehow.
>
> > > pip install is a pretty good opt-in already imo
>
> > Except that it conflates responsibilities. I may not want to opt into coverage even being loaded in my application because I’m not going to use it and it has a negative impact on my application’s start up time. Yet because you’re on the same machine and you pip installed it, I have no choice but to pay those costs, which I haven’t explicitly opted in to.
>
> At least for the coverage plugins there is a required opt in from environment variable (as shown above).

For the ones you know about. Dealing with abuse of functionality isn't about what common practice is, but what a bad actor may do.

> Though the startup cost is a good point. Perhaps I'm of the minority but I use virtualenvs for everything so I haven't even been considering the system python.

Trust me, from my perspective of the Python extension for VS Code you cannot ignore system installs.

57a92b3d-4509-44f6-960d-017b524a2a96 commented 5 years ago

> > because you’re on the same machine and you pip installed it, I have no choice but to pay those costs, which I haven’t explicitly opted in to.
>
> At least for the coverage plugins there is a required opt in from environment variable (as shown above).

There's a simple if 'COVSOMETHING' in os.environ check to activate it. That can't cost a significant amount of time. This is rather another bad actor argument.
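Roughly, the guard looks like the sketch below. It is not the real plugin code: `startup_hook` and its `start` callback are hypothetical, and COVERAGE_PROCESS_START is just the variable name mentioned earlier in this thread.

```python
import os


def startup_hook(environ=None, start=lambda config: None):
    """Hypothetical body of a .pth "import" line: do nothing unless opted in."""
    environ = os.environ if environ is None else environ
    config = environ.get("COVERAGE_PROCESS_START")
    if not config:
        # The common case: a single dictionary lookup, then return.
        return False
    # Only now pay for the expensive part (importing and starting the tool).
    start(config)
    return True
```

When the variable is unset, the hook never reaches the `start` callback, so the tool itself is not imported.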

python / cpython

Deprecate and remove code execution in pth files #78125