pypa / pipenv

Python Development Workflow for Humans.
https://pipenv.pypa.io
MIT License

What does pipenv need to stop vendoring a patched version of pip? #5379

Open pradyunsg opened 1 year ago

pradyunsg commented 1 year ago

The question says it all really. I'd like for the Python packaging ecosystem to move away from trying to use pip as a library, to broadly reduce the fragility of the tooling.

What are the parts of pip that pipenv tries to use, and what would be necessary to decouple pipenv from pip, so that it can use a shared implementation (eg: build, installer) instead?

pradyunsg commented 1 year ago

Adding a cross reference to https://github.com/pypa/pip/pull/10964

matteius commented 1 year ago

@pradyunsg I'll eventually have more time to elaborate, but my initial take is that this would ultimately be very challenging for both parties, as we use many internal aspects of pip across pipenv and requirementslib. We vendor pip to ensure the internals we integrate against are reliable, and there are even a few patch files we apply: https://github.com/pypa/pipenv/tree/main/tasks/vendoring/patches/patched

A big part of the internals we use is the resolver, for which there is no public interface. We also solved for package confusion attacks with the index restriction patch: https://github.com/pypa/pipenv/blob/main/tasks/vendoring/patches/patched/pip_index_safety.patch I opened a PR against pip for this, but it ran into the same problem of relying on pip internals, so it is not gaining acceptance: https://github.com/pypa/pip/pull/10964

We also have reduced the scope of our other vendor'd packages to reuse other vendor'd libraries provided by pip. Here are some imports we are currently referencing: (two screenshots, not preserved)

Plus a ton in requirementslib which we vendor in: (two screenshots, not preserved)

Basically, it is my strong opinion that there is no great way to allow pipenv to access all of its own dependencies without vendoring them in. Since it manages a virtualenv for the user, we cannot expect the user to have specific versions of packages on their system in order to make pipenv work. Many bug reports were solved by moving imports away from arbitrary system package versions, which may or may not exist, to the vendor'd paths in pipenv, ensuring the same code is provided every time.
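A minimal sketch of the idea behind that last point, assuming a hypothetical dependency called `dep` and a made-up directory layout (this is not pipenv's actual vendoring machinery): a tool that loads its bundled copy from an explicit path gets the same code no matter what version sits in the user's site-packages.

```python
import importlib.util
import tempfile
from pathlib import Path
from typing import Tuple


def load_from_path(name: str, file: Path):
    """Load a module from an explicit file path, bypassing the sys.path lookup."""
    spec = importlib.util.spec_from_file_location(name, file)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


def demo() -> Tuple[str, str]:
    """Return (system version, vendored version) of the hypothetical `dep`."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Hypothetical "system" copy: whatever the user happens to have installed.
        system_copy = root / "site-packages" / "dep.py"
        # Hypothetical vendored copy: the version the tool was tested against.
        vendored_copy = root / "tool" / "vendor" / "dep.py"
        for path, version in ((system_copy, "0.9"), (vendored_copy, "1.4")):
            path.parent.mkdir(parents=True)
            path.write_text(f"__version__ = {version!r}\n")
        system = load_from_path("dep_system", system_copy)
        vendored = load_from_path("tool_vendor_dep", vendored_copy)
        return system.__version__, vendored.__version__
```

The tool only ever imports the second module, so upgrading or removing the system copy cannot change its behaviour.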

pradyunsg commented 1 year ago

I expect this to be a challenging effort. :)

FWIW, it sounds like pipenv should invest in setting up an installer, like https://github.com/python-poetry/install.python-poetry.org which puts pipenv in a location outside of the user-managed environments, in a custom location with a virtualenv for itself, where it can manage its own dependencies.

matteius commented 1 year ago

One thing that would be really helpful is if the pip internals allowed generating metadata pre-build that could be used to generate lock files for systems other than the one being locked on. I spent a bunch of time this month on the problem of transitive dependencies that are targeted at other systems, and the only real workaround I see is to use #4745 to have lock sections for the project, where users only lock a section on the specific platform it is intended for, which is not ideal.

I have a couple additional points for the topic at hand:

  1. pip is also vendoring libraries into itself in a similar way and for similar reasons. Why is it ok for pip to do this but advocate for pipenv to not? (I get the argument that you don't view pip as a library)
  2. Some of pip basically feels like a library of helpful methods and utilities -- they could be provided as a pip library for application developers to make more use of, to solve use cases that aren't intended to be solved by pip. I think that is where pipenv is: we are solving some problems, such as index restricted packages, by using pip internals -- perhaps, though, if there were proper interfaces into the resolver then there would be less need for some of these pip library methods, but I haven't done a full analysis.
  3. Without the packages we use supporting the required patches, it still forces us to vendor those things.

Another thing I would worry about is performance. When we do invoke pip for things it supports, it incurs a performance hit, and additionally the existing interfaces aren't all that logical. I had originally rewritten the batch_install logic to be done in a single pip install phase before learning, for example, that including hashes or not cannot be decided line by line in the requirements.txt; it required invoking 2 pip install phases because global options like require hashes are all or nothing. My point here is that the existing interfaces could also be improved to allow less complicated code. Example of how this complicated the implementation: https://github.com/pypa/pipenv/blob/main/pipenv/core.py#L1532-L1704
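The two-phase constraint can be sketched roughly as follows. This is not pipenv's batch_install code; the helper names and file names are made up for illustration, and it assumes hashes are written in the usual `--hash=` form. Because pip's hash checking applies to a whole invocation, hashed and unhashed requirements have to go into separate runs.

```python
import sys
from typing import Iterable, List, Tuple


def split_by_hashes(lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Partition requirement lines into (hashed, unhashed), skipping blanks and comments."""
    hashed: List[str] = []
    unhashed: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Assumption: a hashed requirement carries at least one `--hash=` option.
        (hashed if "--hash=" in line else unhashed).append(line)
    return hashed, unhashed


def pip_commands(hashed_file: str, unhashed_file: str) -> List[List[str]]:
    """Build one pip invocation per batch; only the first one enforces hashes."""
    base = [sys.executable, "-m", "pip", "install"]
    return [
        base + ["--require-hashes", "-r", hashed_file],
        base + ["-r", unhashed_file],
    ]
```

Each batch would be written to its own requirements file and the two commands run one after the other, for example with `subprocess.run(cmd, check=True)`.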

pradyunsg commented 1 year ago

Why is it ok for pip to do this but advocate for pipenv to not? (I get the argument that you don't view pip as a library)

None of the arguments other than fragility from https://pip.pypa.io/en/stable/development/vendoring-policy/#rationale apply IMO.

Having pipenv live outside of the environments it manages and outside of a user-managed environment where they could run a command and break it, would solve the fragility problem while avoiding all the caveats that come with vendoring.

pradyunsg commented 1 year ago

I'll also note that currently, pipenv isn't merely vendoring dependencies but vendoring forks of those dependencies, given that you have patches to implement entire features in the packages/dependencies that you vendor.

That means that it's harder for downstream redistributors to redistribute the package and harder for you to keep up to date with the dependency (you are effectively managing a fork at that point). That's... not particularly sustainable IMO.

matteius commented 1 year ago

@pradyunsg That is true and we would love to be able to drop the patches, but currently they serve important purposes. We have dropped some of the patches that existed prior to this year, since Oz and I began working on the project. The problem for most of the functional pip patches is that they touch the internals and won't be ported to pip unless pip itself expressly had a use for them through the public interface.

The index restricted packages patch is from this year, however, and was to address a security issue reported multiple times around the package resolver's preference for PyPI and package confusion attacks -- we essentially solved this by saying the default source in your Pipfile is the source to use for all packages unless a package is explicitly defined to use a different source. That was why I opened the upstream PR, to try and avoid maintaining another patch file; however, I have felt and continue to feel the security concern warranted the patch.

Thanks also for having this discussion--just a bit about me, I like to present the technical hurdles I know about. So it's not that I am resistant to figure out how we can improve this, but I know what some of the problem set is like and haven't seen yet how to overcome the challenges. I am also proud of the current state of vendoring, as we have really worked on the scripts and processes to ensure a consistent vendoring outcome with the proper imports--this was in a much worse state when I started. That being said, let's carve out some paths forward. We can explore the installer route, but I think some of the items mentioned are almost blockers to getting to the point of rolling that out.
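A rough sketch of that rule, not the actual patch: the function name is hypothetical, the data mirrors a parsed Pipfile (`[[source]]` entries plus package specs with an optional `index` key), and the first source is assumed to be the default.

```python
from typing import Dict, List, Union

PackageSpec = Union[str, Dict[str, str]]


def index_for(
    package: str,
    packages: Dict[str, PackageSpec],
    sources: List[Dict[str, str]],
) -> str:
    """Return the single index URL the given package may be fetched from."""
    by_name = {source["name"]: source["url"] for source in sources}
    spec = packages.get(package, "*")
    if isinstance(spec, dict) and "index" in spec:
        # The package explicitly names a source, so only that source is used.
        return by_name[spec["index"]]
    # Everything else, including transitive dependencies that are not listed,
    # is restricted to the default source (assumed to be the first entry).
    return sources[0]["url"]
```

With sources `pypi` and `private` and a package declared as `{version = "*", index = "private"}`, only that package resolves against the private index; a same-named package on another index is never considered.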

pradyunsg commented 1 year ago

My point here is the existing interfaces could also be improved to make less complicated code. Example of how this complicated the implementation: main/pipenv/core.py#L1532-L1704

I'll note that none of this is designed to be a library. This sort of shit is exactly why pip isn't supposed to be used as a library in fact. :P

I like to present the technical hurdles I know about. So it's not that I am resistant to figure out how we can improve this, but I know what some of the problem set is like and haven't seen yet how to overcome the challenges.

Yup yup, I understand and empathize with that. Sometimes, merely stating the situation itself can look like defending it. TBH, the way I see this situation is that... well, this project has a huge pile of technical debt from a series of really bad design decisions early on in its lifecycle. :)

matteius commented 1 year ago

This sort of shit is exactly why pip isn't supposed to be used as a library in fact. :P

But that is actually an example where we use sub-process and invoke pip "normally" using the public interfaces :-D

Thanks again -- I've been pretty busy with a construction project and being off work for two weeks; this week I've been back to work, so it's been a real whirlwind.

From the perspective of what pip could do to improve things: if there were a public interface to the resolver that we could use for the lock phase, that would be something we would try to integrate with instead of using the internal APIs to the resolver. Maybe this interface could make use of code similar to that referenced PR for index restricted packages, and then there would be a use case for that on both sides. We could pick apart the interface we currently have into the resolver to try and negotiate what the public interface could be like. The current code should be a lot easier to understand now that pip-shims is not involved in obfuscating everything.
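For context, one public surface that newer pip releases (22.2 and later) do provide is the JSON installation report, produced with `pip install --dry-run --report`. The sketch below is not what pipenv does and not the interface proposed above; the helper names are made up. It only shows how resolver output could be read without importing pip internals, and the report describes the environment pip ran in, so it does not address locking for other platforms.

```python
import sys
from typing import Dict, List


def resolve_command(requirements: List[str], report_path: str) -> List[str]:
    """Build a pip invocation that resolves without installing and writes a JSON report."""
    return [
        sys.executable, "-m", "pip", "install",
        "--dry-run", "--ignore-installed", "--quiet",
        "--report", report_path,
        *requirements,
    ]


def pins_from_report(report: dict) -> Dict[str, str]:
    """Map each resolved distribution name to its version from a parsed report."""
    return {
        item["metadata"]["name"]: item["metadata"]["version"]
        for item in report["install"]
    }
```

After running the command with `subprocess.run(cmd, check=True)`, the report file can be loaded with `json.load` and passed to `pins_from_report`.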

oz123 commented 1 year ago

@pradyunsg thank you for your interest in pipenv development. Matt has already answered many of your concerns. I'll just add my two cents: First,

well, this project has a huge pile of technical debt from a series of really bad design decisions early on in its lifecycle. :)

Can't agree more with you here. Pipenv was and is very popular, and it's not bad that we also have other solutions that compete in this space. However, not all of them do all that it does, and sometimes it's hard for users to move away to a new project. So, here we are, humbly working almost for free to fix this situation. Pipenv was really in bad shape, but we are trying to keep the user interface while really rewriting it from the inside. Matt has done a tremendous job here on requirementslib and the resolver. I was more busy cleaning up those forks and all those vendored libraries we were dragging along.

FWIW, it sounds like pipenv should invest in setting up an installer

Actually, we do have an installer, which is based on the pip installer.

https://github.com/pypa/pipenv/blob/main/get-pipenv.py

However, you can also just use pip.

We'd like to see all the issues you mentioned fixed, but being just two devs with full-time jobs, our time is limited. Hence, it will take time to fix everything. However, rest assured they will be fixed.

oz123 commented 1 year ago

BTW, I think we should move this to a discussion instead of an issue ...

pradyunsg commented 1 year ago

we are trying to keep the user interface and really rewriting it from the inside.

Same, with pip. :)

Actually, we do have an installer, which is based on the pip installer.

main/get-pipenv.py

That's not what I meant by "installer" -- Poetry's installer installs poetry in a separate virtualenv, placed at a well-known location outside of the user's Python environments.

That way, Poetry lives with its own environment (like an application). This environment cannot be broken by changing packages via pip unless the user goes out of their way to do that.
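The shape of such an installer can be sketched with the standard library alone. This is not Poetry's installer and not an existing pipenv script; the directory layout and function names are assumptions made for illustration.

```python
import os
import subprocess
import venv
from pathlib import Path
from typing import List


def tool_home(base: Path, name: str = "pipenv") -> Path:
    """Hypothetical well-known location for the tool's private environment."""
    return base / name / "venv"


def venv_python(env_dir: Path) -> Path:
    """Path to the interpreter inside a virtual environment."""
    if os.name == "nt":
        return env_dir / "Scripts" / "python.exe"
    return env_dir / "bin" / "python"


def install_command(env_dir: Path, package: str = "pipenv") -> List[str]:
    """pip invocation that installs the tool into its own environment only."""
    return [str(venv_python(env_dir)), "-m", "pip", "install", package]


def install(base: Path) -> None:
    """Create the private environment and install the tool into it."""
    env_dir = tool_home(base)
    venv.EnvBuilder(with_pip=True).create(env_dir)
    subprocess.run(install_command(env_dir), check=True)
```

Calling `install(Path.home() / ".local" / "share")` would leave the tool and its dependencies in a directory that an ordinary `pip install` in the user's own environments never touches; a real installer would also put a launcher script on `PATH`.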

frostming commented 1 year ago

I'll note that none of this is designed to be a library. This sort of shit is exactly why pip isn't supposed to be used as a library in fact. :P

True, but there is a growing need to reuse some parts of pip internals, and few have alternatives:

Most importantly, none of the above alternatives is as battle-tested as pip, and the open-source maintainers have to work hard on keeping the behaviors consistent with pip. So it is a reasonable choice to vendor the entire pip as a library for pipenv.

IMO the situation won't improve until pip either extracts all the components into independent libraries, or replaces the internal functionalities with some of the open-source alternatives.

oz123 commented 1 year ago

If someone uses pip as a library and it's working for them, I don't see why pip developers should care. Of course, if these same people always complained about it in the pip channels, the response could be: we told you not to do so. On the other hand, if some people use it and there is a need for it, the pip development community could yield and make a public API (which isn't through shell commands). Of course, these are just my two cents. However, the fact that there are many Python package managers and virtual environment managers (pipenv, conda, hatch, poetry have all existed for a long, long time) shows that there is a long-standing dissatisfaction with Python package management. The maintainers of this project (myself included) scratch their own itch or have a strong interest in package management, and thus decided to put effort into solving the problem the way they see fit. I honestly don't see why that interferes with pip development.

oz123 commented 1 year ago

If pip offered a public API, we could stop vendoring it. Here is another attempt to make a public API for pip.

https://github.com/reynoldsnlp/pipster

kalebmckale commented 1 year ago

My suggestion for isolating pipenv, or at least what I do, is to install it via pipx. No reason to reinvent the wheel when a solution exists. Perhaps we could document this as an option? Now, I'm second-guessing whether there is already a reference to this somewhere in the documentation. 😁

frostming commented 1 year ago

My suggestion for isolating pipenv, or at least what I do, is install it via pipx

True, pipx isolates things quite well, but there are still people who install the tool under the global site-packages. It may cause conflicts even in an isolated venv, since pip is such a fundamental library that exists in every venv. IIRC pipx installs pip in a shared location and injects the path via a pth file. I suspect it may conflict with the pipenv dependency.

kalebmckale commented 1 year ago

No, no, I'm not suggesting to use pipx to isolate pip but pipenv. Different tools that create virtual environments often include their own copies of pip anyhow.

In practice, my user/local site-packages directory is bare-to-empty. I have a number of tools installed via pipx, including pipenv, and I inject related libraries and plugins into those application environments. Any packages related to a project, or needed for development or testing of the project, are included in categories of the Pipfile (to ensure no dependency conflicts for users and developers alike) and installed (as needed) by pipenv into the project's virtual environment, created by pipenv. Also, I always use --no-site-packages to set up the environment.

Anyhoo... some of that may be a bit off-topic but just thought I'd share how my workflow uses pipenv.