Open pfmoore opened 1 year ago
FWIW, I don't want to say that I'm -1 on this specifically, but I don't think it's going to solve the use cases that people keep asking for like filtering by age, etc. Like if people really want a way to start a temporary index server with pip, then it's fine? But I don't think most people want that, nor would they be particularly happy with requiring it for the feature they actually want.
I agree that it's a solution designed by a techie, which is only likely to be loved by techies. So sue me 🙂 And it's absolutely me scratching a personal itch. But I do think it could enable reasonably user friendly idioms.
Imagine, for example, `pip install --proxy-script="newer-than 2020-01-01"`. I'm playing with my proposed syntax here (no `.py` extension on the script name, allowing args to be passed to the script), but that's just messing with the UI, not changing the basic idea.
I can even imagine someone writing a customisable proxy, so that users could put `proxy-script = fancyproxy` in their config, and then configure what they want in a `fancyproxy.toml` configuration file. Again, it wouldn't suit everyone, but it might suit a sufficiently large segment of the (already small) group of people wanting options to control pip's index scanning, to give us a solution we can offer them in the short term, while still leaving open the possibility of a built-in solution in the longer term.
I'm also inclined to think that "start a local index proxy" is a much bigger feature than "constrain certain packages to certain indexes (possibly via the constraints file)". Typically the latter is what's needed, and some kind of wildcard in names in constraints files would get to basically the same point as simpleindex covers (without drastically increasing the number of things they could screw up).
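To illustrate the "wildcard in names in constraints files" idea, here is a hypothetical sketch of how pattern-to-index matching could work. The mapping syntax and the `index_for` helper are invented for illustration; pip has no such feature today:

```python
from fnmatch import fnmatch

# Invented example: map name patterns to the index they must come from.
# First matching pattern wins, so put the most specific patterns first.
CONSTRAINTS = {
    "mycorp-*": "https://pypi.example.com/simple/",
    "*": "https://pypi.org/simple/",
}

def index_for(package):
    """Return the index URL for the first pattern that matches."""
    for pattern, index in CONSTRAINTS.items():
        if fnmatch(package.lower(), pattern):
            return index
    return None

print(index_for("mycorp-utils"))  # https://pypi.example.com/simple/
print(index_for("requests"))      # https://pypi.org/simple/
```

This covers the "constrain certain packages to certain indexes" case without any separate server process, which is the appeal of the constraints-file approach.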
In our case, we have builds of particular libraries that we want to ensure use our wheel rather than trying to build from sdist again (which we know is going to fail). Simply using `find-links` doesn't prevent a newer release on PyPI from trying to build from source, and we need to allow a range of other packages to install normally.
That said, it's a pain in CI to start simpleindex, configure the correct `PIP_INDEX_URL` and `PIP_TRUSTED_HOST` variables, and run the main task in a separate step, then safely close simpleindex, so having it all encapsulated in pip would be great. But then, we also use it to override tox runs, so perhaps we'd still need our existing approach to ensure the settings get through? Plus we use it to wrap up Azure Artifacts in a way that enables pip caching to work and avoids ever having to put authentication tokens in URLs, which is a totally different feature request.
In any case, I'm +1 on the problem description, and +0 on this proposal (I'll take it if there's motivation to implement it, but I think a separate constraints list is simpler and almost as good).
Personally I am inclined to say this is not suitable for pip, but I have had similar needs for this, and I suspect the scenario is not uncommon (or it should be more common). It’s probably a good idea to develop a ready-made solution that spins up a server in the background and runs pip against it in one command, so we can point people to it when asked.
In addition to a script, could you have a dependency specifier and an entry-point? E.g. `--index-server='simpleindex >= 0.6 : simpleindex:run'`. Any arguments to the entry-point would likely want to be defined in a PEP.
If the script is run through a sub-process, perhaps instead of passing arguments, you look at its stdout, grepping for e.g. `.*listening on[^:]*: (?P<url>.+)` and use that to set the index URL.
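Using the suggested pattern, the stdout-scraping step could be sketched like this. The banner text is a made-up example of what a proxy might print; only the regex comes from the suggestion above:

```python
import re

# Pattern from the suggestion above: find a "listening on" line and
# capture the URL that follows the colon.
LISTEN_RE = re.compile(r".*listening on[^:]*: (?P<url>.+)")

def extract_index_url(stdout_lines):
    """Return the first index URL announced on stdout, or None."""
    for line in stdout_lines:
        match = LISTEN_RE.match(line)
        if match:
            return match.group("url").strip()
    return None

# Example: the sort of startup banner a proxy might print (invented).
lines = [
    "simpleindex 0.6.0 starting...",
    "Now listening on address: http://127.0.0.1:8143/",
]
print(extract_index_url(lines))  # http://127.0.0.1:8143/
```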
> In addition to a script, could you have a dependency specifier and an entry-point?
TBH, the intention here is to remove complexity from pip, by providing one, simple, option that covers common requests. Adding complexity to this feature is to an extent working against that goal. So yes, we could, but I'd prefer not. For example, you're saying include a dependency specifier - do you expect pip to download that dependency? Where should pip install it? How should it download it, when we haven't started the proxy yet? Things get out of hand very fast.
> It’s probably a good idea to develop a ready-made solution that spins up a server in the background and runs pip against it in one command so we can point people to it when asked.
Yeah, I feel like the reaction to this proposal has mostly been "we should make it easier to use proxy servers, but not like this". And if it doesn't result in fewer people asking for more complexity in pip, then it's not adding much value.
Maybe a separately distributed utility `pip-with-proxy <args to start the proxy> -- <args passed to pip>` is sufficient.
If I understood the problem space correctly, I feel like this could be solved with a declarative configuration file like the one I hinted at here: https://discuss.python.org/t/python-packaging-strategy-discussion-part-1/22420/127
I will call this potential file format Constraints 2 for now, because I feel like `constraints.txt` is the closest we have to this currently, and I think that would replace it.
I understand that by relying on a separate proxy process (like simpleindex) as suggested in the initial post, there would be relatively little change needed in pip itself, which is an advantage.
On the other hand, although it would require much more work, I feel like the problem space that Constraints 2 could cover is much wider than what could be done with a proxy. If I am not mistaken, this would cover dependency confusion attacks as well, which seem to have prompted this discussion here and this other one.
If there was to be such a Constraints 2 file format, possibly standardized, then other tools (installers, dev workflow tools, and so on) could take advantage of it as well.
Is it something potentially interesting? Should I try to start a dedicated discussion thread somewhere? Gather some use cases and rough design concepts?
> Is it something potentially interesting? Should I try to start a dedicated discussion thread somewhere? Gather some use cases and rough design concepts?
IMO, it's offtopic for this issue, but it's potentially interesting. Personally, I'd like to see an implementation in the form of an index proxy that implemented filtering via a "constraints 2 format" configuration. That would be usable right now by pip, would be portable to any other application that wanted to view indexes through a "constraints 2" filter, and would allow rapid development independent of the resource constraints of projects like pip.
If it became popular, and people found starting up a standalone index proxy a problem, then we could look at building the implementation into pip. At that point, we might even consider vendoring the proxy library and making the pip implementation be "start the vendored proxy, and point pip at it".
So yes, go for it, but I would strongly recommend basing the initial proposal on a filtering server proxy, rather than on "if pip implements this, life would be great". Because the latter is a great way to end up with all talk and no action, unfortunately.
@pfmoore It cannot be just a proxy, because what I have in mind is to also fix the metadata of some packages. So in other words it would not just impact the "package finder" but also the "dependency resolution". I will try to put my ideas in a dedicated thread. Thanks for the feedback.
Anyway, I very much like the concept and implementation of simpleindex, I recommend it often. So your initial suggestion (of a tighter integration) seems good to me.
@sinoroc OK, in that case (a) it's definitely offtopic for this thread, and (b) I'm concerned that it's yet more complexity for pip (and other standards-conforming tools) to have to implement. I'd like to see the proposal discuss "how does this reduce overall complexity rather than increasing it", as the current trend seems to be to pile ever more requirements on tools.
@pfmoore Understood. I will try to publish my thoughts on this in a document somewhere. Feel free to mark these messages as off-topic.
Given that it's likely that the index proxy will have more complex settings than pip, maybe we should be pushing the command into the proxy tools? So more like:
```
python -m simpleindex --config <CONFIG> -- python -m pip install -r requirements.txt
```

(and it sets the environment variables for pip, and maybe other relevant tools if needed, while running arbitrary commands after the `--`, since `tox` is also a likely candidate)
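A thin wrapper of this shape could be sketched as follows. Everything here is an assumption for illustration: the announced index URL, the trusted-host value, and the idea that the proxy tool hands off via `PIP_INDEX_URL`/`PIP_TRUSTED_HOST` are not an existing simpleindex interface:

```python
import os
import subprocess

def proxy_env(index_url, base=None):
    """Build the environment for the child command, pointing pip (and
    anything else honouring these variables) at the local proxy."""
    env = dict(base if base is not None else os.environ)
    env["PIP_INDEX_URL"] = index_url
    # A local proxy typically serves plain HTTP, so pip must trust the host.
    env["PIP_TRUSTED_HOST"] = "127.0.0.1"
    return env

def run_with_proxy(proxy_cmd, child_cmd, index_url):
    """Start the proxy, run the child command against it, then shut down."""
    proxy = subprocess.Popen(proxy_cmd)
    try:
        return subprocess.call(child_cmd, env=proxy_env(index_url))
    finally:
        proxy.terminate()
        proxy.wait()

# Usage sketch (hypothetical commands):
# run_with_proxy(
#     ["python", "-m", "simpleindex", "--config", "config.toml"],
#     ["python", "-m", "pip", "install", "-r", "requirements.txt"],
#     index_url="http://127.0.0.1:8143/simple/",
# )
```

Setting environment variables rather than rewriting command lines is what lets the same wrapper cover pip, tox, or anything else that honours `PIP_INDEX_URL`.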
Ah, that makes sense. Probably not in the main entry point, but something like `python -m simpleindex.runpip` feels like a good idea.
I'm a strong proponent of there being a solid and re-usable simple repository core library which includes proxying, filtering and merging functionality. If such a library existed, why not vendor it into pip and expose the behaviour through pip config? Sure, there would be some additional configuration complexity within pip, but the behavioural implementation/complexity would be standalone and re-usable.
> I agree that it's a solution designed by a techie, which is only likely to be loved by techies
> maybe we should be pushing the command into the proxy tools
I think the proposal would be quick to implement, but IMO lacks realistic adoption - it doesn't feel like a well-integrated solution to the problem that vendoring a domain-specific tool could provide. It is also unnecessarily wasteful to start a new process, to open up a new port, and to use HTML to communicate between processes (not to mention the subprocess issues that would inevitably turn up, e.g. a race condition on opening the target repository's port when invoking pip concurrently).
Based on my recent `simple-repository` experience, all of this can be done through a standard object interface, when done well. To give a flavour of what I have in mind (pseudo-code) on the pip side:
```python
def build_repo(pip_config) -> SimpleRepository:
    repo: SimpleRepository
    # Start with the base http repo
    repo = HttpRepo(pip_config.index_url)
    if pip_config.extra_index_url:
        repo = MergedRepo(repo, HttpRepo(pip_config.extra_index_url))
    if pip_config.find_links:
        repo = MergedRepo(repo, LocalRepo(pip_config.find_links))
    if pip_config.exclude_newer:  # The uv feature --exclude-newer
        repo = TimeFilterRepo(repo, exclude_newer_than=pip_config.exclude_newer)
    ...
    return repo
```
Within pip, this could then be consumed through standard methods (e.g. `get_project_page` and `get_project_resource`).
The premise that you can chain a repository is also the basis for the `simple-repository` core library (https://github.com/simple-repository/simple-repository), but IMO the idea is applicable whether or not that library is considered.
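To make the chaining idea concrete, here is a minimal, self-contained sketch. The class names mirror the pseudo-code above, but the interface and bodies are illustrative assumptions, not the actual `simple-repository` API:

```python
from datetime import datetime

class SimpleRepository:
    """Minimal interface: a project page is a list of (filename, upload_time)."""
    def get_project_page(self, name):
        raise NotImplementedError

class DictRepo(SimpleRepository):
    """Stand-in for HttpRepo/LocalRepo: serves pages from a dict."""
    def __init__(self, pages):
        self._pages = pages
    def get_project_page(self, name):
        return list(self._pages.get(name, []))

class MergedRepo(SimpleRepository):
    """Concatenate the pages of two upstream repositories."""
    def __init__(self, first, second):
        self._first, self._second = first, second
    def get_project_page(self, name):
        return (self._first.get_project_page(name)
                + self._second.get_project_page(name))

class TimeFilterRepo(SimpleRepository):
    """Drop files uploaded after a cut-off (cf. uv's --exclude-newer)."""
    def __init__(self, source, exclude_newer_than):
        self._source = source
        self._cutoff = exclude_newer_than
    def get_project_page(self, name):
        return [(f, t) for f, t in self._source.get_project_page(name)
                if t <= self._cutoff]

# Chain the layers exactly as build_repo() does above.
main = DictRepo({"demo": [("demo-1.0.tar.gz", datetime(2019, 6, 1))]})
extra = DictRepo({"demo": [("demo-2.0.tar.gz", datetime(2021, 3, 1))]})
repo = TimeFilterRepo(MergedRepo(main, extra),
                      exclude_newer_than=datetime(2020, 1, 1))
print(repo.get_project_page("demo"))  # only the 2019 file survives
```

Each layer only needs to know about the repository it wraps, which is what makes new filters (age, name patterns, index priority) cheap to add.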
What's the problem this feature will solve?
Many problems people have with index server handling (such as index priority, dependency confusion attacks, filtering available package releases by age, etc) can relatively simply be solved using an index proxy, that presents a "view" of the underlying index(es). However, users are typically reluctant to use such a proxy index. Generally it's difficult to get people to articulate the reasons they don't like this option, but the most common complaint is the need to manage a separate running service.
Describe the solution you'd like
If pip had an option that specified a script which started an index server for the duration of the pip invocation, people could use this to avoid the need to have a permanently running index proxy.
For example, the user creates[^1] a script `my_index.py`, which starts up an index server. Then, they invoke pip using the command `pip install --index-script=my_index.py ...`. When started, pip will run the index script, and communicate with it to agree on a proxy URL that it will provide. The rest of the pip invocation works as normal, using the temporary index. When pip completes, it shuts down the proxy automatically.

The details of the communication between pip and the script will need to be established, but it can probably be as simple as pip choosing a port number, and passing it as an argument to the script.
[^1]: In an ideal world, an ecosystem of proxy implementations will become available, so the user simply downloads a suitable script and configures it.
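The "pip chooses a port and passes it as an argument" handshake could look something like this. This is a sketch under assumptions: the argument convention, the `/simple/` URL shape, and the connect-polling readiness check are all illustrative, not a proposed specification:

```python
import socket
import subprocess
import sys
import time

def choose_free_port():
    """Ask the OS for an ephemeral port that is currently free.

    Note: the port could in principle be taken again before the script
    binds it - a real design would need to handle that race.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))
        return sock.getsockname()[1]

def start_index_script(script, timeout=10.0):
    """Launch the user's index script with a chosen port, and wait until
    something is accepting connections on that port."""
    port = choose_free_port()
    proc = subprocess.Popen([sys.executable, script, str(port)])
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection(("127.0.0.1", port), timeout=0.2):
                return proc, f"http://127.0.0.1:{port}/simple/"
        except OSError:
            time.sleep(0.1)
    proc.terminate()
    raise RuntimeError(f"{script} did not start listening on port {port}")

print(choose_free_port())  # an ephemeral port number (varies per run)
```

pip would then use the returned URL as its index for the rest of the invocation, and terminate the process on exit.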
Alternative Solutions
The current approach, where the user has to manually start a proxy before running pip, is viable, but appears unattractive to users as a solution.
Alternative solutions to individual issues have been proposed as pip feature requests - for example, #8606 and https://discuss.python.org/t/proposal-preventing-dependency-confusion-attacks-with-the-map-file/23414. These solve individual problems, but are not as general as the proposed solution.
Additional context
Creating an index proxy script is potentially more complexity than many users will be comfortable with, which is likely to limit the adoption of this proposal. This can be addressed, at least in part, by publishing a set of scripts that handle well-known cases like index prioritisation.
Unfortunately, pip has a limited amount of developer resource, and that makes it difficult to implement solutions to the various issues raised in a timely manner. Also, it's necessary to make sure that any solution works for every user's situation, which further delays resolution. By, in effect, "outsourcing" the work of solving the issue to the end user, simple, tightly focused solutions can be delivered in a much more timely manner, and pressure is taken off the volunteers supporting pip.