Shared redesign of embedded requirements feature

jaraco / pip-run

pip-run - dynamic dependency loader for Python

MIT License

Shared redesign of embedded requirements feature #44

Closed jaraco closed 1 year ago

jaraco commented 4 years ago

Background

In pypa/pip#3971, key members of the PyPA have leveled critiques of the embedded requirements feature, where a user can supply requirements and other common installer directives to signal to the tool how (best) to run the script. This feature emerged from the use-case where a script author would like to distribute a single file and have that file be runnable by any number of end users with minimal guidance, allowing the file to be published to a gist or alongside other scripts in a directory without needing additional context for executability.

Goals

the tool should be able to infer install requirements and index URL.
the tool must be able to parse these directives without executing the script (as executing the script should be able to rely on the presence of those items).
the syntax should be as intuitive as possible. As a corollary, the syntax should aim to re-use syntax familiar to the user (primarily the author, but also the end user).
(optional) the tool should be able to infer additional installer directives such as --extra-index-url or --quiet.

jaraco commented 4 years ago

Option 1: Python Objects

The current implementation achieves the goals by soliciting special variables from the user. This feature was inspired by the setuptools technique from 15 years ago, in which requirements were embedded into the console_scripts generated by setuptools and then "required" at runtime by pkg_resources.

In this approach, the user specifies their requirements in the script using Python syntax and special variable names exposing static values. Example:

#!/usr/bin/env python
# Run this with pip run

__requires__ = ['pymongo', 'requests>2']
__index_url__ = 'http://my.corp/index/'

import pymongo
import requests
...

The tool parses the Python and extracts the requirement specifications from the declared variables and incorporates those during the install phase.
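A minimal sketch of how such static extraction could work, using only the standard library `ast` module. The helper name `read_static_assignment` is hypothetical and is not taken from pip-run's implementation; the point is only that the variables can be read without executing the script.

```python
import ast


def read_static_assignment(source, name):
    """Return the literal value assigned to `name` at module level, or None."""
    tree = ast.parse(source)  # parses only; the script is never executed
    for node in tree.body:
        if not isinstance(node, ast.Assign):
            continue
        for target in node.targets:
            if isinstance(target, ast.Name) and target.id == name:
                # literal_eval accepts only literals, so a dynamically
                # constructed value raises ValueError instead of running
                return ast.literal_eval(node.value)
    return None


script = '''
__requires__ = ['pymongo', 'requests>2']
__index_url__ = 'http://my.corp/index/'

import pymongo
import requests
'''

requires = read_static_assignment(script, '__requires__')
index_url = read_static_assignment(script, '__index_url__')
print(requires, index_url)
```

Because the imports are only parsed, the sketch works even when pymongo and requests are not yet installed, which is the property the second goal asks for.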

Pros

Re-uses the familiar Python syntax.
- Gets automated tooling support (consistent style through black, syntax checking and highlighting).
Re-purposes a technique established by setuptools.
Parsing is included (with every Python interpreter).
Allows for introspection by the script or other libraries (through __main__.__requires__).
Specification is dead-simple (define global variables using these names with static values).
Implementation exists and is proven to work.

Cons

Seems magical.
- Requires special handling (parsing routine).
Users may be tempted to dynamically construct the list (through runtime execution), but only static values are supported.
Pollutes globals()/__main__.
Only exposes specific features.
A Python list has more syntactic characters (,, []) than a simple list.

Variant 1: single variable

To limit polluting the namespace, the tool could solicit all possible values through a single variable, something like:

__pip_run__ = dict(
    requirements=[...],
    index_url = '...',
)

This approach might prove easier to extend. It also encapsulates the values into a namespace, avoiding proliferation of dunder-names. It is probably more tedious to use, though.
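One wrinkle with this variant: `dict(...)` is a function call rather than a literal, so `ast.literal_eval` cannot evaluate the whole expression. A rough sketch of one way around that, assuming the tool evaluates each keyword value separately (`read_pip_run_settings` is a hypothetical name):

```python
import ast


def read_pip_run_settings(source):
    """Return the values from a module-level ``__pip_run__`` assignment."""
    for node in ast.parse(source).body:
        if not isinstance(node, ast.Assign):
            continue
        names = [t.id for t in node.targets if isinstance(t, ast.Name)]
        if '__pip_run__' not in names:
            continue
        value = node.value
        if isinstance(value, ast.Call) and getattr(value.func, 'id', None) == 'dict':
            # dict(...) is a call, so evaluate each keyword value on its own;
            # non-literal values still raise ValueError
            return {kw.arg: ast.literal_eval(kw.value) for kw in value.keywords}
        # a plain {...} literal can be evaluated directly
        return ast.literal_eval(value)
    return {}


script = '''
__pip_run__ = dict(
    requirements=['pymongo', 'requests>2'],
    index_url = 'http://my.corp/index/',
)
'''

settings = read_pip_run_settings(script)
print(settings)
```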

jaraco commented 4 years ago

Option 2: Requirements.txt in comments

This approach re-uses the requirements.txt syntax but as a multi-line comment in the script. Example:

#!/usr/bin/env python
# Run this with pip run

# requirements:
# -i http://my.corp/index
# pymongo
# requests >= 2

import pymongo
import requests
...

Some features of requirements.txt files might be excluded. The tool will parse out the values and apply them as if they had been passed as the equivalent options on the command line.
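A sketch of what that extraction could look like. The rules used here are assumptions for illustration only: the block starts at a `# requirements:` line, ends at the first line that is not a comment, and any entry starting with `-` is treated as an installer option.

```python
def extract_requirements_comment(source, header='# requirements:'):
    """Split the comment block after `header` into (options, requirements)."""
    lines = iter(source.splitlines())
    for line in lines:
        if line.strip().lower() == header:
            break
    else:
        return [], []  # no header found
    options, requirements = [], []
    for line in lines:
        if not line.startswith('#'):
            break  # the first non-comment line ends the block
        entry = line[1:].strip()
        if not entry:
            continue
        # entries starting with "-" are treated as options, e.g. "-i URL"
        (options if entry.startswith('-') else requirements).append(entry)
    return options, requirements


script = '''#!/usr/bin/env python
# Run this with pip run

# requirements:
# -i http://my.corp/index
# pymongo
# requests >= 2

import pymongo
import requests
'''

options, requirements = extract_requirements_comment(script)
print(options, requirements)
```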

Pros

Allows for re-use of most/all of familiar requirements.txt support.
Avoids interacting with the execution (completely ignored at run time).
Re-uses existing requirements.txt syntax.

Cons

Users may expect to be able to use unsupported features of requirements.txt.
Python and requirements.txt syntaxes are entangled.
- Requires adding prefix characters (# and space) to every line.
Requires a specialized parser to recognize/extract lines.
- And a spec for the syntax.
Requires parsing and interpreting the file's contents.
Declared requirements cannot be readily inspected by the script or other libraries.

Variant 1: Install using `-r`

Instead of parsing the requirements.txt, the tool could save the contents of this comment to a temporary file and install that using -r. This approach would limit the work to parse and interpret the file's contents, but would have the unfortunate consequence of not honoring already installed packages.
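A sketch of this variant, assuming the comment block has already been extracted into a list of lines. It only builds the pip command rather than running it; the helper names and the use of `--target` with a hypothetical directory are illustrative assumptions, not a description of pip-run's internals.

```python
import os
import sys
import tempfile


def write_requirements_file(entries):
    """Write entries to a temporary requirements file and return its path."""
    with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as stream:
        stream.write('\n'.join(entries) + '\n')
    return stream.name


def install_command(req_path, target):
    """Build (but do not run) a pip command that installs the file into `target`."""
    return [sys.executable, '-m', 'pip', 'install', '--target', target, '-r', req_path]


entries = ['-i http://my.corp/index', 'pymongo', 'requests >= 2']
req_path = write_requirements_file(entries)
command = install_command(req_path, target='/tmp/pip-run-example')
print(command)
os.remove(req_path)  # clean up; a real tool would do this after pip finishes
```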

jaraco commented 4 years ago

Option 3: embedded command-line arguments

The tool could solicit arbitrary arguments to the pip install command, either in a shebang header or with some specialized syntax. For example:

#!/usr/bin/env python
# Run this with pip run

# pip run install args:
# pymongo "requests>=2" --index-url=http://my.corp/index

Pros

Builds on familiar command-line syntax.
Easily and intuitively extensible.

Cons

It may prove difficult to reliably parse these command-lines. Presumably shlex would be used.
Shares many of the cons from Option 2.
User may attempt to pass invalid arguments.
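For reference, a sketch of the shlex-based parsing mentioned in the cons. The header text and the end-of-block rule (first non-comment line) are assumptions made for this example.

```python
import shlex


def parse_embedded_args(source, header='# pip run install args:'):
    """Return the pip install arguments found in the comment block after `header`."""
    lines = iter(source.splitlines())
    for line in lines:
        if line.strip() == header:
            break
    else:
        return []
    args = []
    for line in lines:
        if not line.startswith('#'):
            break  # the first non-comment line ends the block
        # shlex.split applies POSIX shell quoting rules, so "requests>=2"
        # loses its quotes and stays a single argument
        args.extend(shlex.split(line[1:]))
    return args


script = '''#!/usr/bin/env python
# Run this with pip run

# pip run install args:
# pymongo "requests>=2" --index-url=http://my.corp/index

import pymongo
'''

args = parse_embedded_args(script)
print(args)
```

Note that `shlex.split` raises `ValueError` on unbalanced quotes, so a real tool would need to decide how to report that to the script author.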

Variant 1: Shebang inline

Instead of including the command-line options in the Python portion of the file, include them in the Unix shebang line.

#!/usr/bin/env python -m pip run pymongo requests>=2 --index-url=http://my.corp/index --

Pros

May allow the script to be executed directly in select environments (Unix environments where the correct python interpreter is indicated and pip run is available).

Cons

Shebang syntax is unfamiliar to non-Unix users.
Shebang syntax may not support > characters and definitely doesn't support spaces (e.g. requests >= 2).
It may not be obvious that supplying the parameters in a shebang would be interpreted in advance by a tool like pip-run.
Pip install args are entangled with other concerns (python invocation, run invocation).

Variant 2: Constrain the allowed arguments

It may be possible/desirable to constrain the allowed arguments to select ones, but that of course requires maintaining that constraint.

Variant 3: Inline variable

Much like in Option 1, the command-line options could be solicited through a single variable:

__pip_run_args__ = ['pymongo', 'requests>=2', '--index-url=http://my.corp/index']

This approach would simplify the parsing of arguments, but it has many of the same cons as Option 1 without providing a structured syntax for soliciting the values. It compromises both the intuition of the user (who's thinking "command line parameters") and the tool execution.

0az commented 4 years ago

I'm not a fan of the magic variable syntax, so I guess I'll toss my own proposal in.

Option 4: Restricted Requirements.txt in comments

This approach is like Option 2, but allows only PEP 440 version specifiers. In addition, a single-line syntax is provided for brevity.

The Requires: line MUST appear at the top of the file, before any import statements.

Pros

Easier parsing for humans and machines

Cons

Doesn't support URLs or hashes for integrity.
Needs additional syntax for index (perhaps Index: URL?)
Many of the same cons as (2)
Not particularly extensible

Example

Single line example:

#! /usr/bin/env python3
# Generate visualization

# Requires: numpy pandas~=1.0 seaborn~=0.10

Multiline example:

#! /usr/bin/env python3
# Generate visualization

# Requires:
#   numpy~=1.18
#   pandas~=1.0
#   seaborn~=0.10
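A rough sketch of a parser for both forms shown above. It assumes entries contain no internal spaces (so the single-line form can be split on whitespace) and that the multiline block ends at the first line that is not a non-empty comment. It does not enforce the rule that the Requires: line appear before any imports. `parse_requires` is a hypothetical name.

```python
import re


def parse_requires(source):
    """Return requirement strings from a '# Requires:' comment."""
    lines = iter(source.splitlines())
    for line in lines:
        match = re.match(r'#\s*Requires:(.*)$', line)
        if match:
            break
    else:
        return []
    inline = match.group(1).split()
    if inline:
        return inline  # single-line form: whitespace-separated entries
    reqs = []
    for line in lines:
        entry = line[1:].strip() if line.startswith('#') else ''
        if not entry:
            break  # blank, empty-comment, or code line ends the block
        reqs.append(entry)
    return reqs


single = '''#! /usr/bin/env python3
# Generate visualization

# Requires: numpy pandas~=1.0 seaborn~=0.10
'''

multi = '''#! /usr/bin/env python3
# Generate visualization

# Requires:
#   numpy~=1.18
#   pandas~=1.0
#   seaborn~=0.10
'''

single_reqs = parse_requires(single)
multi_reqs = parse_requires(multi)
print(single_reqs, multi_reqs)
```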

Variant Tangent

Something that I'd like is the ability to pretty-format. This is probably impractical, but here's what that'd look like:

```py
#! /usr/bin/env python3
# Generate visualization

# Requires:
# - numpy~=1.18
# - pandas~=1.0  # Look, a comment!
# - seaborn~=0.10
# - matplotlib>=3.2.1  # See GH#1234
```

Overall ratings:

Option 1: -0.5 / mildly against
Option 2: +0.5 / lukewarm, but positive
Option 3: -1.0 / strongly against
Option 4: +1.0 / strongly for

Anyway, these are my 2¢ for Proposals 1-3:

What use cases are there for introspection?
Is it necessary to allow -i https://corp.invalid/index?
Is the command line syntax really that familiar?

And another comment for thought: what about the Notebook use case? This would be very useful for notebook distribution.

pfmoore commented 2 years ago

My preference is for option 4. I don't see any value in supporting the "full" requirements syntax (which is arcane, and not standardised). Being able to specify a list of projects the script depends on, possibly with a version restriction, is IMO sufficient.

I'm not too concerned about the precise syntax. Multi-line, single-line or both doesn't bother me, and allowing the requirements to go after the imports isn't the end of the world either (I can see arguments for putting them at the end of the file). But anything works. I'm more concerned with semantics - that the data looks like a comment to Python, and is treated as a list of (project plus optional PEP 440 version specifier) values by pip-run.

I'm not clear why being able to specify an alternative index is important. In my view, "what index to use" is a property of the environment, not of the script, so it doesn't actually belong in the script metadata. There are also complex questions about whether to allow --extra-index-url as well as --index-url, which I think add confusion but little practical value. I'd prefer to omit this, and leave it as a command line option.

The same argument (it's a property of the environment, not the script) also applies to the various suggestions to let the script specify other pip-run or pip command line arguments.

To summarise my objections to the other options:

Option 1: I don't like "magic variables", and as you say users will expect to be able to build the value at runtime. Apart from this, it's the best of the options I rejected.
Option 2: Full requirements file syntax is a mess, and limiting it will be confusing for users. It's also a maintenance problem, as you'll need to track any changes pip makes to the format.
Option 3: Trying to parse arguments into an argv list is highly fragile, and there are too many pip arguments that we would not want to allow here.

I've not commented on any of the variants. None of them have any significant impact on my overall opinion of the options.

pfmoore commented 1 year ago

I'm looking at adding a similar feature to pipx - see https://github.com/pypa/pipx/issues/913.

My proposal there is

A block of comment lines in the source code. The first line must be # Requirements: and subsequent lines must be a hash, whitespace, and a requirement specifier, as defined by PEP 508. The requirements block is terminated by a blank line (or probably any line that doesn't start with a hash, I see little point in rejecting something "obviously" valid just because it misses out a blank line).

This is essentially option 4, with the following differences:

No need for the block to be at the top of the file.
The header is spelt "Requirements", not "Requires". I just think that reads better, TBH.
There's no single-line option, which IMO is redundant (and potentially problematic to parse, as PEP 508 allows a requirement to contain spaces).
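A sketch of the block format as described in this comment, using only the standard library. It extracts the requirement strings without validating them against PEP 508, and the exact regular expressions are assumptions. The sample deliberately places the block after an import and includes a requirement containing spaces, which is the case that makes a whitespace-separated single-line form awkward.

```python
import re

HEADER = re.compile(r'^#\s*Requirements:\s*$')
ENTRY = re.compile(r'^#\s+(\S.*?)\s*$')  # hash, whitespace, requirement


def read_requirements_block(source):
    """Return requirement strings from a '# Requirements:' comment block."""
    lines = iter(source.splitlines())
    # any() stops consuming the iterator at the header line, so the loop
    # below resumes on the line right after it
    if not any(HEADER.match(line) for line in lines):
        return []
    reqs = []
    for line in lines:
        match = ENTRY.match(line)
        if not match:
            break  # blank line or anything that is not "# <requirement>"
        reqs.append(match.group(1))
    return reqs


script = '''import sys

# Requirements:
# numpy
# requests >= 2 ; python_version < "3.8"

print(sys.argv)
'''

reqs = read_requirements_block(script)
print(reqs)
```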

pipx doesn't need an "Index:" section or anything similar, as it has command line arguments to handle that.

Using a common format would obviously be beneficial. But that can be fixed after the fact, so I'm only mentioning this for context (although if pip-run is able to use the same syntax as I'm adding to pipx, that's probably less work all round 🙂)

jaraco commented 1 year ago

What use cases are there for introspection?

The original case is that pkg_resources inspects __main__.__requires__ to activate any inactive (non-default) packages (mainly in support of environments where multiple versions were allowed to be installed, a behavior that's currently discouraged).

But I can imagine other scenarios where introspection could be useful:

a script would like to print out its requirements as part of a --verbose invocation or debugging information
a script would like to validate its requirements or pass its requirements to a tool for validation
a script would like to resolve where its requirements were installed (for triage and debugging) and to do that needs to know what its requirements are
a wrapper for the script may wish to perform analysis on the settings

tests for the script (including doctests) may make assertions about the requirements. For example, a script might add this doctest:

"""
This script must have no requirements.

>>> __requires__
[]

This script must not depend on lxml.
>>> any('lxml' in req for req in __requires__)
False
"""

Is it necessary to allow -i https://corp.invalid/index?

I'm not clear why being able to specify an alternative index is important. In my view, "what index to use" is a property of the environment, not of the script, so it doesn't actually belong in the script metadata.

The use-case that motivated this behavior is one where the script itself has dependencies it knows cannot be satisfied from PyPI and wishes to communicate that in its requirements. It's for the same reason that pip supports -i in a requirements.txt file. It's not always the case that the index is a property of the environment. In some cases (like a corporate environment), the index may be used universally for all pip operations and so can be configured in the environment. And there are other cases where the user should select the index at invocation time. But there are uncontrived cases where the script itself is the most appropriate factor in the consideration of the index.

Consider for example a repro that demonstrates a fix with privately-published packages:

__requires__ = ['cherrypy', 'cheroot==6.5.6.dev28+g9413ed9c']
__index_url__ = 'https://m.devpi.net/jaraco/dev'

# ... demonstrate some behavior only present in cheroot 6.5.6.dev28+g9413ed9c

Requiring the index to be supplied separately from such a script violates a key principle of pip-run: invocations should be one-liners or should be executed as a script. pip-run is trying to get away from the multi-step instructions required for invocation.

Is the command line syntax really that familiar?

Yes. The syntax comes directly from pip (and is passed directly to pip), likely the most used command-line in Python.

what about the Notebook use case?

The notebook use-case is implemented (#45, example).

Of course, the notebook use-case could parse # requirements comments instead of using Python syntax, although Notebook authors might not want comments but would prefer a text block to declare requirements if they're not Python.

jaraco commented 1 year ago

Since there's some uncertainty about the future of pip-run, here's what I plan to do - I'll first implement option 1+4, so either format is supported. That way there's compatibility with the proposed pipx format as well as backward compatibility for the existing format. This dual-mode will give the formats a chance to compete for mind share and after some time perhaps one can be deprecated and retired.
