pypa / packaging

Core utilities for Python packages
https://packaging.pypa.io/
Other
624 stars 251 forks source link

migrating from distlib.wheel: `packaging.utils.parse_wheel_filename` allows illegal platform tags #577

Open asottile-sentry opened 2 years ago

asottile-sentry commented 2 years ago

I'm looking to migrate a caller from utilizing distlib.wheel to instead use packaging.utils.parse_wheel_filename and my tests encountered an edge case which isn't handled by the packaging version

here's the distlib code: https://github.com/pypa/distlib/blob/91aa92e64ee5d5005901907eaa340a72f18d7212/distlib/wheel.py#L83

packaging's parse_tag doesn't validate the interpreter in any way -- whereas distlib required a particular structure

the test case in question: https://github.com/chriskuehl/dumb-pypi/blob/b3fc62c6519bdb84960d2b491b40ad065fb79a47/tests/main_test.py#L61

playlyfe-0.1.1-2.7.6-none-any.whl should not parse as a valid wheel, however packaging allows it (and parses as three separate tag-triplets: 2-none-any, 7-none-any, 6-none-any)

I propose adding a little bit of validation to parse_tag to check that the tags are supported interpreter versions

I believe this involves something like:

_INTERPRETER_RE = re.compile(fr'^(?:{"|".join(sorted(INTERPRETER_SHORT_NAMES.values()))}\d+')

and then utilizing that regex to validate the names (and throwing an appropriate exception) -- I think a new InvalidTag (extending ValueError) or somesuch exception would need to be created as well (similar to the ones in packaging.utils for the wheel / stdist filename)

wanted to propose this first before diving in and implementing it to save time -- thoughts?

ptmcg commented 2 years ago

Missing a trailing ')' in your re, I assume after the closing "}".

_INTERPRETER_RE = re.compile(fr'^(?:{"|".join(sorted(INTERPRETER_SHORT_NAMES.values()))})\d+')

If you are building an re of values from some other source, sorting by descending length will be more likely to give you a unique match, in the case where some short value is a prefix of some longer value.

...sorted(INTERPRETER_SHORT_NAMES.values(), key=len, reverse=True)...

Though probably not an issue in this small group of values (after looking them up in the code).

brettcannon commented 2 years ago

Unfortunately the best we could do based on my reading of the spec (obviously let me know if I missed something) is test whether the name is normalized as required by https://packaging.python.org/en/latest/specifications/binary-distribution-format/#escaping-and-unicode . Otherwise the interpreter short names are not guaranteed to be the only valid interpreter names, just convenient ones (e.g. Pyston is not covered by that list). We can't make any assumptions as to what interpreter name is considered valid, and so 2.7.6 as a compressed tag of silly interpreter names is still valid.

I can totally see having a higher-level function in another library that requires expected interpreter tags, but as such a low-level library I don't think we can be that pragmatic about what is usually expected and have to stick to the spec in this instance.

asottile-sentry commented 2 years ago

PEP 425 adds more color: https://peps.python.org/pep-0425/#python-tag -- that it should be implementation name and then version (or version with an underscore) -- so I believe \w+[0-9][0-9_]* would be a required pattern for a tag

asottile commented 2 years ago

@brettcannon should this be reconsidered given above?

brettcannon commented 2 years ago

I'm okay with re-opening it, but whether this is a feature request or a bug I'm still torn about. For instance, while PEP 425 says "The version is py_version_nodot," that's entirely CPython-specific as to what that means. As such, do we get to say that it must follow https://peps.python.org/pep-0440/#version-scheme ? What about trailing underscores like your regex allows?

As well, "Other Python implementations should use sys.implementation.name" puts no actual restriction on the format of the name, i.e. there's nothing stopping it from starting with a number. Since the Python interpreter tag is not designed to be parsed but to be an opaque string for code to compare against while still being human-readable makes all of this rather murky.

pradyunsg commented 2 years ago

"Other Python implementations should use sys.implementation.name" puts no actual restriction on the format of the name

It actually does! From https://peps.python.org/pep-0421/:

name A lower-case identifier representing the implementation. Examples include ‘pypy’, ‘jython’, ‘ironpython’, and ‘cpython’.

Based on the discussion so far, I reckon requiring it to either be a f"{shortname}{numbers}" for a shortname we recognise, or an identifier is a reasonable restriction.