phylum-dev / cli

Command line interface for the Phylum API
https://phylum.io
GNU General Public License v3.0
102 stars 10 forks source link

Python package names not normalized #1274

Closed maxrake closed 10 months ago

maxrake commented 10 months ago

Overview

Python package names are not normalized during parsing and lockfile generation. This results in situations where repeat entries are included (like #1273) in parse results and "new" entries created during lockfile generation that only differ in non-normalized name format.

How To Reproduce

Steps to reproduce this behavior for parsing:

Details (click to expand)

``` ❯ phylum --version phylum v5.8.0 ❯ cat requirements.txt pyyaml==5.3.1 PyYAML==5.3.1 ❯ phylum parse --lockfile-type pip requirements.txt [ { "name": "pyyaml", "version": "5.3.1", "type": "pypi", "lockfile": "requirements.txt" }, { "name": "PyYAML", "version": "5.3.1", "type": "pypi", "lockfile": "requirements.txt" } ] ``` Both packages are expected to normalize to `pyyaml` and be deduplicated (after #1273 is complete).


Steps to reproduce this behavior for lockfile generation:

Details (click to expand)

``` ❯ phylum --version phylum v5.8.0 ❯ cat requirements.txt pyyaml Ruamel-YAML-CLib ❯ phylum parse --lockfile-type pip requirements.txt Generating lockfile for manifest "requirements.txt" using Pip… [ { "name": "PyYAML", "version": "6.0.1", "type": "pypi", "lockfile": "requirements.txt" }, { "name": "ruamel.yaml.clib", "version": "0.2.8", "type": "pypi", "lockfile": "requirements.txt" } ] ``` The expected package names are `pyyaml` and `ruamel-yaml-clib`, respectively.


This becomes more of a problem when looking for new dependencies (like in phylum-ci) and some packages are reported as new even though they aren't (they only differ in pre-normalized name, but post-normalization would show the package as unchanged).

Details (click to expand)

What the file looks like in `main`: ``` ❯ phylum --version phylum v5.8.0 ❯ cat requirements.txt pyyaml==6.0.1 ❯ phylum parse --lockfile-type pip requirements.txt [ { "name": "pyyaml", "version": "6.0.1", "type": "pypi", "lockfile": "requirements.txt" } ] ``` Changing the file in a branch to reference another requirements file: ``` ❯ touch requirements-dev.txt ❯ vim requirements-dev.txt ❯ cat requirements-dev.txt pillow==10.1.0 ❯ vim requirements.txt ❯ cat requirements.txt pyyaml==6.0.1 # Add an entry that turns this into a manifest # or otherwise trigger lockfile generation. --requirement requirements-dev.txt ❯ phylum parse --lockfile-type pip requirements.txt Generating lockfile for manifest "requirements.txt" using Pip… [ { "name": "PyYAML", "version": "6.0.1", "type": "pypi", "lockfile": "requirements.txt" }, { "name": "Pillow", "version": "10.1.0", "type": "pypi", "lockfile": "requirements.txt" } ] ``` The `PyYAML@6.0.1` package shows as new/different in `phylum-ci` when putting up a PR for the branch even though it is exactly the same as the `pyyaml@6.0.1` package in `main`. This will result in unexpected "added" packages.

Expected Behavior

Python package names are normalized everywhere they are obtained so results from the CLI are consistent, regardless of the input format.

Additional Context

Python package names are provided in various forms of uppercase and/or using mixed ., -, _.

To normalize a distribution name for comparison purposes, it should be lowercased
with all runs of the characters ., -, or _ replaced with a single - character.
This can be done using the following snippet of code (as specified in PEP 503):

re.sub(r"[-_.]+", "-", name).lower()

References:

The same issue likely also exists for other ecosystems besides Python.

Maybe it is possible to use the purl crate for this.

It is believed that the Python package names should be normalized in at least these locations:

Details

``` ❯ git diff diff --git a/lockfile/src/parsers/pypi.rs b/lockfile/src/parsers/pypi.rs index 4064e3b..2912610 100644 --- a/lockfile/src/parsers/pypi.rs +++ b/lockfile/src/parsers/pypi.rs @@ -82,6 +82,7 @@ fn package<'a>(input: &'a str, registry: Option<&str>) -> IResult<&'a str, Packa } let (input, name) = package_name(input)?; + // TODO: Normalize this name let name = name.trim().to_string(); // Parse URI versions like files/git/etc. diff --git a/lockfile_generator/src/pip.rs b/lockfile_generator/src/pip.rs index d26ff8f..1ceece5 100644 --- a/lockfile_generator/src/pip.rs +++ b/lockfile_generator/src/pip.rs @@ -97,6 +97,7 @@ impl Generator for Pip { // Create the virtual requirements.txt lockfile. let mut lockfile = String::new(); for package in report.install { + // TODO: Normalize this name let name = package.metadata.name; if package.is_direct { let _ = writeln!(lockfile, "{} @ {}", name, package.download_info.url); ```

cd-work commented 10 months ago

1273 has been dismissed, is this even still useful without it?

maxrake commented 10 months ago

1273 has been dismissed, is this even still useful without it?

I still think so. Here is another example that shows how the results of parsing the same input package results in different names, where the only difference is whether it gets parsed as a lockfile or manifest.

❯ phylum --version
phylum v5.8.1

❯ cat requirements.txt
RuAmEl...YAML---CLib==0.2.8

❯ phylum parse --lockfile-type pip requirements.txt
[
  {
    "name": "RuAmEl...YAML---CLib",
    "version": "0.2.8",
    "type": "pypi",
    "lockfile": "requirements.txt"
  }
]

❯ vim requirements.txt

❯ cat requirements.txt
RuAmEl...YAML---CLib

❯ phylum parse --lockfile-type pip requirements.txt
Generating lockfile for manifest "requirements.txt" using Pip…
[
  {
    "name": "ruamel.yaml.clib",
    "version": "0.2.8",
    "type": "pypi",
    "lockfile": "requirements.txt"
  }
]

The expectation is that name of the parsed package is normalized and will be the same in both cases...ruamel-yaml-clib in this case.

kylewillmon commented 10 months ago

My concern here is that "normalization" is not the correct behavior. As can be seen from your example:

❯ phylum parse --lockfile-type pip requirements.txt
Generating lockfile for manifest "requirements.txt" using Pip…
[
  {
    "name": "PyYAML",
    "version": "6.0.1",
    "type": "pypi",
    "lockfile": "requirements.txt"
  },
  {
    "name": "Pillow",
    "version": "10.1.0",
    "type": "pypi",
    "lockfile": "requirements.txt"
  }
]

These package names are provided by pip because lockfile generation happened, but they are not "normalized". Instead, they are the canonical names for these packages... The normalization rules that you provide are only used for deduplication.

The canonical name for a package can only be known by looking up that package in the registry. Which leaves us with two options:

  1. Output the normalized name instead of the canonical name
  2. Make network requests to the registry to find the canonical name

Option 2 is clearly outside the scope of a lockfile parser and Option 1 leads to results that are technically incorrect...

And just like #1273, this issue really only exists when using requirements.txt files as lockfiles... Because any lockfile generator should convert package names to the canonical name.

So unless there is an option that I am missing, I don't think there is any reasonable action that we can take to resolve this issue.

maxrake commented 10 months ago

Okay...points taken. This issue will be closed.