giterlizzi / perl-URI-PackageURL

Perl extension for Package URL
Artistic License 2.0

[Draft] Package URL specifications for CPAN Packages #8

Open giterlizzi opened 1 year ago

giterlizzi commented 1 year ago

Package URL

A Package URL (aka "purl") is a URL string used to identify and locate a software package in a mostly universal and uniform way across programming languages, package managers, packaging conventions, tools, APIs and databases.

https://github.com/package-url/purl-spec

A purl is a URL composed of seven components:

scheme:type/namespace/name@version?qualifiers#subpath

Components are separated by a specific character for unambiguous parsing.
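
To illustrate how the separators make parsing unambiguous, here is a minimal Python sketch that splits a purl string into the seven components. It is not the URI::PackageURL implementation; the names parse_purl and _split_right are made up for this example, and it skips most of the normalization and validation a real parser needs.

```python
from urllib.parse import unquote


def _split_right(text, sep):
    """Split once on the last `sep`; return (text, "") when `sep` is absent."""
    head, found, tail = text.rpartition(sep)
    return (head, tail) if found else (text, "")


def parse_purl(purl):
    """Break a purl string into its seven components (hypothetical helper)."""
    # Peel off the optional trailing parts first: '#subpath', then '?qualifiers'.
    rest, subpath = _split_right(purl, "#")
    rest, query = _split_right(rest, "?")

    # The scheme is everything before the first ':' and must be 'pkg'.
    scheme, found, rest = rest.partition(":")
    if not found or scheme.lower() != "pkg":
        raise ValueError("a purl must start with the 'pkg:' scheme")

    # The type is the first path segment.
    ptype, found, rest = rest.strip("/").partition("/")
    if not found or not rest:
        raise ValueError("a purl needs at least a type and a name")

    # The version follows the last '@'; the name is the last path segment,
    # and whatever precedes it is the namespace.
    rest, version = _split_right(rest, "@")
    namespace, _, name = rest.rpartition("/")

    qualifiers = {}
    for pair in filter(None, query.split("&")):
        key, _, value = pair.partition("=")
        qualifiers[key.lower()] = unquote(value)

    return {
        "scheme": "pkg",
        "type": ptype.lower(),
        "namespace": unquote(namespace),
        "name": unquote(name),
        "version": unquote(version),
        "qualifiers": qualifiers,
        "subpath": subpath.strip("/"),
    }
```

Applied to pkg:cpan/GDT/URI-PackageURL@2.02, this sketch yields type "cpan", namespace "GDT", name "URI-PackageURL" and version "2.02"; components that are absent come back as empty strings.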

The definition for each component is:

Package URL for CPAN Packages

Components

Minimal components:

Optional (but advised) components:

Qualifiers

Optional qualifiers may include:

Extras

Examples

Minimal "purl" string:

pkg:cpan/libwww-perl
pkg:cpan/Perl-Version@1.013
pkg:cpan/DateTime@1.55

"purl" string with namespace (author) component:

pkg:cpan/GDT/URI-PackageURL@2.02
pkg:cpan/SRI/Mojolicious@9.35

"purl" string with repository_url qualifier:

pkg:cpan/SRI/Mojolicious@9.35?repository_url=backpan.perl.org

"purl" string with vcs_url qualifier:

pkg:cpan/GDT/URI-PackageURL@2.02?vcs_url=git://github.com/giterlizzi/perl-packageurl.git

sjn commented 11 months ago

Hei!

I've been mulling over this ticket for a while now, and here are a couple of thoughts for your consideration. Please note that some of this is produced from memory, so it's possible that I may be mistaken on some points – please correct me if you find something wrong! (thank you :grin:)

PURL usage scenarios

  1. PURLs used to specify any dependency requirements (as opposed to dependency resolutions), including alternative PURLs for the same dependency made available in different packaging ecosystems. (The implementation of this isn't URI::PackageURL-specific, though)
    1. E.g. the following should eventually be possible: cpanm pkg:cpan/SRI/Mojolicious
    2. This means that a PURL should by default be resolvable to the common case, and that it should be possible to convert common-case URLs into the corresponding PURLs.
  2. PURLs used to specify dependency resolutions (as opposed to requirements), but limited to what was actually deployed, pinned or packaged.
    1. This means that a PURL should be able to contain all the information necessary to correctly resolve to the package that was actually downloaded.
  3. The PURLs may refer to internal/private package indexes or repositories, including company CPAN mirrors, internal APT or RPM repositories, or other off-limits download locations, that are supported by the relevant tooling.
    1. Correspondingly, it should be possible to create a correct PURL from an internal download location like these.

Terminology

  1. Within the CPAN space, the following is a distribution name – SRI/Mojolicious-9.35.tar.gz
    1. A distribution name must contain the author's CPAN id, since it's possible for different people to make releases for the same distribution (!).
  2. The following is a module name – Mojo::Base
    1. A distribution contains one or more modules, but not necessarily in the same namespace as indicated by the distribution name.
    2. Proposal: When referring to a module, the PackageURL must use the keyword module in the namespace part of the PURL. This is to avoid namespace collisions between CPAN ids and module names: pkg:cpan/module/Foo::Bar
  3. The naming resolution from module to distribution is indexed in the 02packages.details.txt files on your mirror.
    1. This resolution is expected to be managed by the tooling used for downloading, unpacking, preprocessing, building, testing, and installing, e.g. cpan, cpanm, cpm and cpanp. Some tooling uses these indirectly, e.g. carton, carmel and dh-make-perl, or even CPAN mirror software like Pinto, CPAN::Mini or App::opan.
  4. When we refer to a 'package' we mean the module namespace specifically - even if it is defined in a file which doesn't match the module name.
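
As a rough illustration of the module-to-distribution lookup described above, here is a Python sketch that reads index rows in the 02packages.details.txt layout (package name, version, distribution path) and turns a module name into a distribution PURL. The sample rows are illustrative, not copied from a real mirror, and load_index / module_to_purl are hypothetical helper names. The file name regex is a simplification; real-world distribution file names are messier than this.

```python
import re

# Illustrative excerpt in the 02packages.details.txt layout: a header block,
# a blank line, then "package  version  path" rows. Not real index contents.
SAMPLE_INDEX = """\
File:         02packages.details.txt
Description:  Package names found in directory $CPAN/authors/id/

Mojo::Base         undef  S/SR/SRI/Mojolicious-9.35.tar.gz
Mojolicious        9.35   S/SR/SRI/Mojolicious-9.35.tar.gz
"""

# Simplified pattern for "Dist-Name-1.23.tar.gz" style archive names.
DIST_FILE = re.compile(
    r"^(?P<name>.+)-v?(?P<version>\d[\w.]*)\.(?:tar\.gz|tgz|tar\.bz2|zip)$"
)


def load_index(text):
    """Map package name -> distribution path, skipping the header block."""
    _header, _, body = text.partition("\n\n")
    index = {}
    for line in body.splitlines():
        fields = line.split()
        if len(fields) == 3:
            package, _version, path = fields
            index[package] = path
    return index


def module_to_purl(module, index):
    """Resolve a module name to a distribution PURL with author and version."""
    path = index[module]  # raises KeyError if the module is not indexed
    parts = path.split("/")
    author, filename = parts[-2], parts[-1]
    match = DIST_FILE.match(filename)
    if match is None:
        raise ValueError(f"cannot split distribution file name: {filename}")
    return f"pkg:cpan/{author}/{match['name']}@{match['version']}"
```

With the sample rows above, module_to_purl("Mojo::Base", load_index(SAMPLE_INDEX)) gives pkg:cpan/SRI/Mojolicious@9.35, i.e. a module in one namespace resolves to a distribution named after another.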

SBOM Use

  1. After module name resolution:
    1. The module files that are installed from a distribution are "stored" (lol) in .packlist and perllocal.pod files throughout the designated installation tree. These are less than ideal for figuring out the pedigree of an installed module.
  2. If a distribution is "installed" into a directory destined for inclusion into another packaging ecosystem (e.g. a dir that becomes part of a .deb package used by APT), it's common to just delete these files.
  3. With the new demands for SBOM files, we should expect that one SBOM file per distribution will be made, and stored somewhere. (At the time of this writing, this is unclear).

Sources

(Updated 2024-01-19)

sjn commented 11 months ago

Related, NIST has published a Software Identification Ecosystem Option Analysis where they talk a little about the contexts where PackageURLs may be used. Very useful reflections, and recommended reading.

They specifically look for something they call "Grouping", which they for some reason claim is a "missing feature" in purls. (I may have misunderstood something here).

Not sure of its relevance for this module either, but the idea is out there, so it's possibly necessary to consider.

sjn commented 9 months ago

Having thought a little more about this, I'm currently considering the following proposals....

  1. PackageURLs have at least two distinct "purposes" that would benefit from having separate API methods.
    1. A method for producing a "fully resolved" PURL, to be used to uniquely identify a specific package that has been used. This should include as much information as possible, including the hostname/repository URL used when downloading the package, its resolved version, and if possible, a sha256 checksum of the package. It should be possible to resolve this same PURL to a valid download URL that the user can use to confirm that the package downloaded is (still) the same as the one published.
      1. The "fully resolved" PURL must be in the form of pkg:cpan/AUTHOR/Dist-Name@1.0?repo_url=…etc.
    2. A method for producing a "minimal" PURL, that is to be used for referring to CPAN package dependencies before they are resolved during a build stage.
      1. The "minimal" PURL may refer to a CPAN Distribution name <pkg:cpan/AUTHOR/Foo-Bar> OR a CPAN Module name <pkg:cpan/module/Foo::Bar>, at the package author's discretion.
      2. When referring to a module name, the PURL must have the word "module" in the namespace field, in order to distinguish between modules that are all uppercase (e.g. CGI) and CPAN author ids that are identical to published module names.
  2. When package URLs are resolved, we should expect the client software to allow for any number of Package URLs to the same component to be listed as a dependency, and filter and pick the right one as needed.
    1. E.g. if a CPAN distro depends on Foo::Bar, it may list the following dependencies, and the build tool may shell out the task of installation to any viable alternative, depending on preference or policy (e.g. by having a --prefer-cpan parameter that makes the tool prioritize downloading dependencies from CPAN instead of shelling out to apt install libfoo-bar-perl on a Debian system):
      • pkg:cpan/module/Foo::Bar
      • pkg:apt/debian/libfoo-bar-perl
      • pkg:rpm/opensuse/foo-bar-perl
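
The "filter and pick the right one" step could look roughly like the Python sketch below. It is only an illustration of the idea: pick_purl and purl_type are hypothetical names, and the ordered list of preferred types stands in for a policy switch such as the --prefer-cpan parameter mentioned above.

```python
def purl_type(purl):
    """Return the type segment of a purl, e.g. 'cpan' for 'pkg:cpan/...'."""
    scheme, _, rest = purl.partition(":")
    if scheme != "pkg":
        raise ValueError(f"not a purl: {purl!r}")
    return rest.lstrip("/").split("/", 1)[0].lower()


def pick_purl(alternatives, preferred_types):
    """Return the alternative whose type ranks highest in preferred_types.

    Returns None when no alternative matches any preferred type.
    """
    for wanted in preferred_types:
        for purl in alternatives:
            if purl_type(purl) == wanted:
                return purl
    return None


# The three alternatives listed above for the same dependency.
ALTERNATIVES = [
    "pkg:cpan/module/Foo::Bar",
    "pkg:apt/debian/libfoo-bar-perl",
    "pkg:rpm/opensuse/foo-bar-perl",
]
```

A tool preferring CPAN would call pick_purl(ALTERNATIVES, ["cpan", "apt", "rpm"]), while a Debian packaging tool might pass ["apt", "cpan"] and get the APT package instead.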

I guess I'm pretty much echoing what you've already proposed, with the difference of explicitly adding "module" (in lowercase) to the PURL, to make it easily distinguishable from the CPAN ids in distribution names, which have to be in uppercase; and making a point of having separate API methods that produce each of these explicitly.

So, with this I've been trying to think about it from an "independent" starting point, and basically ended up where you and @mrdvt92 in #2 have arrived.
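
To make the two-method idea concrete, here is a small Python sketch for distribution PURLs. It is an assumption-laden illustration, not the URI::PackageURL API: minimal_purl and resolved_purl are hypothetical names, the repository_url qualifier is taken from the examples in the first comment, and the checksum qualifier with a sha256: prefix follows the general purl qualifier convention.

```python
import hashlib
from urllib.parse import quote


def minimal_purl(author, dist):
    """PURL for an unresolved dependency: author (CPAN id) and distribution only."""
    return f"pkg:cpan/{author.upper()}/{dist}"


def resolved_purl(author, dist, version, repository_url=None, archive=None):
    """PURL for a resolved package: adds version and, when known, qualifiers.

    `archive` is the raw bytes of the downloaded tarball; if given, its
    sha256 digest is recorded in the `checksum` qualifier.
    """
    qualifiers = {}
    if repository_url:
        qualifiers["repository_url"] = repository_url
    if archive is not None:
        qualifiers["checksum"] = "sha256:" + hashlib.sha256(archive).hexdigest()

    purl = f"{minimal_purl(author, dist)}@{version}"
    if qualifiers:
        # Qualifiers are emitted sorted by key; ':' and '/' are left unescaped
        # here for readability (an assumption of this sketch).
        query = "&".join(
            f"{key}={quote(value, safe=':/')}"
            for key, value in sorted(qualifiers.items())
        )
        purl += "?" + query
    return purl
```

For example, resolved_purl("SRI", "Mojolicious", "9.35", repository_url="backpan.perl.org") reproduces the repository_url example from the first comment, while minimal_purl("SRI", "Mojolicious") gives the pre-resolution form.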

So for whatever it's worth, I'm happy to stand behind what's here, plus the perspectives in #2. :smiley_cat:

sjn commented 9 months ago

@giterlizzi, I just learned that the PackageURL spec author is working on getting it registered as an ECMA standard. Maybe it's time to get the CPAN bits included?

source: https://youtu.be/B2bVaaeqpAk?si=c7cdfDZCEJkucOic&t=623

sjn commented 9 months ago

By the way!

When it comes to specifying (pre-resolution) dependencies, there's a version-range spec for purl. Should we adopt this at the same time, while we're at it?

https://github.com/package-url/purl-spec/tree/version-range-spec

giterlizzi commented 9 months ago

Maybe it's time to get the CPAN bits included?

Yes, I think we can start validating the specification described in the first comment (Components and Qualifiers) and open a PR to include it in https://github.com/package-url/purl-spec/blob/master/PURL-TYPES.rst

sjn commented 9 months ago

Apparently, there's a pull request open already at https://github.com/package-url/purl-spec/pull/155 - maybe worth updating?

Also, I expect to meet the purl author, Philippe Ombredanne, in Brussels tomorrow. If you want, I can ask him what's needed to get this PR merged?

giterlizzi commented 9 months ago

Apparently, there's a pull request open already at package-url/purl-spec#155 - maybe worth updating?

If you agree I would modify it like this:

cpan

cpan for CPAN Perl packages:

Also, I expect to meet the purl author, Philippe Ombredanne, in Brussels tomorrow. If you want, I can ask him what's needed to get this PR merged?

It would be great. Thank you!

mrdvt92 commented 9 months ago

The name is the module or distribution name and is case sensitive.

The more I think about it, I believe only CPAN distributions should be supported and not modules or packages.

  1. A module (a .pm file) does not have a 1:1 relationship to a package. A module is a single file with zero or more packages inside it.
  2. A single module can be provided by multiple distributions.
  3. A package version does not have to be updated for each distribution.
  4. Modules do not technically have versions. A package can have a version but doesn't have to have a version.

I propose to only use /dist/ to match the MetaCPAN URL, e.g. https://metacpan.org/dist/Perl-Version

pkg:cpan/dist/Perl-Version@1.013

If we really must use modules, does each module in a distribution need to be specified?
Since modules don't really have versions, is a checksum=sha:XXXXXX signature mandatory?

sjn commented 9 months ago

  • The namespace is optional; it may be used to specify the author name and it must be uppercased.

Aaah, no, let's NOT word it like this. Instead, I propose this -

* To refer to a CPAN distribution name, the namespace MUST be present. In this case, the namespace is the CPAN id of the author/publisher. It MUST be written uppercase, followed by '/' and then followed by the distribution name. A distribution name may NEVER contain the string '::'.

* To refer to a CPAN module, the namespace MUST be absent. The module name MAY contain zero or more '::' strings, and the module name MUST NOT contain a '-'.

Correct examples:

pkg:cpan/Perl::Version@1.013
pkg:cpan/DROLSKY/DateTime@1.55 (distribution name)
pkg:cpan/DateTime@1.55 (module name)
pkg:cpan/GDT/URI-PackageURL
pkg:cpan/LWP::UserAgent
pkg:cpan/OALDERS/libwww-perl@6.76
pkg:cpan/URI (module name)

Incorrect syntax examples:

pkg:cpan/Perl-Version@1.013
pkg:cpan/DateTime@1.55
pkg:cpan/GDT/URI::PackageURL
pkg:cpan/LWP-UserAgent
pkg:cpan/OALDERS/

sjn commented 9 months ago

pkg:cpan/dist/Perl-Version@1.013

If we really must use modules, does each module in a distribution need to be specified?

Modules do have versions (see https://www.cpan.org/modules/02packages.details.txt for documentation). When using a PackageURL to refer to a module, the intention is for an ecosystem-specific tool to resolve which distribution a specific module belongs to. This is already what happens when running cpanm Foo::Bar – the tool downloads 02packages.details.txt and does a lookup there to figure out which distribution to download. This lookup works with packages (defined as namespaces, of which you may have one or more inside a .pm file), with modules (defined as a .pm file with a single package namespace matching the file name), and with distributions (a tarball containing one or more modules or packages).

Note also that a distribution name MUST contain the author's CPAN id to be valid! That's why I'm insisting that a PackageURL referring to a dist must also live up to this. (The reason is that it's possible for several authors to make releases of the same distribution, and this lets users later specify which of them they want.)

giterlizzi commented 9 months ago

  • The namespace is optional; it may be used to specify the author name and it must be uppercased.

Aaah, no, let's NOT word it like this. Instead, I propose this -

* To refer to a CPAN distribution name, the namespace MUST be present. In this case, the namespace is the CPAN id of the author/publisher. It MUST be written uppercase, followed by '/' and then followed by the distribution name. A distribution name may NEVER contain the string '::'.

* To refer to a CPAN module, the namespace MUST be absent. The module name MAY contain zero or more '::' strings, and the module name MUST NOT contain a '-'.

Correct examples:

pkg:cpan/Perl::Version@1.013
pkg:cpan/DROLSKY/DateTime@1.55 (distribution name)
pkg:cpan/DateTime@1.55 (module name)
pkg:cpan/GDT/URI-PackageURL
pkg:cpan/LWP::UserAgent
pkg:cpan/OALDERS/libwww-perl@6.76
pkg:cpan/URI (module name)

Incorrect syntax examples:

pkg:cpan/Perl-Version@1.013
pkg:cpan/DateTime@1.55
pkg:cpan/GDT/URI::PackageURL
pkg:cpan/LWP-UserAgent
pkg:cpan/OALDERS/

I agree!

giterlizzi commented 9 months ago

@sjn I have added an initial check for the "cpan" purl type

purl-tool pkg:cpan/GDT/URI::PackageURL
ERROR: Invalid Package URL: CPAN 'name' must have the distribution name

purl-tool pkg:cpan/URI-PackageURL
ERROR: Invalid Package URL: CPAN 'name' must have the module name

purl-tool pkg:cpan/G::DT/URI::PackageURL
ERROR: Invalid Package URL: CPAN 'namespace' must have the distribution author
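
For readers who want to see the agreed rules spelled out, here is a rough Python sketch of such a check. It is not the purl-tool implementation: check_cpan_purl is a hypothetical function, the error strings simply mirror the output shown above, and the CPAN_ID pattern (uppercase letters, digits and hyphens) is an assumption about what a CPAN id looks like.

```python
import re

# Assumed shape of a CPAN author id: uppercase letters, digits, hyphens.
CPAN_ID = re.compile(r"^[A-Z0-9-]+$")


def check_cpan_purl(purl):
    """Return an error message for an invalid 'cpan' purl, or None if it passes."""
    prefix = "pkg:cpan/"
    if not purl.startswith(prefix):
        return "not a 'cpan' Package URL"

    body = purl[len(prefix):]
    body = body.split("#", 1)[0].split("?", 1)[0]  # drop subpath and qualifiers
    body = body.rsplit("@", 1)[0]                  # drop the version, if any
    namespace, slash, name = body.rpartition("/")

    if slash:
        # Namespace present: this must be AUTHOR/Distribution-Name.
        if not CPAN_ID.match(namespace):
            return "CPAN 'namespace' must have the distribution author"
        if not name or "::" in name:
            return "CPAN 'name' must have the distribution name"
    else:
        # Namespace absent: this must be a module name such as Foo::Bar.
        if not name or "-" in name:
            return "CPAN 'name' must have the module name"
    return None
```

Note that a purely syntactic check like this cannot reject pkg:cpan/DateTime@1.55: without a namespace, a name with neither '::' nor '-' always reads as a module name.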

sjn commented 9 months ago

If we can get a purl-spec PR for this made, we can have it merged lunchtime today! 🤩

giterlizzi commented 9 months ago

If we can get a purl-spec PR for this made, we can have it merged lunchtime today! 🤩

:smiley:

Changed the specification.


cpan

cpan for CPAN Perl packages:

sjn commented 9 months ago

Great! Do you have a PR link I can refer to?

giterlizzi commented 9 months ago

This is the new PR https://github.com/package-url/purl-spec/pull/288

sjn commented 9 months ago

One question;

Is it really necessary to mention MetaCPAN at all?

giterlizzi commented 9 months ago

One question;

Is it really necessary to mention MetaCPAN at all?

You mean this?

To search CPAN it is recommended to use https://metacpan.org.

sjn commented 9 months ago

Congratulations on getting this merged into the spec! :-D

Now the work starts with getting purls supported in other parts of the Perl/CPAN toolchain!

(btw, I've tried to reach out to you on twitter/x; are there better channels for reaching you?)