ossf / osv-schema

Open Source Vulnerability schema.
https://ossf.github.io/osv-schema/
Apache License 2.0

Proposal: Supporting Maven registries #208

Closed oliverchang closed 5 months ago

oliverchang commented 10 months ago

The current Maven ecosystem definition is "The Maven Java package ecosystem. The name field is a Maven package name.", which is a little vague.

We should clarify that this refers to Maven Central by default, and if a different registry is required, allow the ecosystem to be "Maven:<registry>", similar to the Linux distro definitions (e.g. "Debian:7").

The default (Maven Central) should always be used/preferred where it makes sense.

oliverchang commented 10 months ago

@darakian @cuixq thoughts?

cuixq commented 10 months ago

I think this makes sense, and here are a few things about this in my mind:

rhalar commented 10 months ago

Note that there may also be different targets per version for one repository, on some Maven packages, e.g.

https://mvnrepository.com/artifact/co.fs2/fs2-io

darakian commented 10 months ago

The default (Maven Central) should always be used/preferred where it makes sense.

100% agree. Perhaps it makes sense to add an optional field to the OSV payload to define a registry, e.g.

{
  "package": {
    "ecosystem": "Maven",
    "name": "io.netty:netty-handler",
    "registry": "mycoolregistry.com"
  }
}

A missing field would be equivalent to "registry": "https://repo.maven.apache.org/maven2". I think the registry term should refer to the directory listing of packages, but maybe that's up for discussion.

The downside to this approach is that we could not have a single advisory that applies to a coherent version range uploaded to multiple registries, e.g. https://mvnrepository.com/artifact/org.apache.activemq/activemq-core
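As a rough illustration of the defaulting rule described in this comment, here is a minimal Python sketch. The "registry" field is only the proposal under discussion, not part of the OSV schema, and the helper name effective_registry is made up for this example.

```python
# Hypothetical sketch: the "registry" field is the proposal from this thread,
# not an existing OSV schema field. effective_registry is an illustrative name.

MAVEN_CENTRAL = "https://repo.maven.apache.org/maven2"


def effective_registry(package: dict) -> str:
    """Return the registry a package object refers to.

    A missing "registry" key is treated as Maven Central, as suggested above.
    """
    return package.get("registry", MAVEN_CENTRAL)


default_pkg = {"ecosystem": "Maven", "name": "io.netty:netty-handler"}
custom_pkg = {**default_pkg, "registry": "mycoolregistry.com"}

print(effective_registry(default_pkg))
print(effective_registry(custom_pkg))
```

With this rule, existing advisories that carry no registry field keep their current meaning.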

joshbressers commented 10 months ago

It might be wise to think about this in a larger context. It's not hard to imagine a universe where there are multiple registries (not just Java) hosting the same artifacts. Containers are another easy example here: there are many registries serving up the same content.

It would be fair to set some boundaries, like requiring that the artifact name and version be consistent across repositories.

I agree with @darakian: defining a registry as a URL would allow for flexibility and some future-proofing.

joshbressers commented 8 months ago

I would like to give this a bump as it's become stale

oliverchang commented 8 months ago

Thanks for the reminder @joshbressers ! Let me write out a more detailed consideration of all the possibilities here, and their implications for both vulnerability database maintainers and vulnerability scanners.

Please let me know what everyone thinks on this!

The main two alternatives are:

1. Add a registry parameter to the package

    {
      "package": {
        "ecosystem": "Maven",
        "name": "io.netty:netty-handler",
        "registry": "mycoolregistry.com"
      }
    }

2. Use ":" to define registries, with ecosystem-specific rules.

    {
      "package": {
        "ecosystem": "Maven:https://mycoolregistry.com",
        "name": "io.netty:netty-handler"
      }
    }

There is a third alternative of defining every single Maven registry out there as its own ecosystem, but this is infeasible, as @cuixq points out in https://github.com/ossf/osv-schema/issues/208#issuecomment-1780403666.

Handling overlaps between registries

In both cases, one can of worms that multiple registries open up is conflicting packages across different registries (i.e. the same package name across different registries). This makes it difficult to determine which registry to specify in an advisory for a database maintainer, and how vulnerability scanners should behave.

Most of the ecosystems that OSV supports today have a single, default/canonical repository (i.e. Maven Central in Maven's case) where the vast vast majority of open source packages live. This should be the default and preferred registry used by both database maintainers and vulnerability scanners.

Using Maven as an example, where a registry is a (possible subset) mirror of Maven Central, users of that registry can still directly use OSV advisories that key on Maven Central. These mirrors may also contain a small number of packages where vulnerabilities are fixed faster than Maven Central (or have fixes backported to older release branches), and in these cases VEX statements should be issued by the registry to communicate these to vulnerability scanners.

When a registry diverges from Maven Central to the point that it can no longer be considered a mirror (i.e. naming and version conflicts exist), it should be considered separately. There will need to be a separate vulnerability database for this registry, and vulnerability scanners will need to explicitly opt out of Maven Central scanning for these registries. This will be very rare -- @cuixq also did some analysis a while back on Maven and found that package conflicts against Maven Central in other registries were extremely rare.
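To make the scanner side of this concrete, here is a small Python sketch of the rule described above. The RegistryConfig class, its diverges_from_central flag and the mirror URL are all hypothetical, the "Maven:<url>" key assumes the ecosystem-string form from alternative 2, and VEX handling is not modeled.

```python
from dataclasses import dataclass
from typing import List

# Unqualified "Maven" is taken to mean Maven Central.
MAVEN_CENTRAL_ECOSYSTEM = "Maven"


@dataclass(frozen=True)
class RegistryConfig:
    """Hypothetical scanner-side configuration for one Maven registry."""

    url: str
    # Set by the user when the registry can no longer be treated as a mirror.
    diverges_from_central: bool = False


def ecosystems_to_query(registry: RegistryConfig) -> List[str]:
    """Choose which advisory ecosystem a scanner would match against."""
    if registry.diverges_from_central:
        # Explicit opt-out of Maven Central: only the registry's own database.
        return [f"{MAVEN_CENTRAL_ECOSYSTEM}:{registry.url}"]
    # Mirrors (full or partial) reuse advisories keyed on Maven Central.
    return [MAVEN_CENTRAL_ECOSYSTEM]


mirror = RegistryConfig("https://mirror.example.com/maven2")
diverged = RegistryConfig("https://mycoolregistry.com", diverges_from_central=True)

print(ecosystems_to_query(mirror))
print(ecosystems_to_query(diverged))
```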

Which alternative?

Now, which alternative should we choose?

Option 1 is attractive because it's a structured field that makes parsing trivial. However, this has a major downside: Most OSV ecosystems tie the registry into the ecosystem definition. So far, Maven has been the only case where we need to support multiple registries inside the ecosystem, so this field will be redundant/confusing for most ecosystems.

@joshbressers brought up containers as another example where a registry field would be useful, but this is just one more example, set against the 22 ecosystems we have today that don't need it.

There's also a separate discussion to be had on whether it makes sense for advisories (as opposed to VEX) to be issued against containers themselves, when most vulnerability scanners today already traverse the container to discover vulnerabilities inside it.

Option 2 has the downside that it's a little bit ugly, but it is consistent with existing OSV conventions (e.g. Linux distros use ":" to qualify their releases in the ecosystem string -- "ecosystem": "Debian:10"). Regarding extensibility, if there are more general qualifiers required in the future that apply to most ecosystems, we can still consider separate fields for them.
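For reference, the parsing Option 2 asks of consumers can stay small as long as the split happens at the first ":" only, since a registry URL contains colons of its own. A minimal Python sketch (split_ecosystem is an illustrative name, not an existing OSV helper):

```python
from typing import Optional, Tuple


def split_ecosystem(ecosystem: str) -> Tuple[str, Optional[str]]:
    """Split an ecosystem string into (base name, qualifier).

    Only the first ":" separates the two parts, so URL qualifiers such as
    "https://..." stay intact. The qualifier is None when there is no ":".
    """
    base, sep, qualifier = ecosystem.partition(":")
    return base, (qualifier if sep else None)


print(split_ecosystem("Maven"))
print(split_ecosystem("Debian:10"))
print(split_ecosystem("Maven:https://mycoolregistry.com"))
```

How the qualifier is interpreted (a release for Debian, a registry URL for Maven) would remain ecosystem-specific.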

Conclusion?

Option 2 seems the most consistent with what we have today, and is what we should proceed with.

rhalar commented 8 months ago

Another alternative which seems very in line with 'Option 2' would be to encode this information in purls perhaps? It is a field which already exists in package and has qualifiers which could be used for purposes such as these. They would work even for the Debian case, and maybe for things such as https://github.com/ossf/osv-schema/issues/202?

It's unfortunate that purl fields are optional but the default could just be Maven Central. But it would have to be enforced that entries with the same ecosystem/name must have purls with a different registry qualifier.
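A rough sketch of the check I have in mind, in Python. It assumes the registry is carried in a repository_url purl qualifier and that a missing purl or qualifier means the default registry; the function names are made up for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple
from urllib.parse import parse_qs

DEFAULT_REGISTRY = ""  # stands for the ecosystem default, e.g. Maven Central


def registry_of(purl: str) -> str:
    """Extract the repository_url qualifier from a purl, if any."""
    _, _, rest = purl.partition("?")
    qualifiers, _, _ = rest.partition("#")  # drop an optional subpath
    values = parse_qs(qualifiers).get("repository_url")
    return values[0] if values else DEFAULT_REGISTRY


def conflicting_entries(packages: List[Dict[str, str]]) -> List[Tuple[str, str, str]]:
    """Return (ecosystem, name, registry) keys that occur more than once."""
    counts = Counter(
        (p["ecosystem"], p["name"], registry_of(p.get("purl", "")))
        for p in packages
    )
    return [key for key, n in counts.items() if n > 1]


entries = [
    {"ecosystem": "Maven", "name": "io.netty:netty-handler"},
    {
        "ecosystem": "Maven",
        "name": "io.netty:netty-handler",
        "purl": "pkg:maven/io.netty/netty-handler"
        "?repository_url=https%3A%2F%2Fmycoolregistry.com",
    },
]
print(conflicting_entries(entries))
```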

joshbressers commented 8 months ago

@oliverchang I think Option 2 sounds sane. There's nothing we can pick that will make everyone happy, and option 2 should catch a lot of the current examples

darakian commented 8 months ago

I think I prefer option 1 based on the aesthetics of it, but I don't have any real complaint about option 2. I do agree that we probably want to avoid having multiple registries on a single ecosystem where possible too.

@chrisbloom7 do you have any thoughts?

pombredanne commented 7 months ago

@oliverchang

Summary:

Details:

FWIW, the PURL way is to use a repository_url qualifier for things off Maven Central. It has worked nicely so far.

repository_url is an extra URL for an alternative, non-default package repository or registry. When a package does not come from the default public package repository for its type a purl may be qualified with this extra URL. The default repository or registry of a type is documented in the "Known purl types" section.
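To illustrate, here is a minimal Python sketch that builds such a purl from an OSV-style Maven name ("groupId:artifactId"). The helper name maven_purl is made up, and percent-encoding every reserved character in the qualifier value is a conservative choice on my part; the purl spec remains the authority on which characters must be encoded.

```python
from typing import Optional
from urllib.parse import quote


def maven_purl(name: str, repository_url: Optional[str] = None) -> str:
    """Build a purl for a Maven package, optionally off the default registry.

    `name` uses the OSV convention "groupId:artifactId". No version is added,
    so the purl identifies the package rather than one release.
    """
    group_id, artifact_id = name.split(":", 1)
    purl = f"pkg:maven/{quote(group_id, safe='')}/{quote(artifact_id, safe='')}"
    if repository_url:
        # Only non-default registries get the qualifier.
        purl += f"?repository_url={quote(repository_url, safe='')}"
    return purl


print(maven_purl("io.netty:netty-handler"))
print(maven_purl("io.netty:netty-handler", "https://mycoolregistry.com"))
```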

If you are not adopting PURL, I would suggest at the minimum to keep similar names for similar concepts and use "repository_url" and not "registry".

I would strongly advise against stuffing multiple attributes in a single field as in your solution 2. "ecosystem": "Maven:https://mycoolregistry.com" as this is forcing yet another layer of parsing on downstream users and tools.

I would also suggest to look beyond Maven, as alternative repositories are common for RPMs, NuGet and other package types.

But again, I would strongly suggest that you adopt PURL as a base, as many of the questions you have raised here and in https://github.com/ossf/osv-schema/issues/202 (for sources) may have been addressed in the spec and would best be addressed in PURL otherwise. Each time you adopt a different naming convention and attribute names, you are effectively making it harder for users and the community: their SCA tools and SBOMs return PURLs and they need to add a (likely lossy) translation layer between these PURL data and OSV's. Not a happy thing IMHO.

@darakian

I do agree that we probably want to avoid having multiple registries on a single ecosystem where possible too.

The way you interact with any Maven repository is always the same, and each Maven repository does not define a new type of package, just a different base package repository URL, and IMHO never a different, new "ecosystem". While the bulk of things come from the default repo

oliverchang commented 7 months ago

Thanks for the feedback @pombredanne !

Re adopting PURL as the base, it would be a rather large breaking change for the schema. PURL is absolutely supported for interop, and https://osv.dev provides conversion between PURL and OSV types in both its re-exported entries and the API, to avoid conversion pain for users.

Re fitting the attribute inside "ecosystem", it fits with our existing conventions in OSV for e.g. Linux distro ecosystems such as

      "package": {
        "ecosystem": "Debian:11",
        "name": "libgit2"
      },

Re alternate repositories in general, as discussed in https://github.com/ossf/osv-schema/issues/208#issuecomment-1880411733 it seems like for the vast majority of ecosystems, vulnerability databases would be unlikely to encode vulnerabilities outside of the main default registries. Maven happens to be an outlier, because of large alternative registries such as https://repo.jenkins-ci.org/releases/ or https://maven.google.com/

Are there examples in other such ecosystems? That would make a stronger case for adding e.g. repository_url. For cases like rpm, OSV's approach has been to encode the individual distros as separate ecosystems, as an "ecosystem" in OSV just refers to a defined namespace (typically both the packaging mechanism and the actual registry).

rhalar commented 7 months ago

I also advocated for PURLs in https://github.com/ossf/osv-schema/issues/208#issuecomment-1880710231 and I agree that would perhaps be a good course of action. It would help with a lot of things down the line too, e.g. https://github.com/ossf/osv-schema/issues/202, and my note in the same issue about different builds of e.g. PyPI packages (wheel, egg, sdist) or different builds on conan.io.

We've already encountered situations where only one of them was actually malicious/vulnerable while the others were not. Vulnerabilities for specific architectures or OSs also come up.

Also, some Maven packages for Scala have builds for different targets, e.g. https://mvnrepository.com/artifact/co.fs2/fs2-io. But vulnerabilities are reported on the base package, e.g. https://osv.dev/vulnerability/GHSA-2cpx-6pqp-wf35.

Though all of that could get resolved with a binary artifact field, as discussed in https://github.com/ossf/osv-schema/issues/202.

PURLs would perhaps allow encoding ecosystem specific information such as affected functions for crates, or affected imports and paths for Go, if that seems of interest.

darakian commented 7 months ago

I'm aligned with @oliverchang here. I don't see a reason to make defining an alternate source of packages more complicated than it needs to be. PURLs would also be inconsistent with any advisories which apply to a range of packages rather than to a single specific version.

The way you interact with any Maven repository is always the same, and each Maven repository does not define a new type of package, just a different base package repository URL, and IMHO never a different, new "ecosystem". While the bulk of things come from the default repo

Each Maven package registry does define a new package namespace though and that needs to be accounted for.

rhalar commented 7 months ago

I'm not too opposed to other solutions but I guess what we are worried about is that any new issues that pop up will necessitate further additions and new fields to the format. The PURL seems like a natural choice since it is designed to be 'ecosystem specific' in a way and it already exists in the schema.

Note that to resolve both this issue and https://github.com/ossf/osv-schema/issues/202, alterations need to be made to the package object, that's exactly the place where the purl resides :)

PURLs would also be inconsistent with any advisories which apply to a range of packages rather than to a single specific version

A PURL can identify a package only, the version component is optional. It is effectively the same solution as Option 1 or 2, just encoded differently. If you meant multiple binaries per source package, then yes, each package would need to be a separate entry.

There are other challenges in using the purl field tho, since it is currently optional. And there would need to be some kind of implicit defaults for every ecosystem.

oliverchang commented 6 months ago

Thank you all for the feedback re PURL. While the OSV-Schema absolutely supports PURL for interop (as discussed in https://github.com/ossf/osv-schema/issues/208#issuecomment-1938282957), it's not possible to move to PURL as the source of truth for identifying packages without a large breaking change requiring migration from the current data providers. There are also certain underspecified parts in the PURL spec (e.g. https://github.com/package-url/purl-spec/issues/247) that would have to be resolved.

I'll open a PR for option 2 of https://github.com/ossf/osv-schema/issues/208#issuecomment-1880411733 shortly. I'm not seeing any answers/examples of other ecosystems that would point to the necessity of a repository_url/registry (given that OSV ecosystems define a namespace that includes the registry in almost all cases).