Add type for generic package

eddiezane commented 2 years ago

Description

As language package managers (PyPI, RubyGems, etc.) begin to adopt sigstore for signing their packages, we may want a generic type that represents a "package".

The initial thought is to leverage https://github.com/package-url/purl-spec in this type.
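For readers unfamiliar with purl, here is a minimal sketch of the fields such a type could carry. It is a simplified stdlib-only parser for the `pkg:type/namespace/name@version?qualifiers#subpath` layout, not the reference implementation, and it skips the per-ecosystem normalization rules in the spec. The `PackageURL` class and `parse_purl` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional
from urllib.parse import parse_qsl, unquote


@dataclass(frozen=True)
class PackageURL:
    """Hypothetical container for the components of a purl."""
    type: str
    namespace: Optional[str]
    name: str
    version: Optional[str]
    qualifiers: Dict[str, str] = field(default_factory=dict)
    subpath: Optional[str] = None


def parse_purl(purl: str) -> PackageURL:
    """Simplified parse; ignores type-specific normalization rules."""
    scheme, sep, rest = purl.partition(":")
    if not sep or scheme.lower() != "pkg":
        raise ValueError(f"not a purl: {purl!r}")
    rest = rest.lstrip("/")
    # Strip components from the right: subpath, qualifiers, then version.
    rest, _, subpath = rest.partition("#")
    rest, _, query = rest.partition("?")
    path, at, version = rest.rpartition("@")
    if not at:  # no '@' present, so there is no version
        path, version = rest, ""
    segments = [unquote(s) for s in path.split("/") if s]
    if len(segments) < 2:
        raise ValueError(f"purl needs a type and a name: {purl!r}")
    return PackageURL(
        type=segments[0].lower(),
        namespace="/".join(segments[1:-1]) or None,
        name=segments[-1],
        version=unquote(version) or None,
        qualifiers={k.lower(): v for k, v in parse_qsl(query)},
        subpath=unquote(subpath).strip("/") or None,
    )
```

For example, `parse_purl("pkg:pypi/django@1.11.1")` yields type `pypi`, name `django`, and version `1.11.1`, which are the pieces a generic package type would most likely index.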

dlorenc commented 2 years ago

+1 to the idea. I think most can be handled with the standard rekord type but with custom index keys for searching.

znewman01 commented 2 years ago

See also #845

di commented 2 years ago

I'm trying to understand the motivation / use case here: is this just to add an additional search filter?

znewman01 commented 2 years ago

> is this just to add an additional search filter?

Mostly!

The Rekord and HashedRekord types have very minimal metadata—they just have the hash of the artifact and a signature.

When we've worked with other packages in the past (e.g., RPMs) it's been useful to stick things like the name of the package in there. This is nice for searching and to see the whole history of a package and do other interesting analyses.

We could just stick ad-hoc metadata in Rekords/HashedRekords but we were thinking that it seems silly if PyPI uses slightly different field names from RubyGems, for instance.
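To make the "same field names everywhere" point concrete, here is a rough sketch of one shared metadata record and the search keys that could be derived from it. This is not Rekor's actual schema or index format; `PackageMetadata`, `index_keys`, `InMemoryIndex`, and the key prefixes are all assumptions made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class PackageMetadata:
    """Hypothetical shared fields, identical for every ecosystem."""
    ecosystem: str  # e.g. "pypi" or "gem"
    name: str
    version: str
    sha256: str


def index_keys(meta: PackageMetadata) -> List[str]:
    """Derive search keys with one naming scheme for all ecosystems."""
    return [
        f"ecosystem:{meta.ecosystem}",
        f"package:{meta.ecosystem}/{meta.name}",
        f"release:{meta.ecosystem}/{meta.name}@{meta.version}",
        f"sha256:{meta.sha256}",
    ]


class InMemoryIndex:
    """Toy stand-in for a search index: key -> set of entry IDs."""

    def __init__(self) -> None:
        self._by_key: Dict[str, Set[str]] = defaultdict(set)

    def add(self, entry_id: str, meta: PackageMetadata) -> None:
        for key in index_keys(meta):
            self._by_key[key].add(entry_id)

    def search(self, key: str) -> List[str]:
        # .get avoids creating empty buckets for unknown keys.
        return sorted(self._by_key.get(key, ()))
```

With this shape, searching `package:pypi/requests` returns every entry for that package across versions, and the same query pattern works unchanged for a RubyGems package.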

di commented 2 years ago

In that case, I guess my follow-up question is: what kind of expectations are we setting around whether the metadata for a generic or ecosystem-specific package type is "correct" or not?

E.g. if I add an entry with a PyPI PURL, are there any guarantees that a) it's actually describing a Python package b) it's actually the package in question c) I have any right to publish entries about that package?

znewman01 commented 2 years ago

Yeah, that's a great question. I think it's entirely out-of-scope for Rekor to do such validation, because we'd have to teach it about each language ecosystem, and I'd like to strongly discourage any applications that assume that "artifact X is in Rekor" implies "artifact X is authentic."

This of course means that we have to plan for malicious entries in the log, and deal with spam. This shouldn't be a problem for package manager clients trying to check authenticity (since we're not relying on Rekor being a first-class data source), but it could be a problem for search/other analysis.

So the question is: are the benefits of this metadata worthwhile even with the possibility of spam/malicious artifacts? I think yes. I don't expect users to be doing this searching directly; it's more for analysis by repository operators, security researchers, and monitors. And it'd be possible, by cross-referencing the repositories and Fulcio, to filter out improper entries.
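As a rough sketch of what that cross-referencing could look like for a monitor: keep a log entry only if the repository itself lists the same digest for that package and the signer is one the monitor expects. Everything here is hypothetical (the `LogEntry` fields, the `filter_improper` function, and the two lookup tables); in practice the digests would come from the package repository and the signer identity from the Fulcio-issued certificate.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set


@dataclass(frozen=True)
class LogEntry:
    """Hypothetical view of a log entry after its signature is verified."""
    purl: str
    sha256: str
    signer: str  # identity taken from the signing certificate


def filter_improper(
    entries: Iterable[LogEntry],
    repo_digests: Dict[str, str],
    expected_signers: Dict[str, Set[str]],
) -> List[LogEntry]:
    """Drop entries that disagree with the repository or the signer list."""
    kept = []
    for entry in entries:
        # The repository must know this purl with exactly this digest.
        if repo_digests.get(entry.purl) != entry.sha256:
            continue
        # The signer must be one that is expected for this package.
        if entry.signer not in expected_signers.get(entry.purl, set()):
            continue
        kept.append(entry)
    return kept
```

Note that this only filters results for analysis; it does not make "the entry is in the log" a statement about authenticity.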

Would be curious if anybody else has a different take.

Getting more speculative—an alternative approach might be a model where each package repository maintains its own, independent ledger that's stored on Rekor. So then I can query Rekor for "all package records with scheme 'python' that's signed by " and see the complete history there.

lkatalin commented 2 years ago

Thanks all for this helpful discussion!

> E.g. if I add an entry with a PyPI PURL, are there any guarantees that a) it's actually describing a Python package b) it's actually the package in question c) I have any right to publish entries about that package?

> Yeah, that's a great question. I think it's entirely out-of-scope for Rekor to do such validation, because we'd have to teach it about each language ecosystem, and I'd like to strongly discourage any applications that assume that "artifact X is in Rekor" implies "artifact X is authentic."

@bobcallaway This is the question I was trying (and perhaps failing) to ask in this thread - if it's out of scope for Rekor to check (b) specifically, it means we don't necessarily need Rekor to validate that an RPM header matches a specific payload during artifact upload (which has implications for future rpmv4 support). I thought from your answer that this validation was in scope, but maybe I misread, or maybe it depends on the Rekor type? (And if so, do we want different validation behavior depending on whether an RPM for example is uploaded as an RPM type, a generic package type, or a rekord / hashedrekord type?)

I'm still getting familiar with all the different types Rekor supports and their guarantees, so apologies if I'm confusing anything.

di commented 2 years ago

> Yeah, that's a great question. I think it's entirely out-of-scope for Rekor to do such validation, because we'd have to teach it about each language ecosystem, and I'd like to strongly discourage any applications that assume that "artifact X is in Rekor" implies "artifact X is authentic."

I agree, which makes me feel like this would be kind of meaningless and potentially full of false positives unless we had some other, verified attribute that we could filter with... such as a signature from an identity for the project provided by the ecosystem's IdP.

So maybe this is only something Rekor would support for repository IdPs it supports (which is currently 0, but should be non-zero soon).
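A minimal sketch of that restriction, assuming each entry exposes the ecosystem it claims and the issuer of the identity that signed it. The `verified_only` function, the tuple layout, and the issuer URL are hypothetical placeholders, not real endpoints or Rekor behavior.

```python
from typing import Dict, Iterable, List, Set, Tuple

# (claimed ecosystem, package name, issuer of the signing identity)
Entry = Tuple[str, str, str]


def verified_only(
    entries: Iterable[Entry],
    trusted_issuers: Dict[str, Set[str]],
) -> List[Entry]:
    """Keep entries whose issuer is a supported IdP for their ecosystem."""
    return [
        entry
        for entry in entries
        # Ecosystems with no supported IdP match nothing.
        if entry[2] in trusted_issuers.get(entry[0], set())
    ]


# Hypothetical configuration: one supported repository IdP.
TRUSTED = {"pypi": {"https://idp.pypi.example"}}
```

With an empty `trusted_issuers` mapping (the current situation described above), every entry is filtered out, which matches the idea that the type would only be meaningful once at least one repository IdP is supported.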

jchestershopify commented 2 years ago

I think we'd need to solve the problem of misleading entries (which I hadn't considered, shame on me) before we could roll this out.

Another issue to consider is that we'd want the type to be (1) detailed enough to be useful for multi-ecosystem monitoring, while (2) modest enough that ecosystems can extend it to fit their own cases without impedance mismatches. That will be a tricky line to walk.

sigstore / rekor

Add type for generic package #804