Clarify that digests don't have to be cryptographic ones.

TomHennen commented 6 months ago

There are uses for DigestSets that don't involve cryptographic hashes and I don't think we ever meant to preclude those uses?

(This PR motivated by this discussion and another I've had with @laurentsimon).

Sometimes users have a need to refer to something by some other immutable identifier. Either because the content can't be hashed traditionally, because it's impractical to hash traditionally, or because they interact with the content through an interface that doesn't expose them to the entirety of the content.

In these situations users may wish to use other identifiers in a DigestSet. Those users should be careful to understand the trust that they're placing in the identifier to be sure that it meets their needs.

One concrete example of where a non-cryptographic hash can be useful is when referring to Virtual Machine images. Often these images are very large (so impractical to run a cryptographic hash over) and users often interact with them via APIs that the platform provides that don't involve the user having complete custody of the content. Platforms like AWS and GCP provide 'ids' for users to use when referring to these images. A user may say something like "create an instance with image 123". In that case the user doesn't actually have the bits that correspond to 'image 123' so they cannot digest it themselves. And by the time the image has started it can be difficult, if not impossible, to digest the original content that was used to boot the instance (this is probably a good use case for secure boot, but is out of scope here :)).

These IDs can often be treated as immutable and may be perfectly suited to users threat profiles. Allowing DigestSets to use these types of identifiers would allow providers to make statements about the content of these VM images using the identifiers their users have ready access to.

In addition, using an ID like this does not preclude including a cryptographic hash in the DigestSet as well. If possible including both may provide the most flexibility for the user's various use cases.

JustinCappos commented 6 months ago

How do you enforce immutability in this case? Or, to ask another way, why is immutability important if you're relaxing the constraint on cryptographically verification and trusting a third party?

TomHennen commented 6 months ago

How do you enforce immutability in this case? Or, to ask another way, why is immutability important if you're relaxing the constraint on cryptographically verification and trusting a third party?

Both good questions.

I'd expect the underlying implementation to ensure immutability to the extent required by users. It could be something like "there's a monotonic system that assigns new IDs for each change (like subversion does)" to something more complicated.

why is immutability important

Maybe it's not? Maybe it would be enough to say "with custom types the type must document what the properties are (immutable, cryptographic digest, ...)"? Mostly I'm trying to keep this change as small as needed but I don't think it's actually needed.

adityasaky commented 6 months ago

I think some examples and discussion in-spec about the change in properties would be helpful. I'm reminded a little of this TAP but I think the proposal here is looser still.

TomHennen commented 6 months ago

I think some examples and discussion in-spec about the change in properties would be helpful. I'm reminded a little of this TAP but I think the proposal here is looser still.

How do you think that would be best handled here? Inline in spec/v1/digest_set.md or in some new place? (I don't think in-toto attestations has such a place yet?)

TomHennen commented 6 months ago

Friendly ping?

SantiagoTorres commented 6 months ago

My understanding is that this can be handled by ITE-4 and addressed by the type of "transport" that's being used. We've been leaving how to handle these to the specification, given that there were not a lot of "non file-hash" based identifiers. I assume this is something we could start standardizing/settling along on attestation or somewhere else.

All in all, each "hashable" would need its own security analysis. I understand that some cases use identifiers which, essentially, transfer trust to a third party identifier (e.g., using an SVN checkout id). I think we could taxonomize this, as this will probably fall within just a handful of cases.

TomHennen commented 6 months ago

My understanding is that this can be handled by ITE-4 and addressed by the type of "transport" that's being used. We've been leaving how to handle these to the specification, given that there were not a lot of "non file-hash" based identifiers. I assume this is something we could start standardizing/settling along on attestation or somewhere else.

I'm not sure ITE-4 is a good fit in this case. From what I can tell ITE-4 provides an alternate way to name things, but it then still associates those things with the hash of their contents.

In this case we have users that need to define an alternative method to refer to the contents. E.g.

{
  "sha256": "abcd1234...",
  "vm_image_id": "12345",
}

The path to the image as represented elsewhere could be accomplished via ITE-4 though.

All in all, each "hashable" would need its own security analysis. I understand that some cases use identifiers which, essentially, transfer trust to a third party identifier (e.g., using an SVN checkout id). I think we could taxonomize this, as this will probably fall within just a handful of cases.

I agree that it could be taxonomized. I'd be curious to know if you think coming up with that taxonomy would be a blocker for this change or simply future work?

Also, do you think that such analysis would be needed for users that are using their own 'non-predefined' algorithm or only if the specific algorithms are 'standardized' in this spec?

TomHennen commented 6 months ago

Also, do you think that such analysis would be needed for users that are using their own 'non-predefined' algorithm or only if the specific algorithms are 'standardized' in this spec?

Perhaps it would be worth making an additional change in this PR that makes it more clear that users MAY define and use their own algorithms?

SantiagoTorres commented 6 months ago

I was wondering if that's in order, perhaps something that overrides ITE-4 with a better security notion about hashers/algorithms.

Technically, a collission resistant hash can be replaced with something else (e.g., an artifact directory) under a different, more centralized threat model. In that sense, then the hash objects become just raw identifiers from a particular location. It'd warrant a bit more exploration but I can see it opening doors in some cases

TomHennen commented 6 months ago

I was wondering if that's in order, perhaps something that overrides ITE-4 with a better security notion about hashers/algorithms.

Technically, a collission resistant hash can be replaced with something else (e.g., an artifact directory) under a different, more centralized threat model. In that sense, then the hash objects become just raw identifiers from a particular location. It'd warrant a bit more exploration but I can see it opening doors in some cases

Can you say what you're looking for in this PR more specifically?

marcelamelara commented 5 months ago

I don't think I've seen this discussed yet: I think the piece that I'm skeptical about is why the DigestSet is the most suitable field in an RD for this type of ID, even though the spec does allow custom digest algorithms. Maybe I'm missing something, is it because you still expect this ID to have the immutability semantics of a cryptographic digest? If not, could this non-cryptographic ID be specified in the name field to more clearly denote that it doesn't have the same semantics as a cryptographic digest?

TomHennen commented 5 months ago

is it because you still expect this ID to have the immutability semantics of a cryptographic digest?

Yes I do. That's definitely the case for GCE images and I suspect it's also the case for the Amazon ones.

FWIW the current PR does use 'immutable' and I'd be happy to keep it that way.

I'm realizing this was discussed in https://github.com/in-toto/attestation/issues/5.

I have a preference to use 'digest' for this purpose because:

The spec currently requires digest to be set (which could be changed)
The spec says things should matched purely by digest (which could be changed)
Infrastructure and tooling that is just looking for something that matches can no longer query just by digest type/value it must now conditionally support looking in other fields too.

I'm not really sure what the downsides would be apart from the name being somewhat confusing. Users still need to provide the digests they're planning to match against so I don't think people would wind up accidentally using something?

I'm curious to get @MarkLodato's thoughts though as he had a lot of input on #5.

MarkLodato commented 5 months ago

I'm OK abusing digest here. Basically we'd be saying that a subject is an immutable reference to a collection of artifacts, identified via digest. The strongly recommended, foolproof way to do this is via a cryptographic digest, which is why we call the field digest. However, if you know what you're doing, you can put some other non-cryptographic-digest identifier in that field, so long as it has the semantics of being an immutable reference and you accept the risks of the immutability breaking.

For example, GCE image IDs are effectively immutable unless something goes horribly wrong within GCE. That seems like an acceptable risk, and something I'd personally accept. You might not, so don't accept it, just like you might not accept SHA-1 or even SHA-256.

I don't like the idea of using another field because it makes the semantics so much more complicated, as @TomHennen said. Always matching by digest is straightforward. Let's keep it simple for the common case.

Regarding immutability, I think it's an inherent property of subject. Statement is an assertion that subject has property predicate without any sort of time bound, and that can only logically work if subject is immutable. (The predicate itself can have a time bound, but that is not a concept at the Statement layer.)

If you all agree, then I think we can fix this issue and #5 by updating the spec as such.

marcelamelara commented 5 months ago

@TomHennen @MarkLodato Thanks for your comments. I'm convinced that the digest field makes the most sense for any immutable identifier.

Per Mark's suggestion, this PR should also update the Statement level spec to make the immutability assumption for the subject crystal clear, closing out #5.

TomHennen commented 5 months ago

Thanks!

Updated statement, added rationale to the DigestSet docs per @adityasaky (but maybe he was suggesting something else?)

TomHennen commented 5 months ago

Thanks, Tom. I approve of the overall change. I just have some minor suggestions to improve clarity of the PR and content.

Thanks for the comments and suggestions. Done.

TomHennen commented 5 months ago

LGTM @TomHennen . Before I approve, can you please double-check that we don't need to update the spec version to 1.1 anywhere else? I want to make sure we don't end up with version inconsistencies across the spec docs.

Ah, so do we want to rev everything under spec/v1 to 1.1? It's actually not clear to me if the version at the top is for that document or for the attestation spec. I could see it going either way.

If it is for the whole spec perhaps I should remove the 'change logs' from each doc and have a centralized changelog.md to document those things?

marcelamelara commented 5 months ago

Ah, so do we want to rev everything under spec/v1 to 1.1? It's actually not clear to me if the version at the top is for that document or for the attestation spec. I could see it going either way.

Yeah, I'm actually not sure why we decided to have a version at the top for each document, my interpretation was that it matched the attestation spec, though maybe I forgot other reasons we had. But I had imagined that any semantic changes to the Statement layer, at least, would rev the entire attestation spec. And when I thought about versioning each layer individually, I wasn't sure we wanted to get into the territory of having mismatched versions of fields within a Statement. As in, what would it mean to have a v1.1 subject alongside a v1.0 (predicate layer)[https://github.com/in-toto/attestation/blob/main/spec/v1/predicate.md] that then contains v1.1 DigestSets? It's a little trickier with the bundle spec, since that isn't associated with a schema, per se. In any case, imo versioning each document within spec/v1 separately introduces more complexity than we need.

TomHennen commented 5 months ago

Ah, so do we want to rev everything under spec/v1 to 1.1? It's actually not clear to me if the version at the top is for that document or for the attestation spec. I could see it going either way.

Yeah, I'm actually not sure why we decided to have a version at the top for each document, my interpretation was that it matched the attestation spec, though maybe I forgot other reasons we had. But I had imagined that any semantic changes to the Statement layer, at least, would rev the entire attestation spec. And when I thought about versioning each layer individually, I wasn't sure we wanted to get into the territory of having mismatched versions of fields within a Statement. As in, what would it mean to have a v1.1 subject alongside a v1.0 (predicate layer)[https://github.com/in-toto/attestation/blob/main/spec/v1/predicate.md] that then contains v1.1 DigestSets? It's a little trickier with the bundle spec, since that isn't associated with a schema, per se. In any case, imo versioning each document within spec/v1 separately introduces more complexity than we need.

Ok, added CHANGELOG.md, updated version in README.md, removed 'version' tags from the other docs.

Leaving the folder as just v1 because that seems to be what we decided before?

marcelamelara commented 5 months ago

Leaving the folder as just v1 because that seems to be what we decided before?

Right. I think we decided we wanted to keep everything within the major version backwards compatible.

TomHennen commented 5 months ago

Thanks all!

in-toto / attestation

Clarify that digests don't have to be cryptographic ones. #338