sigstore / model-transparency

Supply chain security for ML
Apache License 2.0
84 stars 24 forks source link

APIs design #172

Open laurentsimon opened 2 weeks ago

laurentsimon commented 2 weeks ago

This issues proposes what the (long-term) APIs will look like. Looking for comments, nothing set in stone.

1. Hash engine

Tracked in https://github.com/sigstore/model-transparency/issues/140 Why:

class hash_engine:
  def name() -> str:
  # Shard size
  def shard() -> int
  # Chunk size
  def chunk() -> int
  # Standard hash APIs
  def update(...):..
  def final():..

We will provide default hash engine that we use in this library, with the possibility to customize its parameters:

class shah256pv1:
   def __init__(self, shard=1000000, chunk=memory_available())
   # All class function from hash_engine

A hash name is parameterized. I suggest something simple like <name>$param1$param2.... For the existing sha256p, it could be sha256pv1$1000000 for a shard of 1 GB.

2. Serializer

This will serialize a model (folder or file). Why: some callers may want to serialize models using our library but not sign it with our library.

class serializer:
   def __init__(self, hash: hash_engine)
      ...
   def run(
          # Path is a file-system path.
          # PathIter is a path iterator, for callers who want to keep things in memory.
          input: Path | PathIter,
          # A list of paths to recompute Tracked in https://github.com/sigstore/model-transparency/issues/160
          recompute_paths: []Path = None,
          # HashedPathIter is an iterator for path+hash computed
          ) -> HashedPathIter:

     ...
)

3. Signer / Verifier

This will create a generic signer class that can be instantiated for Sigstore, PKI, etc.

class Signer:
  def sign(
          # Path is a file-system path.
          # PathIter is a path iterator, for callers who want to keep things in memory.
          # HashedPathIter is an iterator for pre-computed path+hash.
          inputs: Path | PathIter | HashedPathIter,
    ) -> bytes
     ...

class Verifier:
  def verify(
          input: Path | PathIter | HashedPathIter,
          sig: Path | bytes,
          ) -> bool

Example for Sigstore:

# sigstore.py
class SigstoreSigner(Signer)
  def __init__(self, rekor_url, fulcio_cert, ...)
  def sign(input) -> bytes

class SigstoreVerifier(Verifier)
  def __init__(self, rekor_url, fulcio_tuf, ...)
  def verify(input, sig) -> bool

Example for PKI:

# pki.py
class PKISigner(Signer)
  def __init__(self, certs_chain, key, ...)
  def sign(input) -> bytes

class PKIVerifier(Verifier)
  def __init__(self, certs_chain, ...)
  def verify(input, sig) -> bool

4. model sign / verify

This is the main entry point for callers to use.

# model.py
def sign(
      input: Path | PathIter | HashedPathIter,
      sig: Path  = defaultSigPath(input),
      signer: Signer = SigstoreSigner(),
      he: hash_engine = shah256pv1(),
      recompute_paths: []Path = None,
      ignored_paths: []Path = [".git"],
    ) -> bytes
   # High level logic.
   s = serializer(he)
   iter = s.run(input, recompute_paths)
   m = manifest(iter)
   c = signer.sign(m)
   sig.write(c)
   return c

def verify(
      input: Path | PathIter | HashedPathIter,
      sig: Path | bytes = defaultSigPath(input),
      verifier: Verifier = SigstoreVerifier(),
      he: hash_engine = shah256pv1(),
      ignored_paths: []Path = [".git"],
    ) -> bool
   # High level logic.
   s = serializer(he)
   iter = s.run(input, recompute_paths)
   m = manifest(iter)
   return verifier.verify(m, sig)
laurentsimon commented 2 weeks ago

@susperius thoughts?

mihaimaruseac commented 2 weeks ago

I think we also need some lower level API for incremental hashing.

laurentsimon commented 2 weeks ago

I think the serializer API takes care of it, it takes as input recompute_paths, more generally it takes as input whatever we decide in https://github.com/sigstore/model-transparency/issues/160.

mihaimaruseac commented 1 week ago

For the hash engine: does it take as input a file to hash or a memory string? Having shard and chunk signals a file-based API, but update and final signal a memory string, reading from other place.

mihaimaruseac commented 1 week ago

For the hash engine: does it take as input a file to hash or a memory string? Having shard and chunk signals a file-based API, but update and final signal a memory string, reading from other place.

In #188, I moved shard to be only informative here. Working on the next level of API where shard is actually relevant.

If we kept shard at the hash-an-object level where we would have implemented the multi-process queue, then we could have gone to a state where too many threads are started.

udaysavagaonkar commented 3 days ago

How is the hash_engine expected to be used? Will the serializer call update() and final() methods on the hash_engine? If that is the case, why does the hash_engine need to know what the shard size and chunk size is? I am sure I am missing something.

laurentsimon commented 3 days ago

Good observation, I think @mihaimaruseac had a related comment in https://github.com/sigstore/model-transparency/issues/172#issuecomment-2122898110

I think it depends what level of abstraction we're talking about: the underlying (traditional) hash engine (sha256, etc), the shard-aware hash, something else. The shard-aware hash engine would use a lower-level hash engine under the hood (eg sha256). It may also depends if we're talking about streaming engine or not.

https://github.com/sigstore/model-transparency/pull/188 has moved out the chunk size and shard out of the engine for now so we'll see if the original API def is needed or not. PTAL we'd love your feedback too

mihaimaruseac commented 3 days ago

Will the serializer call update() and final() methods on the hash_engine?

That is correct. I'll implement this in #190 (WIP at the moment, but should push a new change tomorrow-ish).

If that is the case, why does the hash_engine need to know what the shard size and chunk size is? I am sure I am missing something.

Chunk size doesn't matter, see #188. Shard size matters, but more just for verifications.

font commented 2 days ago

Based on the PKISigner and PKIVerifier examples, it looks like we're looking to add support for both keyed and keyless signing flows, and I think we should. If so, the PKISigner and PKIVerifier classes need to account for supplying the necessary keys. Not sure how official this API spec is, but just wanted to call it out JIC.

laurentsimon commented 2 days ago

The keyless flow will be handled by SigstoreSigner / SigstoreVerifier (with optional fulcio / reko parameters for private developments). The PKISigner / PKIVerifier is exclusively for "private" PKI deployment, ie existing PKI using existing "off the shelf" CAs or custom CAs (possibly using TUF-managed keys). See a PoC for the PKI https://github.com/sigstore/model-transparency/pull/177. We'll need to support TUF-managed and HSM keys: the API in the issue description has a simple private key bytes input but I think we'll need something like a KeySigner / KeyVerifier interface instead (providing raw sign / verify). /cc @udaysavagaonkar @susperius

There may be a case for supporting sigstore keyed flow, but I don't think we're prioritizing it atm.

Not sure how official this API spec is, but just wanted to call it out JIC.

Nothing official, just the overall direction :) Feel free to comment.

mihaimaruseac commented 1 day ago

+1, the proposal in the first message here is mostly informative, as you see in #188/#190 things are changing. For #190 I still need to find way to incorporate the current serialization format, but I have some ideas on that, which I'll test by next week.

Any comments and suggestions for improvement, as well as PR reviews, are very very welcome