sigstore / rekor

Software Supply Chain Transparency Log
https://sigstore.dev
Apache License 2.0

[FR] Bulk download between recent checkpoints? #2098

Open woodruffw opened 7 months ago

woodruffw commented 7 months ago

I had this idea while playing around with my own monitoring tool, curious to hear what the Rekor folks think 🙂 -- if you think it's too complicated or otherwise not worth the effort please close!

Description

Right now, a real-time log monitor might have an event loop like this:

  1. Persist the last observed checkpoint
  2. Wait until a new checkpoint appears
  3. Audit all entries in the range [old, new)

To do (3), the monitor calls /api/v1/log/entries/retrieve repeatedly for ranges of indices in [old, new), with each call handling a maximum of 10 indices. Typical checkpoint ranges currently include a few hundred entries, meaning that the retrieval loop takes a decent amount of time (and that monitoring requires more fallible network round-trips than strictly necessary).
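For illustration, the retrieval step might look roughly like the sketch below. It assumes that `old` and `new` are the tree sizes from the two checkpoints (so the new entries are the indices old through new - 1) and takes the 10-index limit from the description above. The names `index_batches`, `audit_range`, and `fetch_batch` are hypothetical; `fetch_batch` stands in for whatever HTTP call the monitor makes to the retrieve endpoint and is injected rather than implemented here.

```python
from typing import Callable, Iterator, List

BATCH_LIMIT = 10  # per-request maximum described above


def index_batches(old: int, new: int, limit: int = BATCH_LIMIT) -> Iterator[List[int]]:
    """Yield consecutive batches of log indices covering [old, new)."""
    for start in range(old, new, limit):
        yield list(range(start, min(start + limit, new)))


def audit_range(old: int, new: int,
                fetch_batch: Callable[[List[int]], list]) -> list:
    """Collect every entry in [old, new), one round-trip per batch.

    fetch_batch is a hypothetical callable that takes a list of log
    indices and returns the corresponding entries.
    """
    entries: list = []
    for batch in index_batches(old, new):
        entries.extend(fetch_batch(batch))
    return entries
```

With a span of 300 entries, `index_batches` yields 30 batches, so the monitor makes 30 round-trips where a bundled payload would need one.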

My proposal: For the last N checkpoints (pick N to balance size tradeoffs), Rekor could bundle the entries between adjacent checkpoints into singular payloads. These payloads could then be made available via an endpoint like /api/v1/log/entries/retrieve/by-checkpoints (or similar), where the request to that endpoint specifies the checkpoint span.
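As a rough sketch of the server-side bookkeeping this would imply (not a description of how Rekor stores anything), a bounded cache keyed by checkpoint span could look like the following. It assumes a checkpoint can be identified by its tree size; `CheckpointBundleCache` and its methods are hypothetical names, and entries are treated as opaque bytes.

```python
from collections import OrderedDict
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # (old tree size, new tree size); an assumed key


class CheckpointBundleCache:
    """Keep entry bundles for the most recent N checkpoint spans."""

    def __init__(self, max_spans: int) -> None:
        self._max_spans = max_spans
        self._bundles: "OrderedDict[Span, List[bytes]]" = OrderedDict()

    def add(self, old_size: int, new_size: int, entries: List[bytes]) -> None:
        """Store the bundle for [old_size, new_size), evicting the oldest."""
        if len(entries) != new_size - old_size:
            raise ValueError("bundle does not cover the whole span")
        self._bundles[(old_size, new_size)] = list(entries)
        while len(self._bundles) > self._max_spans:
            self._bundles.popitem(last=False)  # drop the oldest span

    def get(self, old_size: int, new_size: int) -> Optional[List[bytes]]:
        """Return the bundle for a span, or None if it is not cached."""
        return self._bundles.get((old_size, new_size))
```

A miss (`None`) would mean the span is older than the last N checkpoints, in which case a client would presumably fall back to the existing per-index retrieval.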

Pros:

Cons:

TL;DR: Rekor could bundle ranges between pairs of recent checkpoints to accelerate a common monitor retrieval pattern. This would reduce network traffic and improve monitor performance, at the cost of some additional storage and server complexity.

haydentherapper commented 7 months ago

Could this instead be a general purpose batch retrieval API, rather than specifically for checkpointing?

I had started implementation on this a while ago but didn't get a chance to finish. The only thing to deal with is deciding whether the index you're querying by is the "global" log index, meaning you need to handle cross-shard lookups, or the shard-specific index, meaning you need to specify a tree ID too. I would prefer the latter, though it does make the API look different than the other APIs that are shard-agnostic.
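To make the tradeoff concrete, here is a minimal sketch of the cross-shard lookup that a global-index API would need. It assumes the global index is simply the running total of entries across shards in order, oldest shard first; the `Shard` tuple, the function names, and the tree IDs in the usage note are made up for illustration.

```python
from typing import List, Tuple

# (tree ID, number of entries in that shard), oldest shard first.
Shard = Tuple[str, int]


def resolve_global_index(shards: List[Shard], global_index: int) -> Tuple[str, int]:
    """Map a global log index to (tree ID, shard-specific index)."""
    if global_index < 0:
        raise IndexError("negative log index")
    offset = 0
    for tree_id, size in shards:
        if global_index < offset + size:
            return tree_id, global_index - offset
        offset += size
    raise IndexError("log index beyond the end of the log")


def split_range_by_shard(shards: List[Shard], start: int,
                         end: int) -> List[Tuple[str, int, int]]:
    """Split the global range [start, end) into per-shard [lo, hi) ranges."""
    pieces = []
    offset = 0
    for tree_id, size in shards:
        lo = max(start, offset)
        hi = min(end, offset + size)
        if lo < hi:
            pieces.append((tree_id, lo - offset, hi - offset))
        offset += size
    return pieces
```

For example, with shards `[("tree-a", 100), ("tree-b", 50)]`, the global range [95, 105) splits into `("tree-a", 95, 100)` and `("tree-b", 0, 5)`. With a shard-specific API the server skips this step, because the client supplies the tree ID and the shard-local range directly.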

woodruffw commented 7 months ago

> Could this instead be a general purpose batch retrieval API, rather than specifically for checkpointing?

I think so, yeah! I emphasized checkpointing above because it's what I was looking at for my hacky monitor, but I see no reason why it needs to be constrained to that 🙂

> I would prefer the latter, though it does make the API look different than the other APIs that are shard-agnostic.

That makes sense to me -- my 0.02c is that I don't mind a slightly more complicated/shard-aware client side API if the retrieval performance is worth it!