sigstore / rekor

Software Supply Chain Transparency Log
https://sigstore.dev
Apache License 2.0

Proposal: Use GCP Datastream to ensure consistency of Redis entries #1633

Open · jalseth opened this issue 1 year ago

jalseth commented 1 year ago

Description

There have been times when Rekor successfully writes a new entry to the transparency log but fails to write the corresponding index entry to Redis. Adding retry logic to the Rekor API would reduce the frequency of this, but there will always be edge cases where the API server shuts down before it can write to Redis and loses its in-memory retry queue.

Rather than relying on the API server to guarantee writes to Redis, we can treat the MySQL database as the source of truth. GCP Datastream is a serverless change data capture offering that integrates with databases and emits events when writes occur. Datastream does not currently support sending events straight to GCP Pub/Sub (this is a work in progress), but it does support writing to GCS, and GCS write events can trigger Pub/Sub notifications that are consumed by a new job. The new job would only ack the Pub/Sub messages after the entry was successfully written to Redis.
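Roughly, the consumer could look like the sketch below. The subscription name, the change-event shape, and the writeIndexEntry helper are all hypothetical; the bucketId/objectId attributes are the standard ones GCS attaches to its Pub/Sub notifications.

```go
package main

import (
	"bufio"
	"context"
	"encoding/json"
	"log"

	"cloud.google.com/go/pubsub"
	"cloud.google.com/go/storage"
	"github.com/redis/go-redis/v9"
)

// indexEvent is a hypothetical shape for one change event in the
// Datastream output; the real schema would come out of the design doc.
type indexEvent struct {
	Key  string `json:"key"`
	UUID string `json:"uuid"`
}

func writeIndexEntry(ctx context.Context, rdb *redis.Client, raw []byte) error {
	var ev indexEvent
	if err := json.Unmarshal(raw, &ev); err != nil {
		return err
	}
	// LPUSH mirrors how the API server populates the search index today.
	return rdb.LPush(ctx, ev.Key, ev.UUID).Err()
}

func main() {
	ctx := context.Background()

	ps, err := pubsub.NewClient(ctx, "my-project") // hypothetical project ID
	if err != nil {
		log.Fatal(err)
	}
	gcs, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatal(err)
	}
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

	sub := ps.Subscription("rekor-index-events") // hypothetical subscription
	err = sub.Receive(ctx, func(ctx context.Context, msg *pubsub.Message) {
		// GCS notifications carry the bucket and object name as attributes.
		rc, err := gcs.Bucket(msg.Attributes["bucketId"]).
			Object(msg.Attributes["objectId"]).NewReader(ctx)
		if err != nil {
			msg.Nack() // redeliver; the object may not be readable yet
			return
		}
		defer rc.Close()

		// Assume one JSON change event per line in the staged object.
		scanner := bufio.NewScanner(rc)
		for scanner.Scan() {
			if err := writeIndexEntry(ctx, rdb, scanner.Bytes()); err != nil {
				msg.Nack() // ack only after every entry is in Redis
				return
			}
		}
		if err := scanner.Err(); err != nil {
			msg.Nack()
			return
		}
		msg.Ack()
	})
	if err != nil {
		log.Fatal(err)
	}
}
```

One design point worth flagging: Pub/Sub redelivers unacked messages, so the Redis writes would need to be idempotent or deduplicated.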

Open Questions:

- Should the API server still attempt to write to Redis, with this flow used only for reconciliation, or should it not write to Redis at all?
- How do we prevent abuse by principals with the ability to write to the GCS bucket? Can we rely on IAM, or should the Pub/Sub consumer job validate each entry against Rekor's signing keys?
- How do we handle the lifecycle of the temporary GCS objects? Is deleting all objects older than N days sufficient?

haydentherapper commented 1 year ago

I very much like this idea to use the DB as the source of truth and rely on GCP to guarantee entry upload side effects occur.

Should the API server still attempt to write to Redis, with this flow used only for reconciliation, or should the API server not write to Redis at all and rely on this flow?

Given this feature would be exclusive to GCP, supporting both paths would be ideal for those who only want to use Redis. I would disable direct writes to Redis when a flag indicates Datastream is in use.

How do we prevent abuse by principals with the ability to write to the GCS bucket? Can we rely on IAM, or should the Pub/Sub consumer job validate each entry against Rekor's signing keys?

I think IAM should be sufficient, though this should be fleshed out in a design.
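If we wanted defense in depth beyond IAM, the consumer could verify each staged entry against Rekor's public key before indexing it. A minimal sketch, assuming an ECDSA signing key and an illustrative payload/signature layout rather than Rekor's actual wire format:

```go
package main

import (
	"crypto/ecdsa"
	"crypto/sha256"
	"errors"
)

// verifyEntry returns nil only if sig is a valid ASN.1 ECDSA signature
// by pub over the SHA-256 digest of payload.
func verifyEntry(pub *ecdsa.PublicKey, payload, sig []byte) error {
	digest := sha256.Sum256(payload)
	if !ecdsa.VerifyASN1(pub, digest[:], sig) {
		return errors.New("entry does not verify against Rekor's key")
	}
	return nil
}
```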

How do we handle the lifecycle of the temporary GCS objects? Is deleting all objects older than N days sufficient?

We should also consider multi-day outages. For example, if Pub/Sub is down for a few days, what happens if the temporary object has already been deleted from GCS? Do we need a job to delete old entries from GCS?
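For the age-based option, the retention window just has to comfortably outlast any outage we expect to recover from without a backfill. A sketch of setting such a rule with the GCS Go client; the bucket name and the 7-day window are assumptions:

```go
package main

import (
	"context"
	"log"

	"cloud.google.com/go/storage"
)

func main() {
	ctx := context.Background()
	client, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatal(err)
	}
	// Delete staged Datastream objects 7 days after creation.
	_, err = client.Bucket("rekor-datastream-staging"). // hypothetical bucket
		Update(ctx, storage.BucketAttrsToUpdate{
			Lifecycle: &storage.Lifecycle{
				Rules: []storage.LifecycleRule{{
					Action:    storage.LifecycleAction{Type: storage.DeleteAction},
					Condition: storage.LifecycleCondition{AgeInDays: 7},
				}},
			},
		})
	if err != nil {
		log.Fatal(err)
	}
}
```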

bobcallaway commented 1 year ago

I like the idea as well.

This pattern could potentially also be supported with an OSS stack like Debezium when Rekor runs in other environments.
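Roughly the same shape: Debezium streams MySQL binlog changes into Kafka, and a consumer commits offsets only after the Redis write succeeds. A sketch assuming a hypothetical topic name and an illustrative writeIndexEntry helper:

```go
package main

import (
	"context"
	"log"

	"github.com/redis/go-redis/v9"
	"github.com/segmentio/kafka-go"
)

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	r := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"localhost:9092"},
		GroupID: "rekor-redis-indexer",
		Topic:   "rekor.trillian.entries", // hypothetical Debezium topic
	})
	defer r.Close()

	for {
		// FetchMessage does not auto-commit; offsets advance only via
		// CommitMessages below, after the Redis write succeeds.
		m, err := r.FetchMessage(ctx)
		if err != nil {
			log.Fatal(err)
		}
		if err := writeIndexEntry(ctx, rdb, m.Value); err != nil {
			log.Printf("redis write failed, will re-consume: %v", err)
			continue // the uncommitted message is redelivered after restart
		}
		if err := r.CommitMessages(ctx, m); err != nil {
			log.Fatal(err)
		}
	}
}

func writeIndexEntry(ctx context.Context, rdb *redis.Client, event []byte) error {
	// Placeholder: parse the Debezium change event and issue the same
	// LPUSH writes the API server performs today.
	return rdb.LPush(ctx, "hypothetical-key", event).Err()
}
```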

jalseth commented 1 year ago

Great! I didn't realize there was an OSS offering in this space.

I'll throw together a small design doc and we can discuss further, including the potential impact of a compromised GCS bucket.