Streams
[x] Record the stream's last compaction timestamp, and compare it on startup; base the initial delay on the delta. This ensures that a periodically restarting stream partition (which shouldn't happen) doesn't miss compaction. See the truncation sketch below the list.
[x] Per-partition compaction based on id+source.
Should we preserve only the latest event, or only the original, since a later event carrying the same id+source is by definition a duplicate per the CloudEvents 1.0 spec?
Should we skip this entirely? It may add little value given the other compaction strategies, and it does not actually guarantee that duplicate events never appear across partitions.
Decision: not implementing this bit, as we gain little benefit and sacrifice some performance on the write path.
[x] Per-partition timestamp-based truncation: as events age past the TTL threshold, they are truncated.
[x] Event batches should have a timestamp written to a secondary index, recording the offset of each batch's last event.
[x] Stream CRDs should be updated to include a compaction policy sub-structure (sketched below). Users should be able to specify the retention policy; only time-based retention is currently supported.
[x] The Stream controller should check the earliest entry in the timestamp secondary index, and when its age exceeds the configured retention policy, spawn a task to prune the old data (see the truncation sketch below).
[x] Update the operator to pass retention policy data along to the stream StatefulSets.
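A minimal sketch of what the compaction policy sub-structure on the Stream CRD might look like as a serde type; the field and type names here are hypothetical, not the actual schema:

```rust
use serde::{Deserialize, Serialize};

/// Hypothetical retention policy sub-structure for the Stream CRD spec.
#[derive(Clone, Debug, Serialize, Deserialize)]
#[serde(rename_all = "camelCase")]
pub struct RetentionPolicy {
    /// The retention strategy to apply to the stream's partitions.
    pub strategy: RetentionStrategy,
    /// Retention window in seconds, used by time-based retention.
    #[serde(default, skip_serializing_if = "Option::is_none")]
    pub retention_seconds: Option<u64>,
}

/// Supported retention strategies; only time-based retention for now.
#[derive(Clone, Debug, Serialize, Deserialize)]
#[serde(rename_all = "camelCase")]
pub enum RetentionStrategy {
    /// Truncate events older than `retention_seconds`.
    Time,
}
```

The operator could then serialize this value and hand it to the stream StatefulSet pods (for example via an environment variable), so each partition can configure its own truncation task.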
Observations: for any system that does not integrate transactionally with the Hadron Stream itself, there is no way to guard against duplicate re-processing other than the transactional processing model.
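Putting the truncation items above together, here is a minimal sketch of the per-partition logic, with a BTreeMap standing in for the storage engine's secondary index; all names are illustrative, and offsets are assumed to increase monotonically with time:

```rust
use std::collections::BTreeMap;
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Secondary index: batch timestamp (secs) -> offset of the batch's last event.
type TsIndex = BTreeMap<u64, u64>;

fn now_secs() -> u64 {
    SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_secs()
}

/// Compute the initial compaction delay on startup from the recorded last
/// compaction timestamp, so that a restarting partition can't miss a run.
fn initial_delay(last_compaction: u64, interval: Duration) -> Duration {
    let elapsed = now_secs().saturating_sub(last_compaction);
    interval.saturating_sub(Duration::from_secs(elapsed))
}

/// Prune all batches whose timestamp has aged past the TTL window.
/// Returns the greatest offset eligible for truncation, if any.
fn prune(index: &mut TsIndex, ttl: Duration) -> Option<u64> {
    let horizon = now_secs().saturating_sub(ttl.as_secs());
    // Entries at or after the horizon stay live; the remainder is prunable.
    let live = index.split_off(&horizon);
    // The greatest last-offset among pruned batches bounds the truncation
    // range; a real implementation would delete events `..=offset` from the
    // stream transactionally along with these index entries.
    let last_pruned = index.values().copied().max();
    *index = live;
    last_pruned
}

fn main() {
    let mut index = TsIndex::new();
    index.insert(now_secs() - 7200, 100); // batch from 2h ago, last offset 100
    index.insert(now_secs(), 250);        // fresh batch, last offset 250
    let cut = prune(&mut index, Duration::from_secs(3600));
    println!("truncate through offset {:?}", cut); // Some(100)
    println!("next run in {:?}", initial_delay(now_secs(), Duration::from_secs(600)));
}
```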
Pipelines
[x] Pipeline stage data should be deleted once the entire Pipeline instance is complete.
[x] Pipelines need to transactionally copy their root event to account for cases where the root event may be compacted away. Alternatively, we could do a pipeline offsets check before we compact a range. See the sketch below.
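Both options can be sketched as follows, again with BTreeMaps standing in for the storage trees; the function names and error handling are hypothetical:

```rust
use std::collections::BTreeMap;

/// Stand-ins for the stream tree and a pipeline's instance tree.
type StreamTree = BTreeMap<u64, Vec<u8>>;   // offset -> serialized event
type PipelineTree = BTreeMap<u64, Vec<u8>>; // instance offset -> root event copy

/// Option 1: on pipeline instance creation, copy the root event into the
/// pipeline's own tree. In the real system both writes would go in one atomic
/// storage batch, so the copy can never be lost to a crash between the writes,
/// and later stream compaction can't delete data the pipeline still needs.
fn create_pipeline_instance(
    stream: &StreamTree,
    pipeline: &mut PipelineTree,
    offset: u64,
) -> Result<(), String> {
    let root = stream
        .get(&offset)
        .ok_or_else(|| format!("root event at offset {offset} already compacted"))?;
    pipeline.insert(offset, root.clone());
    Ok(())
}

/// Option 2: before truncating `..=through`, clamp the range so that no
/// pipeline's lowest outstanding offset falls inside it.
fn safe_truncation_bound(through: u64, min_pipeline_offset: Option<u64>) -> u64 {
    match min_pipeline_offset {
        Some(min) if min <= through => min.saturating_sub(1),
        _ => through,
    }
}
```

The transactional copy keeps the compaction path simple at the cost of duplicating root events, while the offsets check avoids the copy but couples compaction to pipeline progress.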