Open elon0823 opened 2 years ago
This is my proposal. I think each piece of data has a period (lifetime) that starts at its creation time. Can that time be used as our deletion point?
For this work, we need an additional configuration value for the `data-period`.
I think using a `data-period` as the deletion strategy is convenient, but it could be dangerous in cases such as a massive amount of data being published before the `data-period` has elapsed. Just to distinguish old data, using the `offset` is enough.
However, the `data-period` seems useful for grouping data by age, so that the broker can delete the records with the oldest `data-period` when the disk is almost full.
I had considered appending the data-period to the value of the record, but I think it is much more convenient to add a new column family for the data-period when it comes to finding expired records.
Add a new column family (`ExpCF`) for the data-period, with the row-key designed as below.
byte-expression : [exp-timestamp][topic-name][fragment-id][offset]
byte-length : 8(uint64) any 1(uint8) 8(uint64)
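To make the layout concrete, here is a minimal sketch of how such a row-key could be packed and unpacked. The function names `encode_exp_key` and `decode_exp_key` are hypothetical, and big-endian encoding and UTF-8 topic names are my assumptions (big-endian makes the byte-wise key order match the numeric order of `exp-timestamp`). Because the topic name is the only variable-length field, it can be recovered as whatever sits between the fixed 8-byte head and the fixed 9-byte tail.

```python
import struct

# Assumed layout, following the row-key description above:
# [exp-timestamp: 8 bytes][topic-name: variable][fragment-id: 1 byte][offset: 8 bytes]
# Big-endian is an assumption so that byte-wise ordering equals numeric ordering.


def encode_exp_key(exp_timestamp: int, topic: str, fragment_id: int, offset: int) -> bytes:
    """Pack the four fields into a single ExpCF row-key (hypothetical helper)."""
    return (
        struct.pack(">Q", exp_timestamp)
        + topic.encode("utf-8")
        + struct.pack(">BQ", fragment_id, offset)
    )


def decode_exp_key(key: bytes):
    """Split an ExpCF row-key back into its fields (hypothetical helper)."""
    (exp_timestamp,) = struct.unpack(">Q", key[:8])
    fragment_id, offset = struct.unpack(">BQ", key[-9:])
    topic = key[8:-9].decode("utf-8")  # everything between head and tail
    return exp_timestamp, topic, fragment_id, offset
```

With this encoding, a key with an earlier `exp-timestamp` always sorts before a key with a later one, regardless of topic name.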
When the broker receives a record with a `data-period`, it calculates the timestamp of the expiration date (`exp-timestamp`) from the `data-period`.
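As a rough sketch of that calculation: the unit of the period (seconds here) and the base time (the broker's receive time) are both assumptions on my part, since neither is decided yet, and `compute_exp_timestamp` is a hypothetical name.

```python
import time
from typing import Optional

UINT64_MAX = 2**64 - 1  # exp-timestamp is stored as uint64 in the row-key


def compute_exp_timestamp(data_period_s: int, received_at_s: Optional[int] = None) -> int:
    """Return the expiration timestamp for a record (hypothetical helper).

    data_period_s: assumed to be a duration in seconds.
    received_at_s: assumed base time; defaults to the broker's current time.
    """
    if data_period_s < 0:
        raise ValueError("data-period must not be negative")
    if received_at_s is None:
        received_at_s = int(time.time())
    exp = received_at_s + data_period_s
    if exp > UINT64_MAX:
        raise OverflowError("exp-timestamp does not fit in uint64")
    return exp
```

Validating the period at this point would also be a natural place to enforce a maximum value once one is agreed on.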
Then the expiration detector iterates the `ExpCF` column family, which is ordered by expiration date, from the start and easily finds the sequence of records to erase.
Finding each record to delete then takes O(1) extra work, because the [topic-name][fragment-id][offset] part of the `ExpCF` key is exactly the row-key of the record, so no search is needed.
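The detector described above can be sketched with in-memory stand-ins: a sorted list plays the role of the `ExpCF` column family and a dict plays the role of the record column family. This is not the real storage API. The class `Store`, its methods, the big-endian encoding, and the rule that a record is expired once `exp-timestamp <= now` are all assumptions for illustration.

```python
import bisect
import struct


def record_key(topic: str, fragment_id: int, offset: int) -> bytes:
    # [topic-name][fragment-id][offset], the row-key of the record itself
    return topic.encode("utf-8") + struct.pack(">BQ", fragment_id, offset)


def exp_key(exp_ts: int, topic: str, fragment_id: int, offset: int) -> bytes:
    # [exp-timestamp] prefix followed by the record row-key
    return struct.pack(">Q", exp_ts) + record_key(topic, fragment_id, offset)


class Store:
    """In-memory stand-in for the two column families (hypothetical)."""

    def __init__(self):
        self.exp_cf = []   # sorted ExpCF keys, ordered by expiration time
        self.records = {}  # record row-key -> value

    def put(self, topic, fragment_id, offset, value, exp_ts):
        self.records[record_key(topic, fragment_id, offset)] = value
        bisect.insort(self.exp_cf, exp_key(exp_ts, topic, fragment_id, offset))

    def delete_expired(self, now: int) -> int:
        """Delete every record whose exp-timestamp is <= now; return the count."""
        # Every key with an 8-byte prefix <= now sorts before pack(now + 1).
        end = bisect.bisect_left(self.exp_cf, struct.pack(">Q", now + 1))
        expired = self.exp_cf[:end]
        for key in expired:
            # Stripping the 8-byte timestamp leaves the record row-key,
            # so each record is removed by a direct lookup.
            self.records.pop(key[8:], None)
        del self.exp_cf[:end]
        return len(expired)
```

The scan stops at the first key that has not expired yet, so a run of the detector touches only expired entries.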
In addition, the `data-period` field should be added to the `publish` message. What do you think the maximum and default data period should be? @1dennispark
Change terminology: `data-period` to `retention-period`
Is it okay to delete the oldest records when the disk is full? We should consider additional policies for data-period to delete records properly when the system's disk is full. (Such as limiting the total number of records that can be stored per topic)