Support record rotation for overwriting old records

elon0823 commented 2 years ago

Currently, topic records only grow and are never removed.
Old records should be deleted or overwritten with new records.

1dennispark commented 2 years ago

Here is my proposal. I think each piece of data has a valid period that starts at its creation time. Could we use that time as the deletion point? For this, we would need an additional configuration value for the data-period.

elon0823 commented 2 years ago

I think using a data-period as the deletion strategy is convenient, but it could be dangerous, for example when a massive amount of data is published before the data-period has elapsed. Just to distinguish old data, the offset is enough.

However, the data-period does seem useful for grouping data by age, so that the broker can delete the records with the oldest data-period when the disk is almost full.
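A minimal sketch of this grouping idea, assuming records are already bucketed by an expiration timestamp derived from the data-period. The names `evict_oldest_groups`, `used_bytes`, and `limit_bytes` are hypothetical and only illustrate the policy; they are not part of the broker.

```python
def evict_oldest_groups(groups, used_bytes, limit_bytes):
    """Pick whole age groups to delete, oldest first, until usage fits.

    groups: dict mapping an expiration timestamp to a list of
            (record_key, size_in_bytes) pairs (hypothetical layout).
    Returns the record keys to delete and the resulting disk usage.
    """
    to_delete = []
    for exp_ts in sorted(groups):          # oldest group first
        if used_bytes <= limit_bytes:      # enough space has been reclaimed
            break
        for record_key, size in groups[exp_ts]:
            to_delete.append(record_key)
            used_bytes -= size
    return to_delete, used_bytes
```

Deleting a whole group at a time keeps the policy simple; a finer-grained variant could stop in the middle of a group once usage drops below the limit.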

elon0823 commented 2 years ago

New column family for data-period

I had considered appending the data-period to the value of the record, but I think adding a new column family for the data-period is much more convenient when it comes to finding expired records.

Proposed Idea

Add a new column family (ExpCF) for the data-period, with the row-key designed as below.

byte-expression : [exp-timestamp][topic-name][fragment-id][offset]
byte-length     :      8(uint64)      any      1(uint8)   8(uint64)

When the broker receives a record with a data-period, it calculates the expiration timestamp (exp-timestamp) from the data-period. The expiration detector then iterates the ExpCF column family, which is ordered by expiration date, from the start and easily finds the sequence of records to erase. Finding each record to delete takes O(1) time, because [topic-name][fragment-id][offset] is the row-key of each record.
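The row-key layout and the scan can be sketched roughly as follows. This is only an illustration under some assumptions: integers are encoded big-endian so that byte order matches numeric order, the timestamp unit is left to the caller, and a sorted Python list stands in for the ordered column family. The helper names (`encode_exp_key`, `decode_exp_key`, `record_key`, `find_expired`) are hypothetical.

```python
import struct

def encode_exp_key(exp_ts, topic, fragment_id, offset):
    """[exp-timestamp: 8][topic-name: any][fragment-id: 1][offset: 8]"""
    return (struct.pack(">Q", exp_ts)
            + topic.encode("utf-8")
            + struct.pack(">BQ", fragment_id, offset))

def decode_exp_key(key):
    # The first 8 and the last 9 bytes are fixed-size, so the topic
    # name is whatever lies between them.
    (exp_ts,) = struct.unpack(">Q", key[:8])
    topic = key[8:-9].decode("utf-8")
    fragment_id, offset = struct.unpack(">BQ", key[-9:])
    return exp_ts, topic, fragment_id, offset

def record_key(topic, fragment_id, offset):
    """Row-key of the record itself: [topic-name][fragment-id][offset]."""
    return topic.encode("utf-8") + struct.pack(">BQ", fragment_id, offset)

def find_expired(sorted_exp_keys, now):
    """Scan keys in expiration order and stop at the first live one."""
    expired = []
    for key in sorted_exp_keys:
        exp_ts, topic, fragment_id, offset = decode_exp_key(key)
        if exp_ts > now:
            break  # every later key expires even later
        expired.append((key, record_key(topic, fragment_id, offset)))
    return expired
```

Each returned pair holds the ExpCF key and the matching record row-key, so both entries can be deleted together.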

In addition, a data-period field should be added to the publish message. What do you think the maximum and default data-period should be? @1dennispark

elon0823 commented 2 years ago

Change terminology: data-period to retention-period.

elon0823 commented 2 years ago

Is it okay to delete the oldest records when the disk is full? We should consider additional policies alongside the data-period so that records are deleted properly when the system's disk is full (such as limiting the total number of records that can be stored per topic).
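As one possible shape for such a limit, here is a small sketch that computes which offsets to drop when a cap on the number of stored records is exceeded. It assumes offsets within a fragment are contiguous and increasing; `offsets_to_trim` and `max_records` are hypothetical names, not existing settings.

```python
def offsets_to_trim(first_offset, last_offset, max_records):
    """Return the oldest offsets that exceed the record cap.

    Assumes offsets first_offset..last_offset are all present.
    """
    count = last_offset - first_offset + 1
    excess = max(0, count - max_records)
    return range(first_offset, first_offset + excess)
```

For example, with offsets 0 through 9 stored and a cap of 4, the six oldest offsets (0 through 5) would be selected for deletion.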

paust-team / pirius

Support record rotation for overwriting old records #186
