redpanda-data / redpanda

Redpanda is a streaming data platform for developers. Kafka API compatible. 10x faster. No ZooKeeper. No JVM!
https://redpanda.com

Add support for max.compaction.lag.ms #998

Open weeco opened 3 years ago

weeco commented 3 years ago

Kafka 2.3.0 (KIP-354) added the per-topic or per-broker/cluster option `max.compaction.lag.ms`, which makes it easy to ensure compaction of topics that contain GDPR-sensitive data. As far as I can see, Redpanda does not support this yet. Would it be possible to add it?
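For reference, in Kafka this is set per topic with the stock tooling (the topic name `customers` is just an illustration):

```shell
# Set max.compaction.lag.ms per topic (here 1 day = 86400000 ms).
# The broker-wide equivalent from KIP-354 is log.cleaner.max.compaction.lag.ms.
kafka-configs.sh --bootstrap-server localhost:9092 \
  --alter --entity-type topics --entity-name customers \
  --add-config max.compaction.lag.ms=86400000
```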

JIRA Link: CORE-604

emaxerrno commented 3 years ago

Our compaction runs continuously in the background; it doesn't stop, and deleting a segment takes 50 microseconds, which would be the lag.ms. Anything else that we need to do?

emaxerrno commented 3 years ago

If resolved, can you turn this ticket into a discussion?

weeco commented 3 years ago

After a quick discussion with Alex in Slack (https://vectorizedcommunity.slack.com/archives/C01AJDUT88N/p1617137306152000) here's the summary:

Kafka also marks full segments as ready for compaction. Kafka is pessimistic about the log: if the log's first message is older than `max.compaction.lag.ms`, the segment will be marked for compaction. This is handy to ensure that uncompacted messages older than `max.compaction.lag.ms` will definitely be compacted, instead of relying on the dirty ratio.
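A sketch of that eligibility rule (my own simplification with hypothetical names, not Kafka's internal code): a segment becomes eligible for compaction either via the dirty ratio or, once `max.compaction.lag.ms` is set, as soon as its earliest record exceeds the lag.

```python
from typing import Optional

def must_compact(first_record_ts_ms: int, now_ms: int,
                 dirty_ratio: float, min_dirty_ratio: float = 0.5,
                 max_compaction_lag_ms: Optional[int] = None) -> bool:
    """Illustrative sketch of the KIP-354 semantics, not Kafka source code.

    Without max.compaction.lag.ms, only the dirty ratio triggers compaction.
    With it, a segment whose first record is older than the lag is compacted
    regardless of how dirty the log is.
    """
    if dirty_ratio >= min_dirty_ratio:
        return True
    if max_compaction_lag_ms is not None:
        return now_ms - first_record_ts_ms > max_compaction_lag_ms
    return False

# A barely-dirty log is skipped without the setting...
assert not must_compact(first_record_ts_ms=0, now_ms=10_000, dirty_ratio=0.1)
# ...but compacted once its oldest record exceeds the configured lag.
assert must_compact(first_record_ts_ms=0, now_ms=10_000, dirty_ratio=0.1,
                    max_compaction_lag_ms=5_000)
```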

Imagine the following use case: you have a customers topic that is compacted without any retention, keyed by customer UUID. Whenever a customer sends a GDPR request to be deleted, we have to ensure that all data belonging to that customer is gone. So we send a tombstone for that UUID on the customers topic, but now we have to ensure that Kafka will actually compact this message away in time.
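To make the mechanics concrete, here is a toy in-memory model of what compaction eventually does with such a tombstone (an illustration of the Kafka log-compaction contract, not Redpanda code; `None` stands for a tombstone value):

```python
def compact(log):
    """Keep only the latest value per key; drop keys whose latest value
    is a tombstone (None), as log compaction eventually does."""
    latest = {}
    for key, value in log:  # later records win
        latest[key] = value
    return [(k, v) for k, v in latest.items() if v is not None]

log = [
    ("customer-a", {"email": "a@example.com"}),
    ("customer-b", {"email": "b@example.com"}),
    ("customer-a", None),  # GDPR delete: tombstone for customer-a
]
assert compact(log) == [("customer-b", {"email": "b@example.com"})]
```

In real Kafka, `max.compaction.lag.ms` is what bounds how long the superseded record can survive before the cleaner removes it.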

tchiotludo commented 2 years ago

Just another use case for this option, with Kafka Streams:

In a high-load environment, without this configuration the application does not work as expected, because compaction runs before we have time to read the data. IMO this configuration is mandatory in order to handle compacted topics properly, and also IMO the default for topics created by Kafka Streams should not be 0.

tchiotludo commented 1 year ago

Any update here? The lack of this setting prevents most Kafka Streams applications with a strict delete flag from working. Since Redpanda aims for 1:1 API compatibility, it's a must-have.

mattschumpert commented 1 year ago

@tchiotludo, from the use case above it sounds like you actually need a MIN lag (`min.compaction.lag.ms`):

But we need to be sure that we have processed all the data, so we add a default configuration to keep all the data for at least 1 day (that will be our maximum application downtime; if we don't process the data within the day, we lose it to compaction).
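If that is the requirement, Kafka expresses it with `min.compaction.lag.ms` rather than the max variant, e.g. (topic name is illustrative):

```shell
# Guarantee records stay uncompacted for at least 1 day, so slow consumers
# can still read them before compaction removes superseded values.
kafka-configs.sh --bootstrap-server localhost:9092 \
  --alter --entity-type topics --entity-name customers \
  --add-config min.compaction.lag.ms=86400000
```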

mattschumpert commented 1 year ago

cc @michael-redpanda

michael-redpanda commented 1 year ago

Going to add @dotnwat to this thread. I think this is in his team's bucket, but will discuss