redpanda-data / redpanda

Redpanda is a streaming data platform for developers. Kafka API compatible. 10x faster. No ZooKeeper. No JVM!
https://redpanda.com

storage: overly eager local retention on upgrades to v22.3 #10739

Closed: VladLazar closed this issue 7 months ago

VladLazar commented 1 year ago

Version & Environment

Redpanda version (use `rpk version`): v22.3

What went wrong?

Local retention kicked in after an upgrade to v22.3 (from v22.2) and deleted some local segments, even though the starting retention.bytes|ms settings did not call for it (on upgrade, retention.bytes|ms is migrated to retention.local.target.bytes|ms if cloud storage is enabled).
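To make the intended upgrade behaviour concrete, here is a minimal sketch of that migration step. The struct and function names are hypothetical simplifications (the real fields live in Redpanda's topic properties and the real logic is in src/v/features/migrators/cloud_storage_config.cc); the point is that for a cloud-storage-enabled topic, the pre-upgrade total retention settings become the local retention targets:

```cpp
#include <optional>

// Hypothetical, simplified model of a topic's retention overrides
// (values in bytes / milliseconds). Not the real Redpanda struct.
struct retention_settings {
    std::optional<long> retention_bytes;
    std::optional<long> retention_ms;
    std::optional<long> retention_local_target_bytes;
    std::optional<long> retention_local_target_ms;
};

// Sketch of the v22.3 migration step: for a cloud-storage-enabled topic,
// the pre-upgrade retention settings become the local retention targets,
// so upgraded topics keep their old on-disk retention behaviour.
retention_settings migrate_retention(
  retention_settings s, bool cloud_storage_enabled) {
    if (!cloud_storage_enabled) {
        return s; // non-tiered topics are left untouched
    }
    s.retention_local_target_bytes = s.retention_bytes;
    s.retention_local_target_ms = s.retention_ms;
    return s;
}
```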

Retention settings are adjusted for all topics in the cluster by a migrator (src/v/features/migrators/cloud_storage_config.cc), a piece of code that runs on the controller leader as one of the final start-up steps. Notably, it requires a quorum on the controller log to run (otherwise it waits).

The problem is that other logs start up and apply retention before the migrator has had a chance to run. In this case, the default local retention (24h) is applied. In other words, there is a time window between when a given local log starts and when the migrator adjusts the retention configs, during which the local retention falls back to the default.
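The race can be seen in how the effective local retention would be resolved during that window. The function below is an illustrative sketch, not Redpanda's actual housekeeping code; the 24h default comes from the issue text, and everything else is an assumption:

```cpp
#include <optional>

// Assumed cluster default for local retention: 24 hours, in
// milliseconds (the issue states the default is 24h).
constexpr long default_local_retention_ms = 24L * 60 * 60 * 1000;

// Sketch: housekeeping resolves the effective local retention from the
// topic's local-retention override if one is set, otherwise from the
// cluster default. In the window before the migrator runs, an upgraded
// topic has no override yet, so the 24h default applies and
// already-uploaded local segments become eligible for deletion.
long effective_local_retention_ms(std::optional<long> topic_override_ms) {
    return topic_override_ms.value_or(default_local_retention_ms);
}
```

Once the migrator has run, the override is populated from the old retention.bytes|ms values and the default no longer applies.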

Note that this only happens when cloud storage is enabled for the topic. It is not data loss, because only data that has already been uploaded to cloud storage can be removed from the local log.

What should have happened instead?

Retention configs should have been migrated before local retention kicked in.

How to reproduce the issue?

It's tricky to reproduce since it requires a slow start-up. It should be reproducible by upgrading only one node and waiting for local retention to kick in; at that point the migration will not have run yet, because the cloud_retention feature is not active yet.

JIRA Link: CORE-1303

VladLazar commented 1 year ago

#10780 contains a test that reproduces this issue.