Version & Environment
Redpanda version: (use rpk version): v22.3

What went wrong?
Local retention kicked in after an upgrade to v22.3 (from v22.2) and deleted some local segments, although the starting retention.bytes|ms settings did not ask for it (on upgrade, retention.bytes|ms gets migrated to retention.local.target.bytes|ms if cloud storage is enabled).

Retention is adjusted for all topics in the cluster by a migrator (src/v/features/migrators/cloud_storage_config.cc). This is a piece of code that runs on the controller leader as one of the final steps during start-up. Notably, it requires a quorum of the controller log to run (otherwise it waits).
The problem was that other logs can start and apply retention before the migrator has had a chance to run. In that case, the default local retention (24h) is applied. In other words, between the moment a given local log starts and the moment the migrator adjusts the retention configs, there is a window during which local log retention falls back to the default.
Note that this only happens when cloud storage is enabled for the topic. It's not data loss because only data that has been previously uploaded can be removed from the local log.
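For clarity, here is a minimal, hypothetical sketch of the config mapping described above: for a cloud-storage-enabled topic, the pre-upgrade retention.bytes|ms values become the local retention targets. The struct and function names are invented for illustration and are not the actual types used in cloud_storage_config.cc.

```cpp
#include <cstdint>
#include <iostream>
#include <optional>

// Simplified, hypothetical topic configuration. The real properties live in
// Redpanda's topic configuration types; these names are for illustration only.
struct topic_config {
    bool cloud_storage_enabled{false};
    std::optional<int64_t> retention_bytes;
    std::optional<int64_t> retention_ms;
    std::optional<int64_t> retention_local_target_bytes;
    std::optional<int64_t> retention_local_target_ms;
};

// Conceptual equivalent of the upgrade migration described above: for
// cloud-storage-enabled topics, the old total-retention settings become the
// local retention targets.
void migrate_retention(topic_config& cfg) {
    if (!cfg.cloud_storage_enabled) {
        return; // nothing to migrate for topics without cloud storage
    }
    cfg.retention_local_target_bytes = cfg.retention_bytes;
    cfg.retention_local_target_ms = cfg.retention_ms;
}

int main() {
    topic_config cfg;
    cfg.cloud_storage_enabled = true;
    cfg.retention_ms = int64_t{7} * 24 * 60 * 60 * 1000; // 7 days, pre-upgrade

    migrate_retention(cfg);

    std::cout << "retention.local.target.ms = "
              << cfg.retention_local_target_ms.value_or(-1) << '\n';
    return 0;
}
```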
What should have happened instead?
Retention configs should migrate before local retention kicks in.
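As a rough illustration of that ordering, and only as a sketch with invented names (not the actual fix, nor Redpanda's real interfaces), local retention for cloud-storage-enabled topics could be held back until the migration has completed:

```cpp
#include <iostream>

// Hypothetical upgrade/migration state; invented for illustration, not
// Redpanda's actual feature table.
struct upgrade_state {
    bool retention_migration_done{false};
};

// One possible ordering rule: for a cloud-storage-enabled topic, skip local
// retention until the upgrade migration has adjusted its retention configs,
// so the 24h default never gets applied in the interim.
bool may_apply_local_retention(const upgrade_state& st, bool cloud_storage_enabled) {
    if (cloud_storage_enabled && !st.retention_migration_done) {
        return false;
    }
    return true;
}

int main() {
    upgrade_state st;
    std::cout << std::boolalpha;
    std::cout << may_apply_local_retention(st, true) << '\n'; // false: hold off
    st.retention_migration_done = true;
    std::cout << may_apply_local_retention(st, true) << '\n'; // true: safe now
    return 0;
}
```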
How to reproduce the issue?
It's tricky to repro since it requires a slow start-up. It should repro by upgrading only one node and waiting for local retention to kick in. At that point the migration will not have run yet because the cloud_retention feature is not yet active.

JIRA Link: CORE-1303