redpanda-data / redpanda

cloud_storage: bucket scrub #9072

Open jcsp opened 1 year ago

jcsp commented 1 year ago

By design, Redpanda will sometimes leave orphan objects in its object storage bucket. This happens when a node writes a segment, but then unexpectedly loses leadership before it can update the manifest. We do our best to avoid it (https://github.com/redpanda-data/redpanda/pull/8560) but it will happen from time to time.
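To make the failure mode concrete, here is a minimal sketch of orphan detection: take the bucket listing and subtract the keys that a manifest references. The function name, the key names, the timestamps, and the grace period are all assumptions for illustration; this is not Redpanda's manifest format or its scrubber code.

```python
def find_orphans(bucket_objects, manifest_keys, now, grace_seconds):
    """Return keys that are in the bucket but not referenced by a manifest.

    bucket_objects maps key -> upload time (seconds). Objects younger than
    grace_seconds are skipped, since their manifest update may still be
    in flight (an assumed safeguard, not a documented Redpanda behaviour).
    """
    referenced = set(manifest_keys)
    return sorted(
        key
        for key, uploaded_at in bucket_objects.items()
        if key not in referenced and now - uploaded_at >= grace_seconds
    )


# Hypothetical key names; real segment paths look different.
bucket = {
    "seg-0.log": 0,
    "seg-1.log": 0,
    "seg-2.log": 100,  # uploaded, but leadership was lost before the manifest update
    "seg-3.log": 990,  # uploaded very recently, manifest update may be pending
}
manifest = ["seg-0.log", "seg-1.log"]

orphans = find_orphans(bucket, manifest, now=1000, grace_seconds=60)
print(orphans)
```

With these inputs only `seg-2.log` is reported: `seg-3.log` is also unreferenced, but it is still inside the grace window.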

Like any storage system, to ensure good data hygiene over long storage periods, Redpanda needs a data scrubbing feature. This can be more or less extensive depending on the needs of a given system:

The most extensive scrubbing is probably only useful on less-trusted object stores (e.g. if someone uses MinIO with its basic filesystem backend) -- there is less value in scrubbing a more highly trusted backend like AWS S3.

JIRA Link: CORE-1177

jcsp commented 1 year ago

There's a functional draft of updating the scrubber to clean up orphan segments here: https://github.com/redpanda-data/redpanda/tree/orphan-cleanup

pmw-rp commented 2 months ago

We should ensure this can be disabled for customers who prefer to keep their buckets immutable.
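One way this could look, as a rough sketch: the scrubber always reports what it finds, and only deletes when deletion is explicitly enabled. The `delete_enabled` flag and the `scrub` function are hypothetical names, not an existing Redpanda config property or API.

```python
def scrub(orphans, delete_object, delete_enabled=False):
    """Report orphan keys; delete them only when explicitly enabled.

    delete_object is a caller-supplied callable that removes one key.
    With delete_enabled=False the bucket is never mutated.
    """
    deleted = []
    for key in orphans:
        if delete_enabled:
            delete_object(key)
            deleted.append(key)
    return {"found": list(orphans), "deleted": deleted}


# A set stands in for the bucket here.
store = {"seg-0.log", "seg-2.log"}

report_only = scrub(["seg-2.log"], store.discard)
store_after_report = set(store)

with_delete = scrub(["seg-2.log"], store.discard, delete_enabled=True)
print(report_only, with_delete, store)
```

Defaulting to report-only means an immutable bucket stays untouched unless an operator opts in.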