redpanda-data / redpanda

Redpanda is a streaming data platform for developers. Kafka API compatible. 10x faster. No ZooKeeper. No JVM!
https://redpanda.com

When shadow indexing is enabled don't upload all previous data by default #2918

Open rkruze opened 2 years ago

rkruze commented 2 years ago

Who is this for, and what problem do they have today?

When you enable shadow indexing, Redpanda goes back and uploads all existing data in the cluster to S3/GCS. This might have a significant performance impact on the cluster. By default, we should upload only data sent to the cluster after shadow indexing is enabled, unless a parameter is set telling Redpanda to upload all previous data to S3/GCS.
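
To make the proposed default concrete, here is a minimal sketch of the selection logic, not Redpanda's actual implementation. Everything in it is an assumption made for illustration: the `Segment` type, the `enabled_at_ms` timestamp, and the `upload_historical` flag are hypothetical names and do not correspond to any existing Redpanda type or configuration property.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Segment:
    """A closed log segment. Hypothetical type, not Redpanda's own."""
    base_offset: int
    max_timestamp_ms: int  # timestamp of the newest record in the segment


def segments_to_upload(
    segments: List[Segment],
    enabled_at_ms: int,
    upload_historical: bool = False,  # hypothetical opt-in parameter
) -> List[Segment]:
    """Return the segments eligible for upload to S3/GCS.

    By default, only segments holding data written at or after the moment
    shadow indexing was enabled are selected. Setting upload_historical
    restores the "upload everything" behavior.
    """
    if upload_historical:
        return list(segments)
    return [s for s in segments if s.max_timestamp_ms >= enabled_at_ms]


if __name__ == "__main__":
    log = [Segment(0, 100), Segment(10, 200), Segment(20, 300)]
    print(segments_to_upload(log, enabled_at_ms=200))
    print(segments_to_upload(log, enabled_at_ms=200, upload_historical=True))
```

In this sketch, a segment that straddles the enable time is uploaded in full, which is one possible choice. Cutting at an exact offset would need extra bookkeeping.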

What are the success criteria?

When shadow indexing is enabled, only new data is uploaded to S3/GCS.

Why is solving this problem impactful?

If we upload all data by default, we could degrade cluster performance, because the upload reads a large amount of data from the data volume.

JIRA Link: CORE-783

emaxerrno commented 2 years ago

I don't understand how this makes sense. I may move this to a discussion later, but I want to think.

It's fine to upload a TB. If you care about data archival, why wouldn't you care about historical data? I'm not sure it's worth the complexity overhead in the codebase.

emaxerrno commented 2 years ago

I'm moving this out of Shadow Indexing GA until I understand the impact.

jcsp commented 1 year ago

I don't think we should do this. Reasons: