When shadow indexing is enabled, don't upload all previous data by default

rkruze commented 2 years ago

Who is this for, and what problem do they have today?

When you enable shadow indexing, Redpanda goes back and uploads all the data already in the cluster to S3/GCS. This might have a significant performance impact on the cluster. By default, we should only upload data sent to the cluster after shadow indexing is enabled, unless a parameter is set telling Redpanda to upload all previous data to S3/GCS.
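To make the requested default concrete, here is a minimal sketch of how the first offset to upload could be chosen. The function name, the `enable_offset` argument, and the `upload_existing_data` parameter are all hypothetical and are not existing Redpanda configuration; this only illustrates the behavior being asked for.

```python
def first_offset_to_upload(
    log_start_offset: int,
    enable_offset: int,
    upload_existing_data: bool = False,  # hypothetical opt-in parameter
) -> int:
    """Return the first offset that would be eligible for upload.

    log_start_offset: first offset currently held in the local log.
    enable_offset: next offset to be written at the moment shadow
        indexing was enabled (hypothetical bookkeeping value).
    """
    if upload_existing_data:
        # Opt-in: upload the whole local history, which is today's behavior.
        return log_start_offset
    # Proposed default: skip everything written before enablement.
    return max(log_start_offset, enable_offset)
```

With a log starting at offset 0 and shadow indexing enabled at offset 1000, the default would start uploading at 1000, and the opt-in would start at 0.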

What are the success criteria?

When shadow indexing is enabled, only new data is uploaded to S3/GCS.

Why is solving this problem impactful?

If we upload all data by default, we could potentially impact the cluster as we will be reading a large amount of data from the data volume.

JIRA Link: CORE-783

emaxerrno commented 2 years ago

I don't understand how this makes sense. I may move this to a discussion later, but I want to think.

It's fine to upload a TB. If you care about data archival, why wouldn't you care about historical data? I'm not sure it's worth the complexity overhead in the codebase.

emaxerrno commented 2 years ago

I'm moving this out of Shadow Indexing GA until I understand the impact.

jcsp commented 1 year ago

I don't think we should do this. Reasons:

We already have to transfer full partition histories sometimes (e.g. decoms, node adds). At a high level, the size of data on local disk should not be thought of as "too big to make another copy on config changes".
Once we have DeleteRecords (https://github.com/redpanda-data/redpanda/issues/2648) that will provide a straightforward flow for users who want to switch to tiered storage but don't care about retaining data from before a certain point: they can just snip off their log before enabling.
Enforcing local retention policy becomes complex/undefined: if someone enables tiered storage, do we try and keep the local data for the full retention period if it's not in the range uploaded to object storage? Or does local data become eligible for purging at will as is generally the case for tiered storage topics? We could define a behavior here, but it feels really unlikely to be fully understood by users.
Applying policy to manage disk space (https://github.com/redpanda-data/redpanda/issues/6438) becomes complex when some tiered storage topics have "special" local data that isn't replicated to object storage.
Building tooling (e.g. in Console) on the admin API (https://github.com/redpanda-data/redpanda/issues/7920) to report on what data is/isn't in tiered storage becomes more complex if the mapping is more complex than "all data prior to offset X is in object storage, all data after offset X is just on local storage".
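To illustrate the last point, a rough sketch of the two mappings a reporting tool would have to handle. The names (`Location`, `locate_simple`, `locate_with_skipped_history`, `upload_start`, `uploaded_up_to`) are made up for this example and are not part of the admin API. It follows the simplified model in the sentence above, where data is either in object storage or only on local disk.

```python
from enum import Enum


class Location(Enum):
    OBJECT_STORAGE = "object storage"
    LOCAL_ONLY = "local only"


def locate_simple(offset: int, uploaded_up_to: int) -> Location:
    """Current model: one boundary X (uploaded_up_to) describes everything."""
    if offset < uploaded_up_to:
        return Location.OBJECT_STORAGE
    return Location.LOCAL_ONLY


def locate_with_skipped_history(
    offset: int, upload_start: int, uploaded_up_to: int
) -> Location:
    """Proposed model: history before upload_start was never uploaded,
    so two boundaries are needed and local-only data appears on both sides."""
    if offset < upload_start:
        return Location.LOCAL_ONLY  # "special" pre-enablement data
    if offset < uploaded_up_to:
        return Location.OBJECT_STORAGE
    return Location.LOCAL_ONLY  # recent data not yet uploaded
```

In the second function, an offset reported as local-only could be either recent data that will be uploaded soon or old data that never will be, which is the ambiguity that tooling and retention policy would both need to account for.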
