risingwavelabs / risingwave

Best-in-class stream processing, analytics, and management. Perform continuous analytics, or build event-driven applications, real-time ETL pipelines, and feature stores in minutes. Unified streaming and batch. PostgreSQL compatible.
https://go.risingwave.com/slack
Apache License 2.0

Support bucketed prefix for object in non-S3 object store backend #15667

Open hzxa21 opened 6 months ago

hzxa21 commented 6 months ago

In the S3 object store backend, we follow the best practice of hashing objects into 256 prefixes to enable AWS request-rate auto scaling. For non-S3 object store backends, we currently use a single prefix for all objects.
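For reference, a minimal Rust sketch of what hashed prefix bucketing looks like; the path layout, `.sst` suffix, and choice of hash function here are illustrative assumptions, not RisingWave's actual implementation:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Number of prefix buckets, matching the S3 best practice described above.
const NUM_PREFIX_BUCKETS: u64 = 256;

/// Derive a bucketed object path from an object id by hashing the id into one
/// of 256 prefixes, so writes spread evenly across the prefix key space.
fn bucketed_object_path(object_id: u64) -> String {
    let mut hasher = DefaultHasher::new();
    object_id.hash(&mut hasher);
    let bucket = hasher.finish() % NUM_PREFIX_BUCKETS;
    // The "<bucket>/<object_id>.sst" layout is illustrative only.
    format!("{bucket}/{object_id}.sst")
}

fn main() {
    // Consecutive object ids land in unrelated prefixes.
    for id in 0..4u64 {
        println!("{}", bucketed_object_path(id));
    }
}
```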

However, bucketed prefixes are known to be useful in some non-S3 backends as well.

There may be more examples. Currently, the highest priority among these backends is supporting bucketed prefixes in GCS, because we have recently seen a user workload hit the GCS single-prefix rate limit.

hzxa21 commented 6 months ago

@Li0k and I had an offline discussion on how to support bucketed prefixes and adjust the number of prefix buckets in general. Here is a summary.

### Requirements & Assumptions

### Design

Objects with ids in [0, 10) will all be assigned to a single prefix, objects in [10, 1000) will be hashed and distributed evenly across 64 prefixes, and objects in [2000, inf) will be hashed and distributed evenly across 256 prefixes.
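A minimal Rust sketch of how such a lookup could work, assuming the mapping is kept as a `start_object_id -> bucket count` map (all names and types here are illustrative, not the actual RisingWave code):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

/// Hypothetical `start_object_id -> number of prefix buckets` mapping,
/// e.g. {0: 1, 10: 64, 2000: 256} for the example above.
type PrefixBucketMapping = BTreeMap<u64, u64>;

/// Resolve the prefix bucket for an object id: take the mapping entry with the
/// largest `start_object_id` not exceeding the object id, then hash the id
/// into that entry's bucket count. An empty mapping returns `None`, meaning
/// the caller keeps the old single-prefix behavior.
fn prefix_bucket(mapping: &PrefixBucketMapping, object_id: u64) -> Option<u64> {
    let (_, &num_buckets) = mapping.range(..=object_id).next_back()?;
    let mut hasher = DefaultHasher::new();
    object_id.hash(&mut hasher);
    Some(hasher.finish() % num_buckets)
}

fn main() {
    let mapping: PrefixBucketMapping =
        [(0, 1), (10, 64), (2000, 256)].into_iter().collect();
    assert_eq!(prefix_bucket(&mapping, 5), Some(0)); // single-prefix tier
    println!("{:?}", prefix_bucket(&mapping, 12_345)); // one of 256 prefixes
    assert_eq!(prefix_bucket(&BTreeMap::new(), 42), None); // old behavior
}
```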



- For simplicity, CNs and compactors will only read this mapping on startup; runtime reload is not supported. Future reads/writes on objects will follow the mapping as specified above.
- For new clusters created after this feature lands, the mapping is initialized with `start_object_id=0` and a reasonable number of prefix buckets (e.g. 0 -> 256), so new clusters get the benefits without going through the upgrade steps.
- For existing clusters created on a non-S3 backend prior to this feature, the mapping is initialized to be empty, which preserves the old behavior. Users/DevOps need to follow the upgrade steps below to adjust the number of prefix buckets, because it is only safe to update the mapping when there are no ongoing reads/writes in the cluster. For simplicity, we decided to require manual intervention here.

### Upgrade steps
1. Take a [meta backup](https://docs.risingwave.com/docs/current/meta-backup/) of the cluster.
2. Upgrade the image for all nodes (meta, CN, compactor) in the cluster.
3. Pick a suitable time to block object store reads/writes from CNs and compactors. This can be done by simply killing all CNs and compactors at once.
4. Issue an `ALTER` command for the mapping. On receiving this command, meta will use the largest object id in HummockVersion as the `start_object_id` and write a new entry into the mapping. This is safe because HummockVersion will not change (due to step 3). See the sketch after these steps.
5. Restart the CNs and compactors.

Ideally, steps 3, 4, and 5 should be completed quickly to minimize cluster unavailability.
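For step 4, a hedged sketch of the mapping update on the meta side; the function name, parameters, and in-memory representation of the mapping are assumptions for illustration, and the real `ALTER` handling may differ:

```rust
use std::collections::BTreeMap;

/// Hypothetical handler for the `ALTER` command in step 4: record the largest
/// object id in the (frozen) HummockVersion as the `start_object_id` of a new
/// mapping entry. This is only safe while all CNs and compactors are stopped
/// (step 3), since HummockVersion cannot change underneath the update.
fn alter_prefix_buckets(
    mapping: &mut BTreeMap<u64, u64>,
    max_object_id_in_hummock_version: u64,
    new_num_buckets: u64,
) {
    mapping.insert(max_object_id_in_hummock_version, new_num_buckets);
}
```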
hzxa21 commented 2 months ago

Given that we don't see a strong need to migrate the prefix strategy for non-S3 backends, I don't think we will actually start working on the design for now. I will leave the issue open but remove it from the milestone until we see a strong need.