risingwavelabs / risingwave

Best-in-class stream processing, analytics, and management. Perform continuous analytics, or build event-driven applications, real-time ETL pipelines, and feature stores in minutes. Unified streaming and batch. PostgreSQL compatible.
https://go.risingwave.com/slack
Apache License 2.0

Support bucketed prefix for object in non-S3 object store backend #15667

Open hzxa21 opened 6 months ago

hzxa21 commented 6 months ago

In the S3 object store backend, we follow the best practice of hashing objects into 256 prefixes to enable AWS request-rate auto scaling. For non-S3 object store backends, we currently use a single prefix for all objects.
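For reference, a minimal Rust sketch of what hashed prefix bucketing looks like; the path layout, `.sst` suffix, and choice of hash function here are illustrative assumptions, not RisingWave's actual implementation:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Number of prefix buckets, matching the S3 best practice described above.
const NUM_PREFIX_BUCKETS: u64 = 256;

/// Derive a bucketed object path from an object id by hashing the id into one
/// of 256 prefixes, so writes spread evenly across the prefix key space.
fn bucketed_object_path(object_id: u64) -> String {
    let mut hasher = DefaultHasher::new();
    object_id.hash(&mut hasher);
    let bucket = hasher.finish() % NUM_PREFIX_BUCKETS;
    // The "<bucket>/<object_id>.sst" layout is illustrative only.
    format!("{bucket}/{object_id}.sst")
}

fn main() {
    // Consecutive object ids land in unrelated prefixes.
    for id in 0..4u64 {
        println!("{}", bucketed_object_path(id));
    }
}
```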

However, bucketed prefixes are known to be useful in some non-S3 backends as well.

There may be more examples. Currently, the highest priority among these backends is supporting bucketed prefixes in GCS, because we have recently seen a user workload hit the GCS single-prefix rate limit.

hzxa21 commented 6 months ago

@Li0k and I had an offline discussion on how to support bucketed prefixes and adjust the number of prefix buckets in general. Here is a summary.

### Requirements & Assumptions

### Design

Objects with ids in [0, 10) will all be assigned to a single prefix, objects in [10, 1000) will be hashed and distributed evenly across 64 prefixes, and objects in [2000, inf) will be hashed and distributed evenly across 256 prefixes.
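A minimal Rust sketch of how such a lookup could work, assuming the mapping is kept as a `start_object_id -> bucket count` map (all names and types here are illustrative, not the actual RisingWave code):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

/// Hypothetical `start_object_id -> number of prefix buckets` mapping,
/// e.g. {0: 1, 10: 64, 2000: 256} for the example above.
type PrefixBucketMapping = BTreeMap<u64, u64>;

/// Resolve the prefix bucket for an object id: take the mapping entry with the
/// largest `start_object_id` not exceeding the object id, then hash the id
/// into that entry's bucket count. An empty mapping returns `None`, meaning
/// the caller keeps the old single-prefix behavior.
fn prefix_bucket(mapping: &PrefixBucketMapping, object_id: u64) -> Option<u64> {
    let (_, &num_buckets) = mapping.range(..=object_id).next_back()?;
    let mut hasher = DefaultHasher::new();
    object_id.hash(&mut hasher);
    Some(hasher.finish() % num_buckets)
}

fn main() {
    let mapping: PrefixBucketMapping =
        [(0, 1), (10, 64), (2000, 256)].into_iter().collect();
    assert_eq!(prefix_bucket(&mapping, 5), Some(0)); // single-prefix tier
    println!("{:?}", prefix_bucket(&mapping, 12_345)); // one of 256 prefixes
    assert_eq!(prefix_bucket(&BTreeMap::new(), 42), None); // old behavior
}
```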



- For simplicity, CNs and compactors will only read this mapping on startup; runtime reload is not supported. Future reads/writes on objects will follow the mapping as specified above.
- For new clusters created after this feature lands, the mapping is initialized with `start_object_id=0` and a reasonable number of prefix buckets (e.g. 0 -> 256), so new clusters get the benefits without going through the upgrade steps.
- For existing clusters created on a non-S3 backend prior to this feature, the mapping is initialized to be empty, which preserves the old behavior. Users/DevOps need to follow the upgrade steps below to adjust the number of prefix buckets, because it is only safe to update the mapping when there are no ongoing reads/writes in the cluster. For simplicity, we decided to require manual intervention here.

### Upgrade steps
1. Take a [meta backup](https://docs.risingwave.com/docs/current/meta-backup/) of the cluster.
2. Upgrade the image for all nodes (meta, CN, compactor) in the cluster.
3. Pick a suitable time to block object store reads/writes from CNs and compactors. This can be done by simply killing all CNs and compactors at once.
4. Issue an `ALTER` command for the mapping. On receiving this command, meta will use the largest object id in HummockVersion as the `start_object_id` and write a new entry into the mapping. This is safe because HummockVersion will not change (due to step 3). See the sketch after these steps.
5. Restart the CNs and compactors.

Ideally, steps 3, 4, and 5 should be completed quickly to minimize cluster unavailability.
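For step 4, a hedged sketch of the mapping update on the meta side; the function name, parameters, and in-memory representation of the mapping are assumptions for illustration, and the real `ALTER` handling may differ:

```rust
use std::collections::BTreeMap;

/// Hypothetical handler for the `ALTER` command in step 4: record the largest
/// object id in the (frozen) HummockVersion as the `start_object_id` of a new
/// mapping entry. This is only safe while all CNs and compactors are stopped
/// (step 3), since HummockVersion cannot change underneath the update.
fn alter_prefix_buckets(
    mapping: &mut BTreeMap<u64, u64>,
    max_object_id_in_hummock_version: u64,
    new_num_buckets: u64,
) {
    mapping.insert(max_object_id_in_hummock_version, new_num_buckets);
}
```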
hzxa21 commented 2 months ago

Given that we don't see a strong need to migrate the prefix strategy for non-S3 backends, I don't think we will actually start working on the design for now. I will leave the issue open but remove it from the milestone until we see a strong need.