Open itiyama opened 2 years ago
@adnapibar Could you please take a look, since this is related to the streaming index API?
With segment replication and remote storage, where failovers are slow, this optimization can be combined with adaptive shard selection based on latencies or shard availability.
@nknize Your ghost writer solution will address this problem too, right?
Automatic routing is one way to solve this, but it makes updates/gets inefficient. The issue our solution should address is that the id is tightly coupled with routing. Here are a few options to decouple the two while still handling updates and custom document ids.
This solution does not prevent us from using custom doc ids or custom routing - it is just that the optimization does not work when custom doc id or routing is used.
Adaptive shard selection for indexing - To implement this, each shard reports to the coordinator the number of documents indexed with auto-generated ids, its size, and stats such as request queue depth and processing time. Based on this, requests using auto-generated ids can adaptively select the shard a document lands on. Each coordinator calculates a balance score from the auto-id document counts per shard, and once the score breaches a threshold, adaptive shard selection attempts to restore the balance.
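A minimal sketch of the selection logic described above. The class, the stats it consumes, and the score definition (max shard load over the mean) are assumptions for illustration, not existing OpenSearch APIs; the real implementation would pull these counts from shard-level stats on the coordinator.

```java
// Hypothetical coordinator-side selector: tracks auto-id document counts
// per shard and overrides the default routing once imbalance breaches
// a threshold. Names and score formula are assumptions, not OpenSearch code.
class AdaptiveShardSelector {
    private final long[] autoIdDocCounts;   // auto-id docs indexed per shard
    private final double imbalanceThreshold; // e.g. 1.5 = 50% over the mean

    AdaptiveShardSelector(long[] autoIdDocCounts, double imbalanceThreshold) {
        this.autoIdDocCounts = autoIdDocCounts;
        this.imbalanceThreshold = imbalanceThreshold;
    }

    // Balance score: most-loaded shard relative to the mean load.
    // 1.0 means perfectly balanced; higher means more skew.
    double balanceScore() {
        long max = 0, total = 0;
        for (long c : autoIdDocCounts) {
            max = Math.max(max, c);
            total += c;
        }
        double mean = (double) total / autoIdDocCounts.length;
        return mean == 0 ? 1.0 : max / mean;
    }

    // While balanced, keep the default (hash-based) shard; once the score
    // breaches the threshold, route to the least-loaded shard instead.
    int selectShard(int defaultShard) {
        if (balanceScore() <= imbalanceThreshold) {
            return defaultShard;
        }
        int least = 0;
        for (int i = 1; i < autoIdDocCounts.length; i++) {
            if (autoIdDocCounts[i] < autoIdDocCounts[least]) {
                least = i;
            }
        }
        return least;
    }
}
```

The key property is that the override only kicks in past the threshold, so the common balanced case keeps the cheap default routing.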
This feature will be opt-in, and it will be enabled by default for data streams.
When there are multiple parallel bulk requests from a client with no document id, the coordinator does the following:
If any shard is slow, all the bulk requests must wait on the coordinator, and the client is blocked until the coordinator returns the response. Once the response is returned, the client reads the bulk response and retries the remaining requests. Imagine a shard in the INITIALIZING state - all bulk requests will wait on the coordinator and slow down the entire system.
How about the coordinator always sends all documents in a bulk request to one shard, and round-robins the requests across shards? This would reduce the number of requests waiting in the coordinator queue because of a slow shard, and also free up resources on the client side.
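The round-robin proposal above can be sketched as follows; the class and method names are hypothetical, chosen only to show the routing decision in isolation.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the proposal above: every bulk request containing
// only auto-generated ids is sent, in its entirety, to a single shard,
// and successive bulk requests rotate across the shards.
class RoundRobinBulkRouter {
    private final int numShards;
    private final AtomicInteger counter = new AtomicInteger();

    RoundRobinBulkRouter(int numShards) {
        this.numShards = numShards;
    }

    // All documents in one bulk request land on the same shard, so a slow
    // shard only delays the bulk requests routed to it, not every request
    // that happens to contain one document destined for it.
    int shardForBulkRequest() {
        return Math.floorMod(counter.getAndIncrement(), numShards);
    }
}
```

The design choice here is the unit of routing: per bulk request rather than per document, which is what keeps a single slow shard from blocking every in-flight bulk on the coordinator.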
What are the downsides of this approach? Can it result in an imbalance of documents across shards? If a shard is really slow, it would be imbalanced even with a uniform splitting approach, since it would not complete its work in time and would time out on the coordinator.
This optimization will not work for custom document ids or custom routing use cases.