Closed jed326 closed 1 year ago
`shard_size` and `shard_min_doc_count` are being evaluated at the slice level, which is leading to unexpected results and test failures.
`buildAggregations` creates a priority queue with size `min(buckets, shard_size)`:
https://github.com/opensearch-project/OpenSearch/blob/8eea7b986453271cd227a7b98ee09e70d5e28634/server/src/main/java/org/opensearch/search/aggregations/bucket/terms/NumericTermsAggregator.java#L182-L183
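As a rough illustration of that priority-queue pattern, here is a self-contained Java sketch. The `Bucket` record and `topBuckets` helper are invented for this example and are not OpenSearch classes:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Simplified sketch of the pattern linked above: a slice keeps only its
// top buckets in a priority queue sized min(collected buckets, shard_size).
public class TopBuckets {
    record Bucket(String term, long docCount) {}

    static List<Bucket> topBuckets(List<Bucket> collected, int shardSize) {
        int size = Math.min(collected.size(), shardSize);
        // Min-heap on docCount: once full, each insert evicts the current smallest.
        PriorityQueue<Bucket> pq =
            new PriorityQueue<>(Math.max(1, size), Comparator.comparingLong(Bucket::docCount));
        for (Bucket b : collected) {
            pq.offer(b);
            if (pq.size() > size) {
                pq.poll();
            }
        }
        List<Bucket> result = new ArrayList<>(pq);
        result.sort(Comparator.comparingLong(Bucket::docCount).reversed());
        return result;
    }

    public static void main(String[] args) {
        List<Bucket> collected = List.of(new Bucket("a", 5), new Bucket("b", 3), new Bucket("c", 9));
        System.out.println(topBuckets(collected, 2)); // keeps c=9 and a=5
    }
}
```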
`shard_size` is being evaluated at the slice level. This manifests in 2 ways: 1) the shards can return up to `shard_size * slice_count` buckets to the coordinator, so `shard_size` is not actually being enforced, and 2) the `doc_count` of collected buckets may be inaccurate because the same term may exist on multiple slices, but due to the `shard_size` limitation the same term bucket is not collected on all slices.

`shard_min_doc_count` is also evaluated per slice: a bucket is created for a term when the slice-level doc count meets `shard_min_doc_count`, however if that's not true for a given slice then a bucket is not created for the term on that slice.
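Both manifestations can be reproduced with a small standalone simulation. This is illustrative Java, not OpenSearch code; the slice contents and helper names are made up. With two slices and `shard_size = 2`, the shard emits 4 buckets (up to `shard_size * slice_count`) and term `a` loses the doc it had on slice 2:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class SliceTruncation {
    // Keep only the top `limit` terms of a term -> docCount map, by docCount.
    static Map<String, Long> topN(Map<String, Long> counts, int limit) {
        return counts.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
            .limit(limit)
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }

    // Merge per-slice results by summing doc counts of identical terms.
    static Map<String, Long> merge(List<Map<String, Long>> slices) {
        Map<String, Long> merged = new HashMap<>();
        for (Map<String, Long> slice : slices) {
            slice.forEach((term, count) -> merged.merge(term, count, Long::sum));
        }
        return merged;
    }

    public static void main(String[] args) {
        int shardSize = 2;
        Map<String, Long> slice1 = Map.of("a", 10L, "b", 9L, "c", 8L);
        Map<String, Long> slice2 = Map.of("c", 10L, "d", 9L, "a", 1L);

        // shard_size applied per slice: 4 buckets escape the shard, and
        // "a" reports doc_count 10 instead of the true 11, because "a"
        // did not make slice2's top 2.
        Map<String, Long> perSlice = merge(List.of(topN(slice1, shardSize), topN(slice2, shardSize)));
        System.out.println(perSlice);

        // shard_size applied once at the shard level: 2 buckets, exact counts.
        Map<String, Long> perShard = topN(merge(List.of(slice1, slice2)), shardSize);
        System.out.println(perShard);
    }
}
```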
I have a few high level solutions in mind. The big problem is with `shard_size`. We should have some sort of limit on the number of buckets a slice can collect, but if we ignore `shard_size` at the slice level then it could grow unbounded. However, introducing any sort of slice level limit on bucket count (let's call this `slice_size` for now) can result in behavioral changes between concurrent and non-concurrent search use cases if the number of ordinals >> the `slice_size`. Specifically, the `doc_count` could be inaccurate and missing some docs from some slices. This is analogous to how `shard_size` behaves in the 1 shard vs multiple shards case. This is true even if we introduce slice level settings, which I don't believe we should do, because the user doesn't have any control or knowledge of how documents are distributed over slices, so these controls don't really mean anything to them.
Essentially there are a few different ways we can address both `shard_min_doc_count` and `shard_size`, and we can combine the solutions in different ways.
`shard_min_doc_count`:

`shard_size`:

… `shard_size` at the slice level, we use some default higher than `shard_size`. For example, `shard_size` defaults to `1.2*size + 10`.

Currently my recommendation is [3] for `shard_min_doc_count` and [3] for `shard_size`. I'm split on whether we should reduce the number of buckets down to `shard_size` at the shard level reduce. The intention of `shard_size` is to give users a way to get more accurate results, so with solution [3] for `shard_size` I don't see a purpose in doing the reduce here.
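For reference, the default relationship can be sketched as a one-liner. This only illustrates the `1.2*size + 10` figure quoted in this comment; the exact rounding OpenSearch applies may differ:

```java
public class ShardSizeDefault {
    // Illustrates the `1.2 * size + 10` default quoted above; any slice-level
    // variant would presumably use a larger multiplier or constant.
    static int defaultShardSize(int size) {
        return (int) (1.2 * size) + 10;
    }

    public static void main(String[] args) {
        System.out.println(defaultShardSize(10)); // 22
    }
}
```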
@sohami @reta @andrross Would like to get your input on this. It seems difficult to completely preserve the behavior between concurrent and non-concurrent search cases here.
@jed326 Thanks for the analysis.
> The big problem is with `shard_size`. We should have some sort of limit on the number of buckets a slice can collect, but if we ignore `shard_size` at the slice level then it could grow unbounded.
I didn't understand what you mean by "it could grow unbounded". For the sequential case as well, the `shard_size` limit is applied after document collection is done on the shards and before returning the output (i.e. in `buildAggregation`). So if we ignore the `shard_size` parameter at the slice level but apply it before sending the results to the coordinator, then the same bounds can still be applied.

I think for both of the shard level parameters, if we ignore these at the slice level and then apply them as part of the reduce, it should work as expected. What are the challenges for `shard_size` option 4?
> I didn't understand what you mean by "it could grow unbounded". For the sequential case as well, the `shard_size` limit is applied after document collection is done on the shards and before returning the output (i.e. in `buildAggregation`).
You are right that at the shard level the same bounds would still be applied, so from the coordinator's perspective concurrent and non-concurrent search would look the same. However, the unbounded growth I am concerned about here is basically the size of the priority queue at the slice level, which is also the number of buckets being returned at the slice level. Today this is bounded by the `shard_size` parameter, but if we ignore that then each slice would return a bucket for every single ordinal, which looks like a serious scaling issue to me. I think this is analogous to why the `shard_size` parameter exists at all, instead of just returning every single bucket to the coordinator and doing the `size` bound there.
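A toy model of this scaling concern: with a slice-level limit, each slice hands back at most that many buckets, but without one the slice returns a bucket per distinct ordinal, so the cost grows with field cardinality. Hypothetical code, not OpenSearch internals:

```java
import java.util.List;
import java.util.OptionalInt;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class SliceBucketCount {
    // Number of buckets a slice hands back: capped by the slice-level limit
    // when one exists, otherwise one bucket per distinct term.
    static int bucketsReturned(List<String> sliceTerms, OptionalInt sliceLimit) {
        int distinct = (int) sliceTerms.stream().distinct().count();
        return sliceLimit.isPresent() ? Math.min(distinct, sliceLimit.getAsInt()) : distinct;
    }

    public static void main(String[] args) {
        // A high-cardinality field: every doc on this slice has a unique term.
        List<String> terms = IntStream.range(0, 1_000_000)
            .mapToObj(i -> "term" + i)
            .collect(Collectors.toList());
        System.out.println(bucketsReturned(terms, OptionalInt.of(510)));  // 510
        System.out.println(bucketsReturned(terms, OptionalInt.empty())); // 1000000
    }
}
```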
Subtask for https://github.com/opensearch-project/OpenSearch/issues/7357 to focus on the test failures related to the shard size parameter.