opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0
9.51k stars 1.75k forks source link

[Feature Request] `terms` query's term lookup should be able to efficiently handle 100k+ (or 1M+) terms #12341

Closed msfroh closed 4 weeks ago

msfroh commented 7 months ago

Is your feature request related to a problem? Please describe

I recently read this blog post, where the author claims a 10x speedup on a large terms query by encoding the field values as a roaring bitmap.

I believe that part of the improvement comes from the use of doc values to post-filter hits that come from a lead iterator, which is now the OpenSearch (starting with 2.12) thanks to @harshavamsi's changes in https://github.com/opensearch-project/OpenSearch/pull/11209 to support IndexOrDocValuesQuery for all numeric query types. (Behind the scenes, Lucene implements the DV query using a LongHashSet, which I think should perform similarly to RoaringBitmap.)

The more interesting part (IMO) is that the roaring bitmap of numeric terms gets created on the client and sent as a base64-encoded bitset, where it's used as the doc value filter.

Similarly, we have the terms lookup feature on the terms query, but it's doing a kind-of naive "fetch an array of strings" approach.

My idea is to borrow the roaring bitmap idea from the linked blog post and add that to terms query's lookup.

Describe the solution you'd like

I would like to modify the terms lookup feature to add a new (opt-in) "protocol" between the main index and the term lookup index. The term lookup index should assign consistent, increasing ordinals to the values in the terms field. When the main index queries the term lookup index, it should pass a bitset of the ordinals whose values it "knows" (cached from previous requests). After finding the matching document in the term lookup index, the response should carry a bitset of matching ordinals, along with the ordinal-to-value mapping for any unknown values. This should allow us to carry very large sets of IDs across the index boundary in a compact representation.

As a next step, the term lookup should support multiple lookup keys and Boolean/bitset operations between them. The term lookup index will return the bitset after performing the Boolean operations. (The term lookup index may cache the result of these Boolean operations.)

Related component

Search:Query Capabilities

Describe alternatives you've considered

I drafted (on my computer) a whole idea around a new query type that would work with numeric ID bitsets, either passed in the query or stored in a custom data store (probably implemented as an OpenSearch index, but could be somewhere else).

My concerns with that were:

  1. It forces folks to use numeric IDs, which may not be a viable option.
  2. It would have added lots of new APIs (to organize, create, and manage bitsets), as well as a whole new query type.

Making the existing lookup feature of terms query "smarter" feels like a lot less work for me and for users.

Additional context

As I mentioned to above, I had drafted a proposal on my computer to build a whole dedicated API. While I no longer think that's the right move, my proposal did have some nice examples of possible use-cases:

Example 1 - Digital entitlements

An example would be a digital content entitlement system, with each document in the search index corresponding to a digital product. End-users can purchase access to content. When an end-user searches their library, they should only see content to which they have access. Updating each piece of content whenever a user makes a purchase is not practical, since a single item of content may be owned by many users. Instead, we would like each user

Example 2 - Multi-location retail search

A retail grocery chain offers online ordering and delivery across many store locations. They have a single catalog of products, but each location may carry a different selection of products. Product selection at any store may fluctuate as inventory sells out and new inventory is delivered. When a user in a given city searches for products, they should see items currently available from their local stores. There should be an updatable collection for each store and a user should be able to search across the union of products available from their local stores.

msfroh commented 7 months ago

Of course, if we're concerned about sending massive queries from the client to the server, we can always use the "terms lookup" feature for a terms query.

More broadly, I wonder if there are any other query types where people might send very large sets of data that we could encode more efficiently from the client to the coordinator. Vector search, maybe?

peternied commented 7 months ago

[Triage - attendees 1 2 3 4 5] @msfroh Thanks for creating this issue; however, it isn't being accepted due to not having a clear outcome - seems more like an RFC would be a clearer starting point. Please feel free to open a new issue after addressing the reason.

msfroh commented 4 months ago

I added a clearer outcome to the description and assigned the issue to myself.

ochrist-eis commented 3 months ago

This would be a great enhancement (and differentiator). We, like others, have a use case which does not scale using terms or terms lookup, and joins don't perform well enough either. The ability to quickly test whether an integer record field value is included in a large list of integers (10K..10M in extreme cases) would be extremely valuable and avoid custom plugins. An approach using Roaring Bitmaps with client-side encoded query bitmaps would be even more flexible since one could express arbitrary bitset operations as query constraints and the data would be encoded and transported more efficiently.

bowenlan-amzn commented 3 months ago

I'm starting to dive into this issue and taking guidance from @msfroh.