redislabs-training / redis-sitesearch

Real-time search and indexing for any website
MIT License

Support searches with punctuation #4

Closed abrookins closed 3 years ago

abrookins commented 3 years ago

By default, RediSearch breaks up a string like "active-active" into multiple tokens, splitting on punctuation. For users of redis-sitesearch, this means that if you search for "active-active," you get back results for "active" and not necessarily "active-active."
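To make the default behavior concrete, here is a rough sketch that only approximates it. It is not RediSearch's actual tokenizer: it assumes that any run of characters other than letters, digits, and underscores acts as a separator, and that tokens are lowercased. The function name `approximate_tokens` is made up for illustration.

```python
import re


def approximate_tokens(text: str) -> list[str]:
    """Approximate default tokenization (an assumption, not the real tokenizer).

    Splits on runs of characters that are not ASCII letters, digits, or
    underscores, drops empty pieces, and lowercases what remains.
    """
    pieces = re.split(r"[^0-9A-Za-z_]+", text)
    return [piece.lower() for piece in pieces if piece]


if __name__ == "__main__":
    # The hyphen acts as a separator, so the term becomes two tokens.
    print(approximate_tokens("active-active"))
    print(approximate_tokens("Flash-based"))
```

Under this approximation, a query for "active-active" is effectively a query for the token "active," which is why the results are not limited to documents containing the hyphenated term.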

To support searches like "active-active," we could escape all punctuation both when we index documents and when we process queries. The problem with this approach is that it indexes every hyphenated term as a single literal token. For example, consider a string like "Flash-based." The default behavior, which tokenizes this string into two tokens, "flash" and "based," is ideal, because this document is probably a good match for searches of "flash." If we escape all punctuation, we'll instead index a single token, "flash-based," and an exact search for "flash" would no longer match that token. So in general, supporting punctuation for all queries would reduce accuracy.

The solution? Allow sites to configure a list of "literal tokens." If we find any of these tokens in a document while indexing or in a search query, we'll escape the punctuation in that token only and leave all other text to the default tokenizer. This allows a site to configure "active-active" as a literal token. We'll then escape the term "active-active" as "active\-active" when we index and do the same on queries, allowing those queries to find exactly the documents that contain "active-active."
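A minimal sketch of that idea, assuming the literal tokens arrive as a plain list of strings and that escaping means prefixing each punctuation character with a backslash. The names `LITERAL_TOKENS`, `escape_punctuation`, and `escape_literal_tokens` are hypothetical and are not taken from the redis-sitesearch codebase.

```python
import re

# Hypothetical per-site configuration: terms whose punctuation must survive.
LITERAL_TOKENS = ["active-active"]


def escape_punctuation(token: str) -> str:
    """Prefix every character that is not a word character or whitespace
    with a backslash, e.g. a hyphen becomes a backslash plus a hyphen."""
    return re.sub(r"([^\w\s])", r"\\\1", token)


def escape_literal_tokens(text: str, literal_tokens=LITERAL_TOKENS) -> str:
    """Escape punctuation only inside configured literal tokens.

    All other text is returned unchanged, so terms such as "Flash-based"
    are still split by the default tokenizer.
    """
    for token in literal_tokens:
        # Match the configured token case-insensitively, wherever it occurs.
        pattern = re.compile(re.escape(token), re.IGNORECASE)
        # A callable replacement avoids backslash handling in templates.
        text = pattern.sub(lambda m: escape_punctuation(m.group(0)), text)
    return text


if __name__ == "__main__":
    # The same function would run on document text and on query strings.
    print(escape_literal_tokens("Active-Active replication on Flash-based storage"))
```

Running the same function over both document text at index time and the user's query keeps the two sides consistent. This sketch does not handle literal tokens that overlap one another, which a real implementation would need to consider.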