Support searches with punctuation

By default, RediSearch breaks up a string like "active-active" into multiple tokens, splitting on punctuation. For users of redis-sitesearch, this means that if you search for "active-active," you get back results for "active" and not necessarily "active-active."

To support searches like "active-active," we could escape all punctuation when we index documents and escape all punctuation in queries. The problem with this approach is that it indexes every hyphenated term as a single literal token. For example, consider a string like "Flash-based." The default behavior, which tokenizes this string into two tokens, "flash" and "based," is ideal, because this document is probably a good match for searches for "flash." If we escape all punctuation, we'll index a single token, "flash-based," instead of "flash" and "based," and a search for "flash" will no longer match on that term. So in general, supporting punctuation for all queries would reduce accuracy.
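The trade-off can be illustrated with a rough sketch. The `tokenize` function below is a hypothetical stand-in written for this example, not RediSearch's actual tokenizer (which has its own separator set, stop words, and stemming); it only mimics splitting on punctuation versus keeping hyphens.

```python
import re


def tokenize(text, split_on_hyphen=True):
    """Simplified, hypothetical approximation of punctuation-based tokenization."""
    # Default: split on any run of non-word characters, hyphens included.
    # Otherwise: treat hyphens as part of the token, as if they were escaped.
    pattern = r"[^\w]+" if split_on_hyphen else r"[^\w\-]+"
    return [t for t in re.split(pattern, text.lower()) if t]


# Default behavior: "flash" is its own token, so a search for "flash" can match.
default_tokens = tokenize("Flash-based storage")

# Escaping all punctuation: only "flash-based" is indexed, so "flash" is lost.
escaped_tokens = tokenize("Flash-based storage", split_on_hyphen=False)
```

Here `default_tokens` contains "flash" and "based" as separate tokens, while `escaped_tokens` contains only "flash-based" and "storage."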

The solution? Allow sites to configure a list of "literal tokens." If we find any of these tokens in a document while indexing or in a search query, we'll escape the punctuation in that token only. This allows a site to configure "active-active" as a literal token. We'll then escape the term "active-active" as "active\-active" when we index and do the same on queries, allowing those queries to find exactly the documents that contain "active-active." Terms that are not in the list, like "flash-based," are still split into separate tokens as before.
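A minimal sketch of the escaping step, assuming a per-site list of literal tokens and a case-insensitive match. The names `LITERAL_TOKENS`, `escape_punctuation`, and `escape_literal_tokens` are hypothetical and chosen for illustration; they are not necessarily what this PR implements.

```python
import re

# Hypothetical per-site configuration of literal tokens.
LITERAL_TOKENS = ["active-active"]


def escape_punctuation(token):
    """Backslash-escape every character that is not a word character or whitespace."""
    return re.sub(r"([^\w\s])", r"\\\1", token)


def escape_literal_tokens(text, literal_tokens=LITERAL_TOKENS):
    """Escape punctuation only inside configured literal tokens.

    Intended to be applied to both document text at index time and
    query strings at search time, so the two sides agree.
    """
    for token in literal_tokens:
        pattern = re.compile(re.escape(token), re.IGNORECASE)
        # A function replacement avoids backslash processing of the result.
        text = pattern.sub(lambda m: escape_punctuation(m.group(0)), text)
    return text


indexed = escape_literal_tokens("Active-Active replication on flash-based storage")
query = escape_literal_tokens("active-active")
```

In this sketch, only the configured term gains a backslash before its hyphen; "flash-based" is left untouched and would still be split by the default tokenizer.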

redislabs-training / redis-sitesearch

Support searches with punctuation #4