sphinx-doc / sphinx

The Sphinx documentation generator
https://www.sphinx-doc.org/

[HTML search] optimization: don't loop over all document terms and title terms during partial-matching. #12045

Closed jayaddison closed 1 month ago

jayaddison commented 7 months ago

Is your feature request related to a problem? Please describe.

There seems to be a potentially large inefficiency in the Sphinx JavaScript search code: for any query word longer than two characters (likely to be >90% of them, I'd guess!), we iterate through all document terms and all title terms in the client's search index to check for partial matches.

However, if an exact match was already found on the query word, then none of those candidates are added, even when they do match. That means we spend JavaScript compute time iterating through items that are never used. Edit: since Sphinx 7.3.0, this is no longer true, thanks to #11958 -- if an exact match on a stemmed term is found, we skip the partial-match checks for that query term.

The relevant code is found here: https://github.com/sphinx-doc/sphinx/blob/cf7d2759af0852d67288e58d823d51fe860749ca/sphinx/themes/basic/static/searchtools.js#L469-L476
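
For orientation, the pattern in question looks roughly like this -- a Python paraphrase of the JavaScript loop, using illustrative names rather than the actual identifiers from searchtools.js:

```python
# Rough paraphrase of the partial-matching behaviour described above
# (illustrative only; the real implementation is JavaScript in searchtools.js).
def partial_matches(word, terms, title_terms):
    matches = []
    if len(word) > 2:  # only query words longer than two characters
        # Linear scan over every indexed term, even when an exact match exists:
        for term in list(terms) + list(title_terms):
            if word in term and term != word:  # partial, but not exact, match
                matches.append(term)
    return matches
```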

I don't have any stats on the performance impact of this, but it feels like it could be significant, especially for large documentation projects. Edit: initial profiling indicates a 9ms vs 48ms difference on a local test machine (see the traces below).


jayaddison commented 7 months ago

Resolved by #11958.

jayaddison commented 2 months ago

Re-opening because I'm puzzling over this potential performance problem. Some thoughts:

Edit: and a second item questioning how often the problem occurs in practice.

wlach commented 2 months ago

Out of curiosity, I ran a search through Chrome's devtools profiler and here were the results:

[screenshot: Chrome DevTools profiler trace for a search on https://docs.python.org/3.13/search.html?q=asyncio]

The JavaScript portion of the search is around 16ms on my Dell i7. No doubt it would be a little slower on older machines, but I doubt by much.

A more significant optimization would likely be to fetch results for the search previews in parallel. That seems to be the main bottleneck to showing a full result.

jayaddison commented 2 months ago

Thanks @wlach - I'll try to do some performance profiling soon too. Could you share the query you used when collecting that profile trace? (it seems that the Python 3[.12?] docs were the dataset?)

Some thoughts:

I also had a weird idea about substring matching: if we sorted the characters within each term (so foobar => abfoor), and then stored the original tokens only when there is more than one source word for a sorted key (so abfoor => [1, 2, 5], but loop => {pool: [2, 3], loop: [3, 4]}), then I think we could achieve fairly efficient inner-string substring matches using binary search, and hopefully not bloat the index size too much.
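
A minimal sketch of that structure, assuming some hypothetical posting lists:

```python
from collections import defaultdict

# Hypothetical posting lists for three indexed terms.
postings = {"foobar": [1, 2, 5], "pool": [2, 3], "loop": [3, 4]}

# Group terms by their sorted-characters key.
grouped = defaultdict(dict)
for term, docs in postings.items():
    grouped["".join(sorted(term))][term] = docs

# Collapse unambiguous keys to a bare posting list, as described above.
lookaside = {key: (next(iter(tokens.values())) if len(tokens) == 1 else tokens)
             for key, tokens in grouped.items()}

print(lookaside)
# {'abfoor': [1, 2, 5], 'loop': {'pool': [2, 3], 'loop': [3, 4]}}
```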

AA-Turner commented 2 months ago

Could you share the query you used when collecting that profile trace?

From the URL in the screenshot: https://docs.python.org/3.13/search.html?q=asyncio

jayaddison commented 2 months ago

Thank you @AA-Turner!

So for that example, the code path I'm wary of would not currently be followed, because an exact match does exist:

>>> import json
>>> index = json.loads(open('searchindex.json', 'r').read())  # data from searchindex.js:setIndex
>>> len(index['terms'])
40195
>>> 'asyncio' in index['terms']
True
>>>

That's good news. What I'd like to examine next is the cost of the substring matching (complete terms iteration/matching) for substrings that don't exist as terms in the index.

jayaddison commented 2 months ago

That's good news. What I'd like to examine next is the cost of the substring matching (complete terms iteration/matching) for substrings that don't exist as terms in the index.

For me on an RPi400 (Cortex-A72), the difference within performTermsSearch is fairly significant between an exact match (asyncio - not expected to loop over all 40k terms) and a substring match (synci - expected to loop over the full 40k):

Query for asyncio

[screenshot: profiler trace] Traced running time: 9.0ms

Query for synci

[screenshot: profiler trace] Traced running time: 48.0ms

Some repeated experimental trials would help to confirm this.

jayaddison commented 2 months ago

I also had a weird idea about substring matching: if we sorted the characters within each term (so foobar => abfoor), and then stored the original tokens only when there is more than one source word for a sorted key (so abfoor => [1, 2, 5], but loop => {pool: [2, 3], loop: [3, 4]}), then I think we could achieve fairly efficient inner-string substring matches using binary search, and hopefully not bloat the index size too much.

For the Python documentation sources, I anticipate the size increase in the Sphinx search index to support the proposed efficient substring matching approach above would be ~25%, based on:

>>> index = json.loads(open('searchindex.json', 'r').read())  # data from searchindex.js:setIndex
>>> len(index['terms'])
40195
>>> len(set("".join(sorted(term)) for term in index['terms']))
30881

jayaddison commented 2 months ago

I also had a weird idea about substring matching: if we sorted the characters within each term (so foobar => abfoor), and then stored the original tokens only when there is more than one source word for a sorted key (so abfoor => [1, 2, 5], but loop => {pool: [2, 3], loop: [3, 4]}), then I think we could achieve fairly efficient inner-string substring matches using binary search, and hopefully not bloat the index size too much.


Agh, ignore me. This idea for an index structure doesn't help with the inner-string substring matching problem at all, because the substring isn't guaranteed to contain the term's lowest-ordinal characters -- the needle that we'd use to begin and navigate the binary search stage.
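
Concretely, with the earlier foobar example:

```python
# Why a prefix-oriented binary search over sorted keys can't find substrings:
term = "foobar"
key = "".join(sorted(term))    # "abfoor" -- the stored sorted key
sub = "oob"                    # a genuine substring of "foobar"
needle = "".join(sorted(sub))  # "boo"
print(key.startswith(needle))  # False: the needle is not a prefix of the key
```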

Back to the drawing board for now... ngrams do seem to be a valid alternative here.

jayaddison commented 2 months ago

Ok, here's an alternative strategy.

The algorithm to perform the substring matching itself would be to iterate (in JavaScript) over the trigrams of the query term, successively intersecting the sets of terms found in the ngram-term lookaside trie, until either the intersection becomes empty (no results) or we reach the end of the query term.
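
A minimal sketch of that lookup, with a plain dict from trigram to term IDs standing in for the lookaside trie (all names here are hypothetical):

```python
def build_trigram_index(terms):
    """Map each 3-gram to the set of term IDs that contain it."""
    index = {}
    for term_id, term in enumerate(terms):
        for i in range(len(term) - 2):
            index.setdefault(term[i:i + 3], set()).add(term_id)
    return index

def substring_candidates(query, index):
    """Successively intersect the candidate sets for each query trigram."""
    candidates = None
    for i in range(len(query) - 2):
        matched = index.get(query[i:i + 3], set())
        candidates = matched if candidates is None else candidates & matched
        if not candidates:
            return set()  # empty intersection: no results
    return candidates or set()

terms = ["asyncio", "synchronous", "loop"]
index = build_trigram_index(terms)
print(substring_candidates("synci", index))  # {0}: only "asyncio" has syn, ync and nci
```

Note that candidates found this way would still need a final substring check, since a term could contain all of the query's trigrams without containing them contiguously.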

I think it would be worthwhile to evaluate the technique on the terms index before deciding whether to also apply it to the titles index (generally much smaller, so less likely to exhibit noticeable performance improvements).

Edit: clarification for querying algorithm. Edit: s/notable/noticeable

electric-coder commented 2 months ago

Would an ideal solution be a suffix tree or even a suffix automaton? Because:

* For _search-as-you-type_, ideally you wouldn't want to limit users to having to start at the beginning of the string.
* Nor would you want to limit users' searches to a fixed-size ngram (they might want to search for whole words, or long substrings starting in the middle of the word).
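
For illustration, a toy suffix-array variant of the idea (much simpler than a true suffix tree or automaton): index every suffix of every term, so that any substring query becomes a prefix lookup via binary search.

```python
import bisect

# Every substring of a term is a prefix of one of that term's suffixes,
# so a sorted list of suffixes supports binary-searched substring lookup.
terms = ["asyncio", "loop", "pool"]
suffixes = sorted((term[i:], term_id)
                  for term_id, term in enumerate(terms)
                  for i in range(len(term)))
keys = [suffix for suffix, _ in suffixes]

def substring_match(query):
    pos = bisect.bisect_left(keys, query)
    matched = set()
    while pos < len(keys) and keys[pos].startswith(query):
        matched.add(terms[suffixes[pos][1]])
        pos += 1
    return matched

print(substring_match("ync"))  # {'asyncio'}: a mid-word match via prefix lookup
```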

jayaddison commented 2 months ago

Would an ideal solution be a suffix tree or even a suffix automaton? Because:

Possibly? :smile: I'm not familiar with either of those, but I'll read about them. There are going to be some tradeoffs with any search approach, I think. Always good to learn about additional options, though.

* For _search-as-you-type_, ideally you wouldn't want to limit users to having to start at the beginning of the string.

That seems true, yep. The ngram-index implementation in #12596 isn't limited to prefix matches, incidentally -- there are ngrams in there that begin in the middle of strings and could therefore be used by autosuggest for those kinds of queries.

* Nor would you want to limit users' searches to a fixed-size ngram (they might want to search for whole words, or long substrings starting in the middle of the word).

We definitely don't want to be too restrictive about query words/terms. However, using 3-grams doesn't rule out query matching for words of length 4, for example. If the word abcd exists in the document sources, it can become the 3-grams abc and bcd. If a user types the query abcd, we can check whether both abc and bcd exist (they do, in this case), and if so we can perform further checks to see whether they point to a matching document/term (again, in this case they do).
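
A tiny self-contained check of that abcd example:

```python
def trigrams(s):
    """All contiguous 3-grams of a string."""
    return {s[i:i + 3] for i in range(len(s) - 2)}

indexed = trigrams("abcd")          # {"abc", "bcd"} from the document word
print(trigrams("abcd") <= indexed)  # True: the length-4 query is still findable
print(trigrams("abce") <= indexed)  # False: "bce" was never indexed
```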

jayaddison commented 1 month ago

I'd like to revisit the JavaScript search code in future, but this bug report seems to have been overly specific (there haven't been reports of performance issues with this particular area of the code). Closing this as wontfix/not-planned.