sphinx-doc / sphinx

The Sphinx documentation generator
https://www.sphinx-doc.org/

[HTML search] optimization: don't loop over all document terms and title terms during partial-matching. #12045

Closed jayaddison closed 1 month ago

jayaddison commented 7 months ago

Is your feature request related to a problem? Please describe.

There seems to be a potentially large inefficiency in the Sphinx JavaScript search code: for any query word longer than two characters (likely to be >90% of them, I'd guess!), we iterate through all document terms and all title terms in the client's search index to check for partial matches.

However, if an exact match was already found on the query word, then none of those candidates are added, even when they do match. That means we spend JavaScript compute time iterating through items that are never used. Edit: since Sphinx 7.3.0, this is no longer true, thanks to #11958 -- if an exact match on a stemmed term is found, we skip the partial-match checks for that query term.

The relevant code is found here: https://github.com/sphinx-doc/sphinx/blob/cf7d2759af0852d67288e58d823d51fe860749ca/sphinx/themes/basic/static/searchtools.js#L469-L476
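
For orientation, the pattern in question looks roughly like this -- a Python paraphrase of the JavaScript loop, using illustrative names rather than the actual identifiers from searchtools.js:

```python
# Rough paraphrase of the partial-matching behaviour described above
# (illustrative only; the real implementation is JavaScript in searchtools.js).
def partial_matches(word, terms, title_terms):
    matches = []
    if len(word) > 2:  # only query words longer than two characters
        # Linear scan over every indexed term, even when an exact match exists:
        for term in list(terms) + list(title_terms):
            if word in term and term != word:  # partial, but not exact, match
                matches.append(term)
    return matches
```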

I don't have any stats on the performance impact of this, but it feels like it could be significant, especially for large documentation projects. Edit: initial profiling indicates a 9ms vs 48ms difference on a local test machine (see the traces below).


jayaddison commented 7 months ago

Resolved by #11958.

jayaddison commented 2 months ago

Re-opening because I'm puzzling over this potential performance problem. Some thoughts:

Edit: and a second item questioning how often the problem occurs in practice.

wlach commented 2 months ago

Out of curiosity, I ran a search through Chrome's devtools profiler and here were the results:

[screenshot: Chrome DevTools profiler trace for a search on https://docs.python.org/3.13/search.html?q=asyncio]

The JavaScript portion of the search is around 16ms on my Dell i7. No doubt it would be a little slower on older machines, but I doubt by much.

A more significant optimization would likely be to fetch results for the search previews in parallel. That seems to be the main bottleneck to showing a full result.

jayaddison commented 2 months ago

Thanks @wlach - I'll try to do some performance profiling soon too. Could you share the query you used when collecting that profile trace? (it seems that the Python 3[.12?] docs were the dataset?)

Some thoughts:

I also had a weird idea about substring matching: if we sorted the characters within each term (so foobar => abfoor), and then stored the original tokens only when there is more than one source word for a sorted key (so abfoor => [1, 2, 5], but loop => {pool: [2, 3], loop: [3, 4]}), then I think we could achieve fairly efficient inner-string substring matches using binary search, and hopefully not bloat the index size too much.
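
A minimal sketch of that structure, assuming some hypothetical posting lists:

```python
from collections import defaultdict

# Hypothetical posting lists for three indexed terms.
postings = {"foobar": [1, 2, 5], "pool": [2, 3], "loop": [3, 4]}

# Group terms by their sorted-characters key.
grouped = defaultdict(dict)
for term, docs in postings.items():
    grouped["".join(sorted(term))][term] = docs

# Collapse unambiguous keys to a bare posting list, as described above.
lookaside = {key: (next(iter(tokens.values())) if len(tokens) == 1 else tokens)
             for key, tokens in grouped.items()}

print(lookaside)
# {'abfoor': [1, 2, 5], 'loop': {'pool': [2, 3], 'loop': [3, 4]}}
```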

AA-Turner commented 2 months ago

Could you share the query you used when collecting that profile trace?

From the URL in the screenshot: https://docs.python.org/3.13/search.html?q=asyncio

jayaddison commented 2 months ago

Thank you @AA-Turner!

So for that example, the code path I'm wary of would not currently be followed, because an exact match does exist:

>>> import json
>>> index = json.loads(open('searchindex.json', 'r').read())  # data from searchindex.js:setIndex
>>> len(index['terms'])
40195
>>> 'asyncio' in index['terms']
True
>>>

That's good news. What I'd like to examine next is the cost of the substring matching (complete terms iteration/matching) for substrings that don't exist as terms in the index.

jayaddison commented 2 months ago

That's good news. What I'd like to examine next is the cost of the substring matching (complete terms iteration/matching) for substrings that don't exist as terms in the index.

For me on an RPi400 (Cortex-A72), the difference within performTermsSearch is fairly significant between an exact match (asyncio - not expected to loop over all 40k terms) and a substring match (synci - expected to loop over the full 40k):

Query for asyncio

[screenshot: profiler trace] Traced running time: 9.0ms

Query for synci

[screenshot: profiler trace] Traced running time: 48.0ms

Some repeated experimental trials would help to confirm this.

jayaddison commented 2 months ago

I also had a weird idea about substring matching: if we sorted the characters within each term (so foobar => abfoor), and then stored the original tokens only when there is more than one source word for a sorted key (so abfoor => [1, 2, 5], but loop => {pool: [2, 3], loop: [3, 4]}), then I think we could achieve fairly efficient inner-string substring matches using binary search, and hopefully not bloat the index size too much.

For the Python documentation sources, I anticipate the size increase in the Sphinx search index to support the proposed efficient substring matching approach above would be ~25%, based on:

>>> index = json.loads(open('searchindex.json', 'r').read())  # data from searchindex.js:setIndex
>>> len(index['terms'])
40195
>>> len(set("".join(sorted(term)) for term in index['terms']))
30881

jayaddison commented 2 months ago

I also had a weird idea about substring matching: if we sorted the characters within each term (so foobar => abfoor), and then stored the original tokens only when there is more than one source word for a sorted key (so abfoor => [1, 2, 5], but loop => {pool: [2, 3], loop: [3, 4]}), then I think we could achieve fairly efficient inner-string substring matches using binary search, and hopefully not bloat the index size too much.


Agh, ignore me. This idea for an index structure doesn't help with the inner-string substring matching problem at all, because the substring isn't guaranteed to contain the term's lowest-ordinal characters -- the needle that we'd use to begin and navigate the binary search stage.
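
Concretely, with the earlier foobar example:

```python
# Why a prefix-oriented binary search over sorted keys can't find substrings:
term = "foobar"
key = "".join(sorted(term))    # "abfoor" -- the stored sorted key
sub = "oob"                    # a genuine substring of "foobar"
needle = "".join(sorted(sub))  # "boo"
print(key.startswith(needle))  # False: the needle is not a prefix of the key
```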

Back to the drawing board for now... ngrams do seem to be a valid alternative here.

jayaddison commented 2 months ago

Ok, here's an alternative strategy.

The algorithm to perform the substring matching itself would be to iterate (in JavaScript) over the trigrams of the query term, successively intersecting the sets of terms found in the ngram-term lookaside trie, until either the intersection becomes empty (no results) or we reach the end of the query term.
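
A minimal sketch of that lookup, with a plain dict from trigram to term IDs standing in for the lookaside trie (all names here are hypothetical):

```python
def build_trigram_index(terms):
    """Map each 3-gram to the set of term IDs that contain it."""
    index = {}
    for term_id, term in enumerate(terms):
        for i in range(len(term) - 2):
            index.setdefault(term[i:i + 3], set()).add(term_id)
    return index

def substring_candidates(query, index):
    """Successively intersect the candidate sets for each query trigram."""
    candidates = None
    for i in range(len(query) - 2):
        matched = index.get(query[i:i + 3], set())
        candidates = matched if candidates is None else candidates & matched
        if not candidates:
            return set()  # empty intersection: no results
    return candidates or set()

terms = ["asyncio", "synchronous", "loop"]
index = build_trigram_index(terms)
print(substring_candidates("synci", index))  # {0}: only "asyncio" has syn, ync and nci
```

Note that candidates found this way would still need a final substring check, since a term could contain all of the query's trigrams without containing them contiguously.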

I think it would be worthwhile to evaluate the technique on the terms index before deciding whether to also apply it to the titles index (generally much smaller, so less likely to exhibit noticeable performance improvements).

Edit: clarification for querying algorithm. Edit: s/notable/noticeable

electric-coder commented 2 months ago

Would an ideal solution be a suffix tree or even a suffix automaton? Because:

* For _search-as-you-type_, ideally you wouldn't want to limit users to having to start at the beginning of the string.
* Nor would you want to limit users' searches to a fixed-size ngram (they might want to search for whole words, or long substrings starting in the middle of the word).
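
For illustration, a toy suffix-array variant of the idea (much simpler than a true suffix tree or automaton): index every suffix of every term, so that any substring query becomes a prefix lookup via binary search.

```python
import bisect

# Every substring of a term is a prefix of one of that term's suffixes,
# so a sorted list of suffixes supports binary-searched substring lookup.
terms = ["asyncio", "loop", "pool"]
suffixes = sorted((term[i:], term_id)
                  for term_id, term in enumerate(terms)
                  for i in range(len(term)))
keys = [suffix for suffix, _ in suffixes]

def substring_match(query):
    pos = bisect.bisect_left(keys, query)
    matched = set()
    while pos < len(keys) and keys[pos].startswith(query):
        matched.add(terms[suffixes[pos][1]])
        pos += 1
    return matched

print(substring_match("ync"))  # {'asyncio'}: a mid-word match via prefix lookup
```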

jayaddison commented 2 months ago

Would an ideal solution be a suffix tree or even a suffix automaton? Because:

Possibly? :smile: I'm not familiar with either of those, but I'll read about them. There are going to be some tradeoffs with any search approach, I think. Always good to learn about additional options, though.

* For _search-as-you-type_, ideally you wouldn't want to limit users to having to start at the beginning of the string.

That seems true, yep. The ngram-index implementation in #12596 isn't limited to prefix matches, incidentally -- there are ngrams in there that begin in the middle of strings and could therefore be used by autosuggest for those kinds of queries.

* Nor would you want to limit users' searches to a fixed-size ngram (they might want to search for whole words, or long substrings starting in the middle of the word).

We definitely don't want to be too restrictive about query words/terms. However, using 3-grams doesn't rule out query matching for words of length 4, for example. If the word abcd exists in the document sources, it can become the 3-grams abc and bcd. If a user types the query abcd, we can check whether both abc and bcd exist (they do, in this case), and if so we can perform further checks to see whether they point to a matching document/term (again, in this case they do).
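
A tiny self-contained check of that abcd example:

```python
def trigrams(s):
    """All contiguous 3-grams of a string."""
    return {s[i:i + 3] for i in range(len(s) - 2)}

indexed = trigrams("abcd")          # {"abc", "bcd"} from the document word
print(trigrams("abcd") <= indexed)  # True: the length-4 query is still findable
print(trigrams("abce") <= indexed)  # False: "bce" was never indexed
```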

jayaddison commented 1 month ago

I'd like to revisit the JavaScript search code in future, but this bug report seems to have been overly specific (there haven't been reports of performance issues with this particular area of the code). Closing this as wontfix/not-planned.